Matches in ScholarlyData for { <https://w3id.org/scholarlydata/inproceedings/www2008/paper/597> ?p ?o. }
Showing items 1 to 12 of 12 with 100 items per page.
- 597 creator christopher-olston.
- 597 creator sandeep-pandey.
- 597 type InProceedings.
- 597 label "Recrawl Scheduling Based on Information Longevity".
- 597 sameAs 597.
- 597 abstract "It is crucial for a web crawler to distinguish between ephemeral and persistent content. It is usually not worth crawling ephemeral content (e.g., quote of the day), because by the time it reaches the index it is no longer representative of the web page from which it was acquired. On the other hand, content that persists across multiple page updates (e.g., recent blog postings) may be worth acquiring, because it matches the page's true content for a sustained period of time. In this paper we characterize the longevity of information found on the web, via both empirical measurements and a generative model that coincides with these measurements. We then develop new recrawl scheduling policies that take longevity into account. As we show via experiments over real web data, our policies obtain better freshness at lower cost, compared with previous approaches.".
- 597 hasAuthorList authorList.
- 597 hasTopic World_Wide_Web.
- 597 isPartOf proceedings.
- 597 keyword "crawling".
- 597 keyword "information longevity".
- 597 title "Recrawl Scheduling Based on Information Longevity".
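
The pattern `{ <…/paper/597> ?p ?o. }` fixes the subject and leaves predicate and object as variables, so every triple about paper 597 matches. Below is a minimal sketch of that matching logic over a handful of the triples listed above. It assumes the shortened names shown on this page rather than full IRIs, and the `match` helper and `TRIPLES` list are hypothetical names introduced for illustration, not part of ScholarlyData or any library.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

# A subset of the triples listed above, using the page's shortened names
# (assumption: real data would use full IRIs instead of these labels).
TRIPLES: List[Triple] = [
    ("597", "creator", "christopher-olston"),
    ("597", "creator", "sandeep-pandey"),
    ("597", "type", "InProceedings"),
    ("597", "keyword", "crawling"),
    ("597", "keyword", "information longevity"),
    ("597", "title", "Recrawl Scheduling Based on Information Longevity"),
]


def match(
    triples: List[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> List[Triple]:
    """Return the triples that fit the pattern; None plays the role of a variable."""
    return [
        t
        for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Equivalent of { 597 ?p ?o } over the subset: all six triples match.
all_about_paper = match(TRIPLES, s="597")

# Narrowing the pattern to { 597 creator ?o } keeps only the two authors.
creators = [obj for _, _, obj in match(TRIPLES, s="597", p="creator")]
```

Fixing more positions of the pattern narrows the result, which is why the subject-only pattern on this page returns all 12 items.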