Matches in ScholarlyData for { <https://w3id.org/scholarlydata/inproceedings/www2007/paper/main/592> ?p ?o. }
Showing items 1 to 13 of 13 with 100 items per page.
- 592 creator andrew-tomkins.
- 592 creator anirban-dasgupta.
- 592 creator arpita-ghosh.
- 592 creator christopher-olston.
- 592 creator ravi-kumar.
- 592 creator sandeep-pandey.
- 592 type InProceedings.
- 592 label "The Discoverability of the Web".
- 592 sameAs 592.
- 592 abstract "Previous studies have highlighted the rapidity with which new content arrives on the web. We study the extent to which this new content can be efficiently discovered in the crawling model. Our study has two parts. First, we employ a maximum cover formulation to study the inherent difficulty of the problem in a setting in which we have perfect estimates of likely sources of links to new content. Second, we relax the requirement of perfect estimates into a more realistic setting in which algorithms must discover new content using historical statistics to estimate which pages are most likely to yield links to new content. We measure the overhead of discovering new content, defined as the average number of fetches required to discover one new page. We show first that with perfect foreknowledge of where to explore for links to new content, it is possible to discover 50% of all new content with under 3% overhead, and 100% of new content with 28% overhead. But actual algorithms, which do not have access to perfect foreknowledge, face a more difficult task: 26% of new content is accessible only by recrawling a constant fraction of the entire web. Of the remaining 74%, 80% of this content may be discovered within one week at discovery cost equal to 1.3X the cost of gathering the new content, in a model with full monthly recrawls.".
- 592 hasAuthorList authorList.
- 592 isPartOf proceedings.
- 592 title "The Discoverability of the Web".