Matches in ScholarlyData for { <https://w3id.org/scholarlydata/inproceedings/www2008/paper/865> ?p ?o. }
Showing items 1 to 14 of 14 with 100 items per page.
- 865 creator derek-leonard.
- 865 creator dmitri-loguinov.
- 865 creator hsin-tsang-lee.
- 865 type InProceedings.
- 865 label "IRLbot: Scaling to 6 Billion Pages and Beyond".
- 865 sameAs 865.
- 865 abstract "This paper shares our experience in designing web crawlers that scale to billions of pages and models their performance. We show that with the quadratically increasing complexity of verifying URL uniqueness, BFS crawl order, and fixed per-host rate-limiting, current crawling algorithms cannot effectively cope with the sheer volume of URLs generated in large crawls, highly-branching spam, legitimate multi-million-page blog sites, and infinite loops created by server-side scripts. We offer a set of techniques for dealing with these issues and test their performance in an implementation we call IRLbot. In our recent experiment that lasted 41 days, IRLbot running on a single server successfully crawled 6.3 billion valid HTML pages (7.6 billion connection requests) and sustained an average download rate of 308 mb/s (1789 pages/s). Unlike our prior experiments with algorithms proposed in related work, this version of IRLbot did not experience any bottlenecks and successfully handled content from over 260 million hosts, parsed out 394 billion links, and discovered a subset of the web graph with 41 billion unique nodes.".
- 865 hasAuthorList authorList.
- 865 hasTopic World_Wide_Web.
- 865 isPartOf proceedings.
- 865 keyword "algorithms".
- 865 keyword "crawler".
- 865 keyword "scalability".
- 865 title "IRLbot: Scaling to 6 Billion Pages and Beyond".