Matches in ScholarlyData for { <https://w3id.org/scholarlydata/inproceedings/www2008/paper/454> ?p ?o. }
Showing items 1 to 17 of 17 with 100 items per page.
- 454 creator jiang-ming-yang.
- 454 creator lei-zhang-2.
- 454 creator rui-cai.
- 454 creator wei-lai.
- 454 creator yida-wang.
- 454 type InProceedings.
- 454 label "iRobot: An Intelligent Crawler for Web Forums".
- 454 sameAs 454.
- 454 abstract "We study in this paper the Web forum crawling problem, which is a very fundamental step in many Web applications, such as search engine and Web data mining. As a typical user-created content (UCC), Web forum has become an important resource on the Web due to its rich information contributed by millions of Internet users every day. However, Web forum crawling is not a trivial problem due to the in-depth link structures, the large amount of duplicate pages, as well as many invalid pages caused by login failure issues. In this paper, we propose and build a prototype of an intelligent forum crawler, iRobot, which has intelligence to understand the content and the structure of a forum site, and then decide how to choose crawling routings among different kinds of pages. To do this, we first randomly sample (download) a few pages from the target forum site, and introduce the page content layout as the characteristics to group those pre-sampled pages and recover the forum's sitemap. After that, we select an optimal routing path which only traverses informative pages and skips invalid and duplicate ones. The extensive experimental results on several forums show the performance of our system in the following aspects: 1) Effectiveness – Compared to a generic crawler, iRobot significantly decreases the duplicate and invalid pages; 2) Efficiency – With a small cost of pre-sampling a few pages for learning the necessary knowledge, iRobot saves substantial network bandwidth and storage as it only fetches informative pages from a forum site; and 3) Long threads that are divided into multiple pages can be re-concatenated and archived as a whole thread, which is of great help for further indexing and data mining.".
- 454 hasAuthorList authorList.
- 454 hasTopic World_Wide_Web.
- 454 isPartOf proceedings.
- 454 keyword "Forum crawler".
- 454 keyword "repetitive pattern".
- 454 keyword "routing selection".
- 454 keyword "sitemap".
- 454 title "iRobot: An Intelligent Crawler for Web Forums".
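
The triple pattern in the header, `{ <…/paper/454> ?p ?o. }`, matches every predicate and object attached to this paper. As a minimal sketch, the same pattern could be written as a SPARQL `SELECT` query. The helper name `triple_pattern_query` is hypothetical, and the sketch only builds the query string; it assumes you have some SPARQL endpoint to send it to, which this page does not specify.

```python
def triple_pattern_query(subject_iri: str) -> str:
    """Build a SPARQL SELECT query equivalent to the pattern { <subject> ?p ?o. }.

    Hypothetical helper for illustration; it is not part of ScholarlyData.
    """
    # Reject characters that may not appear inside a SPARQL IRI reference.
    if any(ch in subject_iri for ch in '<>"{}|^`\\ '):
        raise ValueError(f"not a valid IRI: {subject_iri!r}")
    # Doubled braces are literal braces in an f-string.
    return f"SELECT ?p ?o WHERE {{ <{subject_iri}> ?p ?o . }}"


# Subject IRI taken from the header of this listing.
PAPER_IRI = "https://w3id.org/scholarlydata/inproceedings/www2008/paper/454"

query = triple_pattern_query(PAPER_IRI)
print(query)
```

Each row returned by such a query would correspond to one item in the list above, with `?p` bound to the predicate (for example `creator` or `keyword`) and `?o` to its value.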