Matches in ScholarlyData for { <https://w3id.org/scholarlydata/inproceedings/www2010/paper/main/583> ?p ?o. }
Showing items 1 to 13 of 13 with 100 items per page.
- 583 creator lei-zhang-2.
- 583 creator rui-cai.
- 583 type InProceedings.
- 583 label "A Pattern Tree-based Approach to Learning URL Normalization Rules".
- 583 sameAs 583.
- 583 abstract "Duplicate URLs have brought serious troubles to the whole pipeline of a search engine, from crawling, indexing, to result serving. URL normalization is to transform duplicate URLs to a canonical form using a set of rewrite rules. Nowadays, URL normalization is attracting significant attention as it is lightweight and can be flexibly integrated into both the online (e.g. crawling) and offline (e.g. index compression) parts of a search engine. To deal with a large scale of websites, automatic approaches are highly desired to learn rewrite rules for various kinds of duplicate URLs. In this paper, we rethink the problem of URL normalization from a global perspective and propose a pattern tree-based approach. This is remarkably different from existing approaches, which normalize URLs by iteratively inducing local duplicate pairs to a more general form, and inevitably suffer from noisy training URLs and the low efficiency problem in practical systems. Given a training set of URLs partitioned into duplicate clusters for a targeted website, we develop a simple yet efficient algorithm to automatically construct a URL pattern tree. With the constructed pattern tree, we can leverage the statistical information from all the training samples and make the learning process more robust and reliable. The learning process can also be accelerated as rules are directly summarized based on pattern tree nodes. In addition, from the engineering side, the pattern tree can help select deployable rules by removing conflictions and redundancies. A large-scale evaluation on more than 70 million duplicate URLs from 200 websites showed the proposed approach can achieve very promising performance, in terms of both de-duping effectiveness and computational efficiency.".
- 583 hasAuthorList authorList.
- 583 isPartOf proceedings.
- 583 keyword "Indexing".
- 583 keyword "caching".
- 583 keyword "distribution".
- 583 keyword "index compression".
- 583 title "A Pattern Tree-based Approach to Learning URL Normalization Rules".
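
The pattern in the header, `{ <subject> ?p ?o. }`, fixes the subject and leaves predicate and object as variables, so it matches every triple about this paper. A minimal sketch of that matching in Python follows. It assumes the triples are held in memory as plain tuples using the shortened names from the listing. The `match` helper and the `other-paper` row are hypothetical, added only for illustration, and are not part of ScholarlyData or any library.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

SUBJECT = "https://w3id.org/scholarlydata/inproceedings/www2010/paper/main/583"

# A few triples from the listing, in its shortened form (assumed representation).
TRIPLES: List[Triple] = [
    (SUBJECT, "creator", "lei-zhang-2"),
    (SUBJECT, "creator", "rui-cai"),
    (SUBJECT, "type", "InProceedings"),
    (SUBJECT, "keyword", "Indexing"),
    (SUBJECT, "keyword", "caching"),
    # Hypothetical row about a different subject, to show it is filtered out.
    ("other-paper", "type", "InProceedings"),
]


def match(
    triples: List[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> List[Triple]:
    """Return the triples matching a pattern; None plays the role of a variable."""
    return [
        t
        for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Equivalent of { <subject> ?p ?o. }: subject fixed, predicate and object free.
about_paper = match(TRIPLES, s=SUBJECT)

# Narrowing the pattern further, e.g. { <subject> creator ?o. }.
creators = [o for _, _, o in match(TRIPLES, s=SUBJECT, p="creator")]
```

Leaving a position as `None` corresponds to a variable such as `?p` or `?o`. Fixing more positions narrows the result in the same way a more specific triple pattern would.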