Matches in ScholarlyData for { <https://w3id.org/scholarlydata/inproceedings/www2007/paper/main/215> ?p ?o. }
Showing items 1 to 10 of 10 with 100 items per page.
- 215 creator anish-das-sarma.
- 215 creator arvind-jain.
- 215 creator gurmeet-manku.
- 215 type InProceedings.
- 215 label "Detecting Near-Duplicates for Web Crawling".
- 215 sameAs 215.
- 215 abstract "Near-duplicate documents are commonly found on the web. A pair of near-duplicate web pages differ from each other in a very small portion. The differences commonly consist of advertisements and timestamps. Such differences are irrelevant for web search. During web crawling, it is useful to quickly ascertain whether a newly crawled web page is a near-duplicate of a previously crawled web page or not. In the course of developing a practical system for near-duplicate detection, we make two research contributions. First, we demonstrate the effectiveness of Charikar's fingerprinting technique for identifying near-duplicate web pages. We show that for 8 billion web-pages, a good choice of parameters is 64-bit fingerprints and 3-bit Hamming distances. Second, we present an algorithmic technique for identifying existing f-bit fingerprints that differ from a given fingerprint in at most k bit-positions, for small k. Our technique is useful for both online queries (single fingerprints) and batch queries (multiple fingerprints). Experimental evaluation over real data confirms the practicality of our design.".
- 215 hasAuthorList authorList.
- 215 isPartOf proceedings.
- 215 title "Detecting Near-Duplicates for Web Crawling".
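
The abstract above describes treating two pages as near-duplicates when their 64-bit fingerprints differ in at most k = 3 bit positions. A minimal Python sketch of that comparison follows. It assumes the fingerprints have already been computed and are held as plain integers; the function names and sample values are hypothetical, and the linear scan is a naive stand-in for illustration, not the indexing technique the paper presents.

```python
from typing import List


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def near_duplicates(query: int, fingerprints: List[int], k: int = 3) -> List[int]:
    """Return stored fingerprints within Hamming distance k of the query.

    Naive linear scan (an assumption for illustration only); it checks
    every stored fingerprint in turn.
    """
    return [fp for fp in fingerprints if hamming_distance(query, fp) <= k]


# Hypothetical 64-bit fingerprints, written in hex for readability.
stored = [
    0x0000000000000000,  # identical to the query
    0x0000000000000007,  # differs from the query in 3 bits
    0x000000000000000F,  # differs from the query in 4 bits
    0xFFFFFFFFFFFFFFFF,  # differs from the query in all 64 bits
]

matches = near_duplicates(0x0000000000000000, stored, k=3)
print([hex(fp) for fp in matches])
```

With k = 3, only the first two sample fingerprints qualify. A scan like this touches every stored fingerprint, which would not scale to the 8 billion pages mentioned in the abstract.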