Matches in ScholarlyData for { <https://w3id.org/scholarlydata/inproceedings/www2010/paper/main/398> ?p ?o. }
Showing items 1 to 10 of 10 with 100 items per page.
- 398 creator arnd-christian-koenig.
- 398 creator ping-li.
- 398 type InProceedings.
- 398 label "b-bit minwise hashing".
- 398 sameAs 398.
- 398 abstract "This paper establishes the theoretical framework of b-bit minwise hashing. The original minwise hashing method has become a standard technique for estimating set similarity (e.g., resemblance) with applications in information retrieval, data management, social networks and computational advertising. By only storing the lowest b bits of each (minwise) hashed value (e.g., b=1 or 2), one can gain substantial advantages in terms of computational efficiency and storage space. We prove the basic theoretical results and provide an unbiased estimator of the resemblance for any $b$. We demonstrate that, even in the least favorable scenario, using b=1 may reduce the storage space at least by a factor of 21.3 (or 10.7) compared to using b=64 (or b=32), if one is interested in resemblance >0.5. Our theoretical results are validated using a proprietary collection of $10^6$ news articles and a public (UCI) dataset of $300,000$ NYTimes articles.".
- 398 hasAuthorList authorList.
- 398 isPartOf proceedings.
- 398 keyword "Efficient algorithms for large-scale analysis".
- 398 title "b-bit minwise hashing".
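
The abstract above describes keeping only the lowest b bits of each minwise hashed value and then estimating resemblance from how often those bits agree. The following is a minimal sketch of that idea, not the authors' implementation. It rests on several assumptions: salted BLAKE2b hashes stand in for random permutations, the helper names (`_hash`, `bbit_signature`, `estimate_resemblance`) are hypothetical, and the correction constant 1/2^b is the simplified value for sets that are small relative to the universe. The abstract does not state the paper's exact estimator, which may use different constants.

```python
import hashlib


def _hash(seed: int, item: str) -> int:
    """64-bit hash of `item`; each seed acts as a separate hash function.

    Assumption: a salted hash is used here in place of a random permutation.
    """
    digest = hashlib.blake2b(
        item.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def bbit_signature(items, k: int, b: int):
    """For each of k hash functions, keep the lowest b bits of the minimum hash."""
    mask = (1 << b) - 1
    return [min(_hash(seed, x) for x in items) & mask for seed in range(k)]


def estimate_resemblance(sig1, sig2, b: int) -> float:
    """Estimate resemblance from the fraction of matching b-bit values.

    Assumption: two unrelated minima agree on b bits with probability 1/2^b,
    so the match rate is roughly c + (1 - c) * R with c = 1/2^b.
    """
    matches = sum(1 for u, v in zip(sig1, sig2) if u == v)
    p_hat = matches / len(sig1)
    c = 1.0 / (1 << b)
    return (p_hat - c) / (1.0 - c)


if __name__ == "__main__":
    # Two sets sharing 100 of 200 distinct items, so true resemblance is 0.5.
    set_a = [f"doc-{i}" for i in range(0, 150)]
    set_b = [f"doc-{i}" for i in range(50, 200)]
    sig_a = bbit_signature(set_a, k=500, b=1)
    sig_b = bbit_signature(set_b, k=500, b=1)
    print(estimate_resemblance(sig_a, sig_b, b=1))
```

With b=1 each signature entry takes a single bit instead of 64, which is the storage trade-off the abstract refers to; the price is a noisier estimate for the same number of hash functions k.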