Matches in ScholarlyData for { <https://w3id.org/scholarlydata/inproceedings/lrec2008/papers/481> ?p ?o. }
Showing items 1 to 13 of 13 with 100 items per page.
- 481 creator jan-pomikalek.
- 481 creator pavel-rychly.
- 481 type InProceedings.
- 481 label "Detecting Co-Derivative Documents in Large Text Collections".
- 481 sameAs 481.
- 481 abstract "We have analyzed the SPEX algorithm by Bernstein and Zobel (2004) for detecting co-derivative documents using duplicate n-grams. Although we totally agree with the claim that not using unique n-grams can greatly increase the efficiency and scalability of the process of detecting co-derivative documents, we have found serious bottlenecks in the way SPEX finds the duplicate n-grams. While the memory requirements for computing co-derivative documents can be reduced to up to 1% by only using duplicate n-grams, SPEX needs about 40 times more memory for computing the list of duplicate n-grams itself. Therefore the memory requirements of the whole process are not reduced enough to make the algorithm practical for very large collections. We propose a solution for this problem using an external sort with the suffix array in-memory sorting and temporary file compression. The proposed algorithm for computing duplicate n-grams uses a fixed amount of memory for any input size.".
- 481 hasAuthorList authorList.
- 481 hasTopic Linguistics.
- 481 isPartOf proceedings.
- 481 keyword "Corpus (creation, annotation, etc.)".
- 481 keyword "Digital libraries".
- 481 keyword "Document Classification, Text categorisation".
- 481 title "Detecting Co-Derivative Documents in Large Text Collections".
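
The abstract in this record describes computing duplicate n-grams with an external sort so that memory use stays bounded. The following is only a rough sketch of that general idea, not the authors' implementation: it swaps in Python's built-in sort for the suffix-array in-memory sort, omits temporary file compression, and assumes that a "duplicate" n-gram is one occurring at least twice in the input. The names `ngrams`, `duplicate_ngrams`, and `max_in_memory` are hypothetical and do not come from the paper.

```python
import heapq
import os
import tempfile
from itertools import groupby


def ngrams(tokens, n):
    """Yield each run of n consecutive tokens as a space-joined string."""
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n])


def _write_run(sorted_items, tmp_dir):
    """Write one sorted run to a temporary file, one n-gram per line."""
    fd, path = tempfile.mkstemp(dir=tmp_dir, suffix=".run", text=True)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        for item in sorted_items:
            f.write(item + "\n")
    return path


def _read_run(path):
    """Stream the n-grams of one sorted run back from disk."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")


def duplicate_ngrams(documents, n, max_in_memory=100_000):
    """Return the sorted n-grams that occur at least twice in documents.

    At most max_in_memory n-grams are buffered at once (assumed limit);
    each full buffer is sorted and spilled to disk as a run, and the
    runs are then combined with a streaming k-way merge.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        runs, buffer = [], []
        for doc in documents:
            for gram in ngrams(doc.split(), n):
                buffer.append(gram)
                if len(buffer) >= max_in_memory:
                    # Plain sort stands in for the suffix-array sort.
                    runs.append(_write_run(sorted(buffer), tmp_dir))
                    buffer = []
        if buffer:
            runs.append(_write_run(sorted(buffer), tmp_dir))

        # Equal n-grams are adjacent in the merged stream, so counting
        # each group is enough to spot the duplicates.
        merged = heapq.merge(*(_read_run(path) for path in runs))
        return [gram for gram, group in groupby(merged)
                if sum(1 for _ in group) > 1]
```

For example, `duplicate_ngrams(["a b c d", "x b c d y", "a b q"], n=2, max_in_memory=3)` forces several runs to be spilled and merged. The result is collected into a list here for convenience; a version that truly uses fixed memory would stream the duplicates to a file instead.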