Data Portal @ linkeddatafragments.org

ScholarlyData

Matches in ScholarlyData for { <https://w3id.org/scholarlydata/inproceedings/lrec2008/papers/481> ?p ?o. }

481 creator jan-pomikalek.
481 creator pavel-rychly.
481 type InProceedings.
481 label "Detecting Co-Derivative Documents in Large Text Collections".
481 sameAs 481.
481 abstract "We have analyzed the SPEX algorithm by Bernstein and Zobel (2004) for detecting co-derivative documents using duplicate n-grams. Although we totally agree with the claim that not using unique n-grams can greatly increase the efficiency and scalability of the process of detecting co-derivative documents, we have found serious bottlenecks in the way SPEX finds the duplicate n-grams. While the memory requirements for computing co-derivative documents can be reduced to up to 1% by only using duplicate n-grams, SPEX needs about 40 times more memory for computing the list of duplicate n-grams itself. Therefore the memory requirements of the whole process are not reduced enough to make the algorithm practical for very large collections. We propose a solution for this problem using an external sort with the suffix array in-memory sorting and temporary file compression. The proposed algorithm for computing duplicate n-grams uses a fixed amount of memory for any input size.".
481 hasAuthorList authorList.
481 hasTopic Linguistics.
481 isPartOf proceedings.
481 keyword "Corpus (creation, annotation, etc.)".
481 keyword "Digital libraries".
481 keyword "Document Classification, Text categorisation".
481 title "Detecting Co-Derivative Documents in Large Text Collections".
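
The abstract above describes finding duplicate n-grams by sorting rather than by keeping every n-gram in memory. A minimal sketch of that general idea follows. It is not the authors' implementation: it does not use a suffix array or temporary file compression, it keeps the sorted runs in memory instead of writing them to disk, and the names `ngrams`, `sorted_runs`, `duplicate_ngrams`, and `run_size` are hypothetical. It also assumes that an n-gram counts as "duplicate" when it occurs more than once in the whole collection.

```python
import heapq
from itertools import groupby, islice


def ngrams(tokens, n):
    """Yield every contiguous window of n tokens as a tuple."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def sorted_runs(items, run_size):
    """Cut a stream into sorted runs of at most run_size items.

    In a real external sort each run would be written to a temporary
    file; here the runs stay in memory to keep the sketch short.
    """
    it = iter(items)
    while True:
        run = sorted(islice(it, run_size))
        if not run:
            return
        yield run


def duplicate_ngrams(docs, n, run_size=1000):
    """Return the n-grams that occur more than once across docs.

    Assumption: documents are whitespace-tokenized strings.
    """
    stream = (g for doc in docs for g in ngrams(doc.split(), n))
    runs = list(sorted_runs(stream, run_size))
    # Merging the sorted runs puts equal n-grams next to each other,
    # so duplicates can be found in a single sequential pass.
    merged = heapq.merge(*runs)
    return [g for g, grp in groupby(merged) if sum(1 for _ in grp) > 1]


if __name__ == "__main__":
    docs = ["a b c d", "x b c d y", "q r s"]
    print(duplicate_ngrams(docs, 3, run_size=2))
```

Only one run needs to be sorted at a time, which is why an approach of this shape can work with a bounded amount of memory once the runs are kept on disk.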