Matches in ScholarlyData for { <https://w3id.org/scholarlydata/inproceedings/wac7/data.semanticweb.org/workshop/wac7/2012/paper/2> ?p ?o. }
Showing items 1 to 12 of 12 with 100 items per page.
- 2 creator yana-panchenko.
- 2 creator yannick-versley.
- 2 type InProceedings.
- 2 label "Not Just Bigger: Towards Better-Quality Web Corpora".
- 2 sameAs 2.
- 2 abstract "For the acquisition of common-sense knowledge as well as a way to answer linguistic questions regarding actual language usage, the breadth and depth of the World Wide Web has been welcomed to supplement large text corpora (usually from newspapers) as a useful resource. Previous research using Web corpora for either knowledge acquisition or information extraction has sometimes shown them to be less useful than newspaper or newswire corpora. More than a-priori criticism from corpus linguists on the difficulty of assuring balanced text composition and/or text quality does, these empirical results underline the importance of assessing and improving the quality of Web corpora to ensure their usefulness in real-world tasks. In this paper, we present our own pipeline for Web corpora, which includes improvements regarding content-sensitive boilerplate detection as well as language filtering for mixed-language documents, and provide a task-based evaluation of the combination of corpora and (non-linguistic and linguistic) preprocessing between more standard types of large corpora (newspaper and Wikipedia) and different Web corpora. While our current results are focused on German-language Web corpora, both the content-sensitive boilerplate detection and our method of evaluation by constructing an artificial thesaurus from a wordnet are applicable to many other languages.".
- 2 hasAuthorList authorList.
- 2 isPartOf proceedings.
- 2 keyword "boilerplate removal".
- 2 keyword "corpus building".
- 2 keyword "evaluation".
- 2 title "Not Just Bigger: Towards Better-Quality Web Corpora".
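
The listing above follows a regular "- subject predicate object." shape, so it can be grouped by predicate with a few lines of Python. This is only a minimal sketch: the regular expression and the `parse_listing` helper are hypothetical names written for this page's abbreviated display format, which is not a standard RDF serialization, and the sample copies just a few of the lines shown above.

```python
import re
from collections import defaultdict

# A few lines copied from the listing above; "2" is the abbreviated subject.
LISTING = '''\
- 2 creator yana-panchenko.
- 2 creator yannick-versley.
- 2 type InProceedings.
- 2 keyword "boilerplate removal".
- 2 keyword "corpus building".
- 2 keyword "evaluation".
- 2 title "Not Just Bigger: Towards Better-Quality Web Corpora".
'''

# Assumed pattern for this page's display format (not an RDF syntax):
# "- <subject> <predicate> <object>." with a trailing full stop.
LINE_RE = re.compile(r'^- (\S+) (\S+) (.+)\.$')


def parse_listing(text):
    """Group objects by predicate; quoted literals lose their quotes."""
    by_predicate = defaultdict(list)
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a "- s p o." line
        _subject, predicate, obj = match.groups()
        by_predicate[predicate].append(obj.strip('"'))
    return dict(by_predicate)


properties = parse_listing(LISTING)
print(properties["keyword"])
```

Grouping by predicate keeps multi-valued properties such as creator and keyword together as lists, while single-valued ones such as title become one-element lists.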