Matches in ScholarlyData for { <https://w3id.org/scholarlydata/inproceedings/lrec2008/papers/662> ?p ?o. }
Showing items 1 to 14 of 14 with 100 items per page.
- 662 creator christoph-ringlstetter.
- 662 creator randy-goebel.
- 662 creator weiruo-qu.
- 662 type InProceedings.
- 662 label "Targeting Chinese Nominal Compounds in Corpora".
- 662 sameAs 662.
- 662 abstract "For compounding languages, a great part of the topical semantics is transported via nominal compounds. Various applications of natural language processing can profit from explicit access to these compounds, provided by a lexicon. The best way to acquire such a resource is to harvest corpora that represent the domain in question. For Chinese, a significant difficulty lies in the fact that the text comes as a string of characters, only segmented by sentence boundaries. Extraction algorithms that solely rely on context variety do not perform precisely enough. We propose a pipeline of filters that starts from a candidate set established by accessor variety and then employs several methods to improve precision. For the experiments the Xinhua part of the Chinese Gigaword Corpus was used. We extracted a random sample of 200 story texts with 119,509 Hanzi characters. All compound words of this evaluation corpus were tagged, segmented into their morphemes, and augmented with the POS-information of their segments. A cascade of filters applied to a preliminary set of compound candidates led to a very high precision of over 90%, measured for the types. The result also holds for a small corpus where a solely contextual method introduces too much noise, even for the longer compounds. An introduction of MI into the basic candidacy algorithm led to a much higher recall with still reasonable precision for subsequent manual processing. Especially for the four-character compounds, that in our sample represent over 40% of the target data, the method has sufficient efficacy to support the rapid construction of compound dictionaries from domain corpora.".
- 662 hasAuthorList authorList.
- 662 hasTopic Linguistics.
- 662 isPartOf proceedings.
- 662 keyword "Corpus (creation, annotation, etc.)".
- 662 keyword "MultiWord Expressions & Collocations".
- 662 keyword "Statistical methods".
- 662 title "Targeting Chinese Nominal Compounds in Corpora".
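
The pattern at the top of this listing fixes the subject and leaves `?p` and `?o` as variables, so every triple about paper 662 matches. The following is a minimal sketch of that matching logic over a small in-memory sample, not the ScholarlyData server's implementation. The helper `match_pattern`, the abbreviated predicate and object names (copied from the listing rather than full IRIs), and the `example.org` subject are assumptions made for illustration.

```python
# Subject IRI taken from the pattern at the top of the listing.
SUBJECT = "https://w3id.org/scholarlydata/inproceedings/lrec2008/papers/662"

# Hypothetical in-memory sample: a few triples from the listing, with
# predicates and objects abbreviated as displayed (not full IRIs).
TRIPLES = [
    (SUBJECT, "creator", "christoph-ringlstetter"),
    (SUBJECT, "creator", "randy-goebel"),
    (SUBJECT, "creator", "weiruo-qu"),
    (SUBJECT, "type", "InProceedings"),
    (SUBJECT, "title", "Targeting Chinese Nominal Compounds in Corpora"),
    # Hypothetical triple with a different subject; it must not match.
    ("https://example.org/other-paper", "type", "InProceedings"),
]


def match_pattern(triples, s=None, p=None, o=None):
    """Return the triples that match a pattern; None acts as a variable."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Equivalent of { <.../papers/662> ?p ?o. }: subject fixed, rest variable.
matches = match_pattern(TRIPLES, s=SUBJECT)

# Narrowing the pattern by also fixing the predicate.
creators = [o for _, _, o in match_pattern(TRIPLES, s=SUBJECT, p="creator")]
```

Fixing more positions of the pattern narrows the result: with only the subject bound, all sample triples about the paper are returned, while binding the predicate to `creator` keeps only the three author triples.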