Matches in ScholarlyData for { <https://w3id.org/scholarlydata/inproceedings/lrec2008/papers/152> ?p ?o. }
- 152 creator oana-frunza.
- 152 type InProceedings.
- 152 label "A Trainable Tokenizer, solution for multilingual texts and compound expression tokenization".
- 152 sameAs 152.
- 152 abstract "Tokenization is one of the initial steps done for almost any text processing task. It is not particularly recognized as a challenging task for English monolingual systems but it rapidly increases in complexity for systems that apply it for different languages. This article proposes a supervised learning approach to perform the tokenization task. The method presented in this article is based on character transitions representation, a representation that allows compound expressions to be recognized as a single token. Compound tokens are identified independent of the character that creates the expression. The method automatically learns tokenization rules from a pre-tokenized corpus. The results obtained using the trainable system show that for Romanian and English a statistical significant improvement is obtained over a baseline system that tokenizes texts on every non-alphanumeric character.".
- 152 hasAuthorList authorList.
- 152 hasTopic Linguistics.
- 152 isPartOf proceedings.
- 152 keyword "Acquisition, Machine Learning".
- 152 keyword "Multilinguality".
- 152 keyword "Text mining".
- 152 title "A Trainable Tokenizer, solution for multilingual texts and compound expression tokenization".
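
The listing above is the result of matching the pattern in the first line against the ScholarlyData dataset. A minimal SPARQL sketch of that lookup is shown below; it assumes access to a ScholarlyData SPARQL endpoint (not named in this listing) or any other store loaded with the dataset.

```sparql
# Retrieve every predicate/object pair attached to the paper resource,
# i.e. the 12 triples listed above.
SELECT ?p ?o
WHERE {
  <https://w3id.org/scholarlydata/inproceedings/lrec2008/papers/152> ?p ?o .
}
```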