Matches in ScholarlyData for { <https://w3id.org/scholarlydata/inproceedings/lrec2008/papers/152> ?p ?o. }
- 152 creator oana-frunza.
- 152 type InProceedings.
- 152 label "A Trainable Tokenizer, solution for multilingual texts and compound expression tokenization".
- 152 sameAs 152.
- 152 abstract "Tokenization is one of the initial steps done for almost any text processing task. It is not particularly recognized as a challenging task for English monolingual systems but it rapidly increases in complexity for systems that apply it for different languages. This article proposes a supervised learning approach to perform the tokenization task. The method presented in this article is based on character transitions representation, a representation that allows compound expressions to be recognized as a single token. Compound tokens are identified independent of the character that creates the expression. The method automatically learns tokenization rules from a pre-tokenized corpus. The results obtained using the trainable system show that for Romanian and English a statistical significant improvement is obtained over a baseline system that tokenizes texts on every non-alphanumeric character.".
- 152 hasAuthorList authorList.
- 152 hasTopic Linguistics.
- 152 isPartOf proceedings.
- 152 keyword "Acquisition, Machine Learning".
- 152 keyword "Multilinguality".
- 152 keyword "Text mining".
- 152 title "A Trainable Tokenizer, solution for multilingual texts and compound expression tokenization".
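
The listing above is the result of matching the pattern in the first line against the ScholarlyData dataset. A minimal SPARQL sketch of that lookup is shown below; it assumes access to a ScholarlyData SPARQL endpoint (not named in this listing) or any other store loaded with the dataset.

```sparql
# Retrieve every predicate/object pair attached to the paper resource,
# i.e. the 12 triples listed above.
SELECT ?p ?o
WHERE {
  <https://w3id.org/scholarlydata/inproceedings/lrec2008/papers/152> ?p ?o .
}
```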