Matches in ScholarlyData for { <https://w3id.org/scholarlydata/inproceedings/wac7/data.semanticweb.org/workshop/wac7/2012/paper/4> ?p ?o. }
Showing items 1 to 14 of 14 with 100 items per page.
- 4 creator jan-pomikalek.
- 4 creator vit-suchomel.
- 4 type InProceedings.
- 4 label "Efficient Web Crawling for Large Text Corpora".
- 4 sameAs 4.
- 4 abstract "Many researchers use texts from the web, an easy source of linguistic data in a great variety of languages. Building both large and good quality text corpora is the challenge we face nowadays. In this paper we describe how to deal with inefficient data downloading and how to focus crawling on text rich web domains. The idea has been successfully implemented in SpiderLing. We present efficiency figures from crawling texts in American Spanish, Czech, Japanese, Russian, Tajik Persian, Turkish and the sizes of the resulting corpora.".
- 4 hasAuthorList authorList.
- 4 isPartOf proceedings.
- 4 keyword "corpus".
- 4 keyword "crawler".
- 4 keyword "text corpus".
- 4 keyword "web corpus".
- 4 keyword "web crawling".
- 4 title "Efficient Web Crawling for Large Text Corpora".
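
The listing above is the result of matching the pattern `{ <subject> ?p ?o }`, so every row shares one subject and differs only in predicate and object. A minimal sketch of working with those rows offline follows, assuming they are copied by hand into Python tuples. The short names (`creator`, `keyword`, and so on) are the page's display labels, not full IRIs, and the helpers `group_by_predicate` and `match` are hypothetical names introduced here for illustration; they are not part of ScholarlyData or any library. The abstract is omitted for brevity.

```python
from collections import defaultdict

# Subject IRI copied from the triple pattern at the top of the page.
SUBJECT = (
    "https://w3id.org/scholarlydata/inproceedings/wac7/"
    "data.semanticweb.org/workshop/wac7/2012/paper/4"
)

# (predicate, object) pairs as displayed in the listing. These are display
# labels, not full IRIs; the abstract row is left out to keep this short.
PAIRS = [
    ("creator", "jan-pomikalek"),
    ("creator", "vit-suchomel"),
    ("type", "InProceedings"),
    ("label", "Efficient Web Crawling for Large Text Corpora"),
    ("sameAs", "4"),
    ("hasAuthorList", "authorList"),
    ("isPartOf", "proceedings"),
    ("keyword", "corpus"),
    ("keyword", "crawler"),
    ("keyword", "text corpus"),
    ("keyword", "web corpus"),
    ("keyword", "web crawling"),
    ("title", "Efficient Web Crawling for Large Text Corpora"),
]


def group_by_predicate(pairs):
    """Collect all objects under their predicate, preserving listing order."""
    grouped = defaultdict(list)
    for predicate, obj in pairs:
        grouped[predicate].append(obj)
    return dict(grouped)


def match(pairs, predicate=None, obj=None):
    """Filter pairs like a triple pattern; None acts as a variable."""
    return [
        (p, o)
        for p, o in pairs
        if (predicate is None or p == predicate) and (obj is None or o == obj)
    ]


if __name__ == "__main__":
    grouped = group_by_predicate(PAIRS)
    print(grouped["creator"])
    print(match(PAIRS, predicate="keyword"))
```

Passing `None` for both arguments to `match` returns every pair, which mirrors the `?p ?o` pattern used for this page; fixing `predicate` narrows the result in the same way a more specific pattern would.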