Matches in ScholarlyData for { <https://w3id.org/scholarlydata/inproceedings/www2010/paper/main/796> ?p ?o. }
Showing items 1 to 14 of 14 with 100 items per page.
- 796 creator jiawei-han.
- 796 creator tim-weninger.
- 796 creator william-hsu.
- 796 type InProceedings.
- 796 label "CETR - Content Extraction via Tag Ratios".
- 796 sameAs 796.
- 796 abstract "We present Content Extraction via Tag Ratios (CETR) -- a method to extract content text from diverse Web pages by using the HTML document's tag ratios. We describe how to compute tag ratios on a line-by-line basis and then cluster the resulting histogram into content and non-content areas. Initially, we find that the tag ratio histogram is not easily clustered because of its one-dimensionality; therefore we extend the original approach in order to model the data in two dimensions. Next, we present a tailored clustering technique which operates on the two-dimensional model, and then evaluate our approach against a large set of alternative methods using standard accuracy, precision and recall metrics on a large and varied Web corpus. Finally, we show that, in most cases, CETR achieves better content extraction performance than existing methods, especially across varying web domains, languages and styles.".
- 796 hasAuthorList authorList.
- 796 isPartOf proceedings.
- 796 keyword "Negative content filtering".
- 796 keyword "porn".
- 796 keyword "spam".
- 796 keyword "viruses".
- 796 title "CETR - Content Extraction via Tag Ratios".
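
The abstract above mentions computing tag ratios on a line-by-line basis. A minimal sketch of that first step might look like the following. It assumes a tag ratio is the number of non-tag characters on a line divided by the number of HTML tags on that line, with a tag count of zero treated as one; the abstract does not state this definition. The names `TAG_RE`, `tag_ratio`, and `tag_ratios` are hypothetical, and the two-dimensional model and clustering steps are not covered.

```python
import re
from typing import List

# Assumption: a "tag" is anything between '<' and '>' on a single line.
TAG_RE = re.compile(r"<[^>]*>")

def tag_ratio(line: str) -> float:
    """Non-tag characters divided by tag count for one line (assumed definition)."""
    tag_count = len(TAG_RE.findall(line))
    text_length = len(TAG_RE.sub("", line).strip())
    # Assumption: a line without tags is treated as having one tag,
    # which avoids division by zero.
    return text_length / max(tag_count, 1)

def tag_ratios(html: str) -> List[float]:
    """Tag ratio for every line of an HTML document, in order."""
    return [tag_ratio(line) for line in html.splitlines()]
```

For example, `tag_ratio("<p>Hello world</p>")` divides 11 text characters by 2 tags. Under this assumed definition, lines dominated by markup score low and lines dominated by text score high, which is the signal the abstract describes clustering into content and non-content areas.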