ScholarlyData

Matches in ScholarlyData for { <https://w3id.org/scholarlydata/inproceedings/www2010/paper/main/221> ?p ?o. }

Showing items 1 to 13 of 13 with 100 items per page.

221 creator deepayan-chakrabarti.
221 creator rupesh-mehta.
221 type InProceedings.
221 label "The Paths More Taken: Matching DOM Trees to Search Logs for Accurate Webpage Clustering".
221 sameAs 221.
221 abstract "An unsupervised clustering of the webpages on a website is a primary requirement for most wrapper induction and automated data extraction methods. Since page content can vary drastically across pages of one cluster (e.g., all product pages on amazon.com), traditional clustering methods typically use some distance function between the DOM trees representing a pair of webpages. However, without knowing which portions of the DOM tree are 'important,' such distance functions might discriminate between similar pages based on trivial features (e.g., differing number of reviews on two product pages), or club together distinct types of pages based on superficial features present in the DOM trees of both (e.g., matching footer/copyright), leading to poor clustering performance. We propose using search logs to automatically find paths in the DOM trees that mark out important portions of pages, e.g., the product title in a product page. Such paths are identified via a global analysis of the entire website, whereby search data for popular pages can be used to infer good paths even for other pages that receive little or no search traffic. The webpages on the website are then clustered using these 'key' paths. Our algorithm only requires information on search queries, and the webpages clicked in response to them; there is no need for human input, and it does not need to be told which portion of a webpage the user found interesting. The resulting clusterings achieve an adjusted Rand score of over 0.9 on half of the websites (a score of 1 indicating a perfect clustering), and 59% better scores on average than competing algorithms. Besides leading to refined clusterings, these key paths can be useful in the wrapper induction process itself, as shown by the high degree of match between the key paths and the manually identified paths used in existing wrappers for these sites (90% average precision).".
221 hasAuthorList authorList.
221 isPartOf proceedings.
221 keyword "Text".
221 keyword "classification".
221 keyword "metadata clustering".
221 keyword "web page".
221 title "The Paths More Taken: Matching DOM Trees to Search Logs for Accurate Webpage Clustering".
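
The listing above is the result of matching a triple pattern with a fixed subject and free predicate and object ({ <subject> ?p ?o. }). As a rough illustration of what that matching means, here is a minimal Python sketch over a small in-memory list. The match function, the TRIPLES sample, and the urn:example:other-paper subject are hypothetical names introduced for this example; predicates and objects use the abbreviated forms shown in the listing rather than full URIs, and this is not how the ScholarlyData server itself is implemented.

```python
from typing import Iterable, Iterator, Optional, Tuple

Triple = Tuple[str, str, str]

PAPER = "https://w3id.org/scholarlydata/inproceedings/www2010/paper/main/221"

# Hypothetical sample: a few triples from the listing (abbreviated names),
# plus one made-up triple about another subject so the filter excludes something.
TRIPLES = [
    (PAPER, "creator", "deepayan-chakrabarti"),
    (PAPER, "creator", "rupesh-mehta"),
    (PAPER, "type", "InProceedings"),
    (PAPER, "keyword", "web page"),
    ("urn:example:other-paper", "type", "InProceedings"),
]


def match(
    triples: Iterable[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield triples matching the pattern; None plays the role of a variable."""
    pattern = (s, p, o)
    for triple in triples:
        if all(want is None or want == got for want, got in zip(pattern, triple)):
            yield triple


if __name__ == "__main__":
    # Equivalent in spirit to { <PAPER> ?p ?o. }
    for _, predicate, obj in match(TRIPLES, s=PAPER):
        print("221", predicate, obj)
```

Fixing the predicate as well, for example match(TRIPLES, s=PAPER, p="creator"), narrows the result to the two creator triples.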