Matches in ScholarlyData for { <https://w3id.org/scholarlydata/inproceedings/www2010/paper/main/647> ?p ?o. }
Showing items 1 to 13 of 13 with 100 items per page.
- 647 creator evan-kirshenbaum.
- 647 creator george-forman.
- 647 creator shyam-rajaram.
- 647 type InProceedings.
- 647 label "A Novel Traffic Analysis for Identifying Search Fields in the Long Tail of Web Sites".
- 647 sameAs 647.
- 647 abstract "Using a clickstream sample of 2 billion URLs from many thousand volunteer Web users, we wish to analyze typical usage of keyword searches across the Web. In order to do this, we need to be able to determine whether a given URL represents a keyword search and, if so, which field contains the query. Although it is easy to recognize `q' as the query field in `http://www.google.com/search?hl=en\&q=music', we must do this automatically for the long tail of diverse websites. This problem is the focus of this paper. Since the names, types and number of fields differ across sites, this does not conform to traditional text classification or to multi-class problem formulations. The problem also exhibits highly non-uniform importance across websites, since traffic follows a Zipf distribution. We developed a solution based on manually identifying the query fields on the most popular sites, followed by an adaptation of machine learning for the rest. It involves an interesting case-instances structure: labeling each website `case' usually involves selecting at most one of the field `instances' as positive, based on seeing sample field values. This problem structure and soft constraint---which we believe has broader applicability---can be used to greatly reduce the manual labeling effort. We employed active learning and judicious GUI presentation to efficiently train a classifier with accuracy estimated at 96%, beating several baseline alternatives.".
- 647 hasAuthorList authorList.
- 647 isPartOf proceedings.
- 647 keyword "Query log".
- 647 keyword "click trail".
- 647 keyword "traffic data mining".
- 647 title "A Novel Traffic Analysis for Identifying Search Fields in the Long Tail of Web Sites".
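
The abstract above notes that `q` is easy to recognize as the query field in the Google URL it cites. As a minimal sketch of the first step only (listing the candidate fields of a URL, not the paper's classifier), the following uses Python's standard `urllib.parse`. The function name `candidate_fields` is hypothetical and chosen for illustration; the URL is the one from the abstract with the LaTeX escape `\&` written as a plain `&`.

```python
from urllib.parse import urlsplit, parse_qsl


def candidate_fields(url):
    """Return the (name, value) pairs found in a URL's query string.

    Hypothetical helper: it only enumerates the field 'instances' for a
    URL; deciding which one holds the keyword query is the problem the
    paper addresses and is not attempted here.
    """
    query = urlsplit(url).query
    # keep_blank_values=True so that empty fields are still listed.
    return parse_qsl(query, keep_blank_values=True)


if __name__ == "__main__":
    for name, value in candidate_fields("http://www.google.com/search?hl=en&q=music"):
        print(name, value)
```

Each returned pair corresponds to one field instance in the case-instances structure the abstract describes; a labeler or classifier would then select at most one of them as the query field.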