Matches in ScholarlyData for { <https://w3id.org/scholarlydata/inproceedings/www2010/paper/main/206> ?p ?o. }
Showing items 1 to 15 of 15 with 100 items per page.
- 206 creator david-soukal.
- 206 creator fritz-behr.
- 206 creator hongwen-kang.
- 206 creator kuansan-wang.
- 206 creator zijian-zheng.
- 206 type InProceedings.
- 206 label "0-Cost Semisupervised Bot Detection for Search Engines".
- 206 sameAs 206.
- 206 abstract "In this paper, we propose a semi-supervised learning approach for classifying program (bot) generated web search traffic from that of genuine human users. This is a crucial problem for web search engines because bot traffic significantly affects both the realtime search engine performance and the quality of the offline data-mining results of the search logs. However, the enormous amount of search data poses a challenging problem for traditional approaches that rely on fully annotated training samples. To this end, we propose a novel semi-supervised framework that addresses the problem on multiple fronts. First, we use the CAPTCHA technique and simple heuristics to extract from the data logs search sessions that are likely to be recordings of genuine human users and bot generated. This step outputs a large set of training samples with initial labels, though directly using these training data sources is problematic because they are biased samples of the whole data space and therefore can lead a machine learning system to acquire a skewed decision boundary that does not generalize well for unseen data. To tackle this problem, we further develop a semi-supervised learning algorithm that takes advantage of unlabeled data to improve the classification performance. These two proposed algorithms are seamlessly combined and have the following advantages. First, it becomes very cost efficient to generate a large amount of labeled data to initialize the training process. Second, our semi-supervised learning approach is very effective and resilient against the bias issue in the data generation process. In our experiment, the proposed approach showed significant (i.e. 2~3:1) improvement compared to the traditional supervised approach.".
- 206 hasAuthorList authorList.
- 206 isPartOf proceedings.
- 206 keyword "Query log".
- 206 keyword "click trail".
- 206 keyword "traffic data mining".
- 206 title "0-Cost Semisupervised Bot Detection for Search Engines".