Improving classification accuracy using automatically extracted training data

Classification is a core task in knowledge discovery and data mining, and substantial research effort has gone into developing sophisticated classification models. In a parallel thread, recent work from the NLP community suggests that for tasks such as natural language disambiguation, even a simple algorithm can outperform a sophisticated one if it is provided with large quantities of high-quality training data. In those applications, training data occurs naturally in text corpora, and training sets running into billions of words have reportedly been used. We explore how the lessons from the NLP community can be applied to KDD tasks. Specifically, we investigate how to identify data sources that can yield training data at low cost, and we study whether the quantity of automatically extracted training data can compensate for its lower quality. We carry out this investigation for the specific task of inferring whether a search query has commercial intent. We mine toolbar and click logs to extract queries issued to sites that are predominantly commercial (e.g., Amazon) or non-commercial (e.g., Wikipedia). We compare the accuracy obtained using such training data against that obtained using manually labeled training data. Our results show that automatically extracted training data yields large accuracy gains at much lower cost.
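The pipeline the abstract describes can be sketched in two steps: (1) turn raw click-log entries into noisily labeled training data by treating clicks on predominantly commercial sites as positive labels and clicks on predominantly non-commercial sites as negative labels, and (2) train a simple classifier on the result. The sketch below is illustrative only, assuming hypothetical seed site lists and a toy click log; the paper's actual feature set and model are not specified in the abstract.

```python
from collections import Counter, defaultdict
import math

# Hypothetical seed sets; the paper's site lists are assumptions here.
COMMERCIAL_SITES = {"amazon.com", "ebay.com"}
NONCOMMERCIAL_SITES = {"wikipedia.org"}

def auto_label(click_log):
    """Turn (query, clicked_domain) pairs into noisily labeled training data."""
    data = []
    for query, domain in click_log:
        if domain in COMMERCIAL_SITES:
            data.append((query, "commercial"))
        elif domain in NONCOMMERCIAL_SITES:
            data.append((query, "noncommercial"))
        # Queries that clicked elsewhere are discarded as unlabeled.
    return data

class NaiveBayes:
    """Minimal multinomial Naive Bayes over query tokens, with add-one smoothing."""

    def fit(self, data):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter()
        for query, label in data:
            self.class_counts[label] += 1
            self.word_counts[label].update(query.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, query):
        total = sum(self.class_counts.values())
        best, best_lp = None, float("-inf")
        for label, n in self.class_counts.items():
            lp = math.log(n / total)  # class prior
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in query.lower().split():
                lp += math.log((self.word_counts[label][w] + 1) / denom)
            if lp > best_lp:
                best, best_lp = label, lp
        return best

# Toy click log standing in for real toolbar/click-log data.
log = [
    ("buy cheap laptop", "amazon.com"),
    ("history of rome", "wikipedia.org"),
    ("laptop deals", "ebay.com"),
    ("roman empire timeline", "wikipedia.org"),
]
clf = NaiveBayes().fit(auto_label(log))
```

The labels produced this way are noisy (a Wikipedia click does not guarantee non-commercial intent), which is exactly the quality-versus-quantity trade-off the paper studies.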
