Improving context-aware query classification via adaptive self-training

Topical classification of user queries is critical for general-purpose web search systems. It is also a challenging task, due to the sparsity of query terms and the lack of labeled queries. On the other hand, search contexts embedded in query sessions and unlabeled queries free on the web have not been fully utilized in most query classification systems. In this work, we leverage these information to improve query classification accuracy. We first incorporate search contexts into our framework using a Conditional Random Field (CRF) model. Discriminative training of CRFs is favored over the traditional maximum likelihood training because of its robustness to noise. We then adapt self-training with our model to exploit the information in unlabeled queries. By investigating different confidence measurements and model selection strategies, we effectively avoid the error-reinforcing nature of self-training. In extensive experiments on real search logs, we have averaged around 20% improvement in classification accuracy over other state-of-the-art baselines.

[1]  Andrew McCallum,et al.  An Introduction to Conditional Random Fields for Relational Learning , 2007 .

[2]  W. Bruce Croft The Role of Context and Adaptation in User Interfaces , 1984, Int. J. Man Mach. Stud..

[3]  Noah A. Smith,et al.  Softmax-Margin CRFs: Training Log-Linear Models with Cost Functions , 2010, NAACL.

[4]  Tong Zhang,et al.  The Value of Unlabeled Data for Classification Problems , 2000, ICML 2000.

[5]  Sanna Talja,et al.  The production of context in information seeking research: a metatheoretical view , 1999, Inf. Process. Manag..

[6]  Carl-Erik W. Sundberg,et al.  List Viterbi decoding algorithms with applications , 1994, IEEE Trans. Commun..

[7]  Vladimir Vapnik,et al.  Statistical learning theory , 1998 .

[8]  Lawrence K. Saul,et al.  Large Margin Hidden Markov Models for Automatic Speech Recognition , 2006, NIPS.

[9]  Avrim Blum,et al.  The Bottleneck , 2021, Monopsony Capitalism.

[10]  Monika Henzinger,et al.  Analysis of a very large web search engine query log , 1999, SIGF.

[11]  Ophir Frieder,et al.  Improving automatic query classification via semi-supervised learning , 2005, Fifth IEEE International Conference on Data Mining (ICDM'05).

[12]  David Yarowsky,et al.  Unsupervised Word Sense Disambiguation Rivaling Supervised Methods , 1995, ACL.

[13]  Fernando Pereira,et al.  Shallow Parsing with Conditional Random Fields , 2003, NAACL.

[14]  Ben Taskar,et al.  An Introduction to Conditional Random Fields for Relational Learning , 2007 .

[15]  James E. Pitkow,et al.  Characterizing Browsing Strategies in the World-Wide Web , 1995, Comput. Networks ISDN Syst..

[16]  Amanda Spink,et al.  Defining a session on Web search engines , 2007, J. Assoc. Inf. Sci. Technol..

[17]  Xiao Li,et al.  Learning query intent from regularized click graphs , 2008, SIGIR '08.

[18]  Dale Schuurmans,et al.  Semi-Supervised Conditional Random Fields for Improved Sequence Segmentation and Labeling , 2006, ACL.

[19]  Yixin Chen,et al.  Gradient-Based Feature Selection for Conditional Random Fields and its Applications in Computational Genetics , 2009, 2009 21st IEEE International Conference on Tools with Artificial Intelligence.

[20]  Ben Taskar,et al.  Max-Margin Markov Networks , 2003, NIPS.

[21]  Rosie Jones,et al.  Beyond the session timeout: automatic hierarchical segmentation of search topics in query logs , 2008, CIKM '08.

[22]  Wei-Ying Ma,et al.  Probabilistic query expansion using query logs , 2002, WWW '02.

[23]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[24]  Xiaojin Zhu,et al.  --1 CONTENTS , 2006 .

[25]  Thorsten Joachims,et al.  Learning to classify text using support vector machines - methods, theory and algorithms , 2002, The Kluwer international series in engineering and computer science.

[26]  E PitkowJames,et al.  Characterizing browsing strategies in the World-Wide Web , 1995 .

[27]  Mikhail Belkin,et al.  Regularization and Semi-supervised Learning on Large Graphs , 2004, COLT.

[28]  Enhong Chen,et al.  Context-aware query classification , 2009, SIGIR.

[29]  Thomas Hofmann,et al.  Support vector machine learning for interdependent and structured output spaces , 2004, ICML.

[30]  Gideon S. Mann,et al.  Simple, robust, scalable semi-supervised learning via expectation regularization , 2007, ICML '07.

[31]  Alexander Zien,et al.  Semi-Supervised Learning , 2006 .

[32]  Ayse Goker Context learning in Okapi , 1997 .

[33]  Amanda Spink,et al.  Defining a session on Web search engines: Research Articles , 2007 .

[34]  James E. Pitkow,et al.  Characterizing Browsing Behaviors on the World-Wide Web , 1995 .

[35]  Haym Hirsh,et al.  Improving Short-Text Classification using Unlabeled Data for Classification Problems , 2000, ICML.