Coupling feature selection and machine learning methods for navigational query identification

Identifying navigational queries in Web search is important yet hard, because Web queries are typically very short and carry little information. In this paper we study several machine learning methods for navigational query identification in Web search, including a naive Bayes model, a maximum entropy model, a support vector machine (SVM), and a stochastic gradient boosting tree (SGBT). To boost the performance of these machine learning techniques, we exploit several feature selection methods and propose coupling feature selection with classification approaches to achieve the best performance. Unlike most prior work, which uses a small number of features, we study the problem of identifying navigational queries with thousands of available features, extracted from major commercial search engine results, Web search user click data, query logs, and the whole Web's relational content. A multi-level feature extraction system is constructed. Our results on real search data show that 1) among all the features we tested, user click distribution features are the most important set of features for identifying navigational queries; 2) to achieve good performance, machine learning approaches have to be coupled with good feature selection methods, and we find that a gradient boosting tree coupled with linear SVM feature selection is the most effective; and 3) with carefully coupled feature selection and classification approaches, navigational queries can be accurately identified with an 88.1% F1 score, which is a 33% error rate reduction compared to the best uncoupled system and a 40% error rate reduction compared to a well-tuned system without feature selection.
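
The coupling described in the abstract, a linear SVM used for feature selection feeding a stochastic gradient boosting classifier, can be illustrated with a minimal sketch. This is not the authors' system: it assumes scikit-learn, uses synthetic data from `make_classification` in place of the query features, and assumes L1-regularized `LinearSVC` weights as the selection criterion (the abstract does not say how the linear SVM ranks features). All parameter values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Synthetic stand-in for query features (assumption: the real features are
# click, query-log, and search-result statistics, which are not available here).
X, y = make_classification(
    n_samples=600, n_features=200, n_informative=10, n_redundant=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Feature selection step: keep features with non-zero weight in an
# L1-regularized linear SVM (assumed selection rule, see lead-in).
selector = SelectFromModel(
    LinearSVC(penalty="l1", dual=False, C=0.5, max_iter=5000, random_state=0)
)

# Classification step: gradient boosting with subsample < 1.0, which is
# what makes the boosting "stochastic".
classifier = GradientBoostingClassifier(
    n_estimators=100, subsample=0.8, random_state=0
)

# Coupling: the classifier only ever sees the features the selector kept.
coupled = Pipeline([("select", selector), ("classify", classifier)])
coupled.fit(X_train, y_train)

n_selected = int(coupled.named_steps["select"].get_support().sum())
f1 = f1_score(y_test, coupled.predict(X_test))
print(f"kept {n_selected} of {X.shape[1]} features, F1 = {f1:.3f}")
```

Swapping the `select` or `classify` step for another estimator gives a different coupling, which is one way to compare selector and classifier combinations on the same split.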
