Query classification using Wikipedia

Identifying the intended topic that underlies a user's query can benefit a wide range of applications, from search engines to question-answering systems. However, query classification remains difficult because of the variety of queries a user can ask, the wide range of topics users can ask about, and the limited amount of information that can be mined from the query itself. In this paper, we develop a new query classification system that accounts for these three challenges. Our system relies on the freely available online encyclopedia Wikipedia as a natural-language knowledge base, and exploits Wikipedia's structure to infer the correct classification of any given query. We present two variants of this query classification system and demonstrate their reliability relative to each other and to benchmarks from the literature, using the query sets from the KDD CUP 2005 and TREC 2007 competitions.
