Improving Term Weighting for Community Question Answering Search Using Syntactic Analysis

Query term weighting is a fundamental task in information retrieval, and most popular term weighting schemes are based primarily on statistical analysis of term occurrences within the document collection. In this work we study how term weighting may benefit from syntactic analysis of the corpus. Focusing on community question answering (CQA) sites, we treat the syntactic function of terms within CQA texts as an important factor affecting their relative importance for retrieval. We analyze a large log of web queries that landed on the Yahoo Answers site and find that the tendency of a document word to appear in a landing (click-through) query varies strongly with the word's syntactic function. Based on this observation, we propose a novel term weighting method that uses the syntactic information available for each query term occurrence in the document, on top of term occurrence statistics. The relative importance of each feature is learned via a learning-to-rank algorithm that utilizes a click-through query log. We examine the new weighting scheme using manual evaluation based on editorial data and automatic evaluation over the query log. Our experimental results show consistent improvement in retrieval when syntactic information is taken into account.
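As a rough illustration of the idea summarized above, the following minimal sketch scales an occurrence statistic (a smoothed IDF) by a weight tied to the syntactic role of each matching term occurrence. It is not the paper's actual model: the `Token` type, the role labels, the `ROLE_WEIGHT` values, and the `score` function are all hypothetical, and the role weights are set by hand here, whereas the method described learns the relative importance of features from click-through data.

```python
import math
from collections import namedtuple

# Hypothetical representation: each document token carries a syntactic role label.
Token = namedtuple("Token", ["term", "role"])

# Hypothetical per-role weights. In the described method these would be learned
# by a learning-to-rank algorithm from a click-through log, not fixed by hand.
ROLE_WEIGHT = {"nsubj": 1.5, "dobj": 1.3, "root": 1.2, "other": 1.0}


def idf(term, doc_freq, num_docs):
    """Smoothed inverse document frequency, a standard occurrence statistic."""
    return math.log(1.0 + num_docs / (1.0 + doc_freq.get(term, 0)))


def score(query_terms, doc_tokens, doc_freq, num_docs):
    """Sum IDF over matching occurrences, scaled by the occurrence's role weight."""
    total = 0.0
    for tok in doc_tokens:
        if tok.term in query_terms:
            role_w = ROLE_WEIGHT.get(tok.role, ROLE_WEIGHT["other"])
            total += idf(tok.term, doc_freq, num_docs) * role_w
    return total


# Toy usage: the same query term appears as a subject in one document
# and in an unmarked role in the other.
doc_freq = {"laptop": 10}
doc_a = [Token("laptop", "nsubj"), Token("overheats", "root")]
doc_b = [Token("fix", "root"), Token("laptop", "other")]
score_a = score({"laptop"}, doc_a, doc_freq, num_docs=1000)
score_b = score({"laptop"}, doc_b, doc_freq, num_docs=1000)
```

With these assumed weights, `doc_a` ranks above `doc_b` for the query term even though both contain it exactly once, which is the kind of distinction a purely statistical weighting cannot make.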
