A Semantic Graph based Topic Model for Question Retrieval in Community Question Answering

Community Question Answering (CQA) services, such as Yahoo! Answers and WikiAnswers, have become one of the central paradigms for satisfying users' information needs. Question retrieval aims to resolve a user's query directly by finding the most relevant questions (together with their answers) in an archive of past questions. However, because each question is short, there is usually a lexical gap between the queried question and the past questions: two questions can ask the same thing while sharing few words. To alleviate this problem, we present a hybrid approach to question retrieval that blends three language modelling techniques: the classic (query-likelihood) language model, the state-of-the-art translation-based language model, and our proposed semantics-based language model. The semantics of each candidate question is given by a probabilistic topic model that uses local and global semantic graphs to capture the hidden interactions among entities (e.g., people, places, and concepts) in question-answer pairs. Experiments on two real-world datasets show that our approach can significantly outperform existing ones.
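As a rough illustration of how three such language models might be blended, the sketch below scores a candidate question by linearly interpolating per-word probabilities from a query-likelihood model, a translation-based model, and a topic-based model. The abstract does not specify the combination rule, so the linear interpolation, the weights, the Jelinek-Mercer smoothing, and all function and variable names here are assumptions made for illustration only. The topic distributions are taken as given inputs; the semantic-graph topic model that would produce them is not reproduced.

```python
import math
from collections import Counter

EPS = 1e-12  # floor to avoid log(0); an illustrative choice, not from the paper


def p_query_likelihood(word, doc, coll, lam=0.2):
    """Jelinek-Mercer smoothed P(word | doc): mix document and collection frequencies."""
    p_doc = doc[word] / sum(doc.values())
    p_coll = coll[word] / sum(coll.values())
    return (1 - lam) * p_doc + lam * p_coll


def p_translation(word, doc, trans):
    """Sum over document terms t of P(word | t) * P_ml(t | doc).

    `trans` is a hypothetical word-to-word translation table keyed by (word, t).
    """
    n = sum(doc.values())
    return sum(trans.get((word, t), 0.0) * c / n for t, c in doc.items())


def p_semantic(word, doc_topics, topic_words):
    """Sum over topics z of P(word | z) * P(z | doc), with both distributions given."""
    return sum(p_z * topic_words[z].get(word, 0.0) for z, p_z in doc_topics.items())


def hybrid_log_score(query, doc, coll, trans, doc_topics, topic_words,
                     weights=(0.5, 0.3, 0.2)):
    """Log-probability of the query under an assumed linear mix of the three models."""
    a, b, c = weights  # hypothetical mixing weights that sum to 1
    score = 0.0
    for w in query:
        p = (a * p_query_likelihood(w, doc, coll)
             + b * p_translation(w, doc, trans)
             + c * p_semantic(w, doc_topics, topic_words))
        score += math.log(max(p, EPS))
    return score


# Toy data: the query shares no words with either archived question.
doc_a = Counter(["repair", "notebook", "screen"])
doc_b = Counter(["bake", "bread", "oven"])
coll = doc_a + doc_b
trans = {("fix", "repair"): 0.6, ("laptop", "notebook"): 0.7}
topic_words = {0: {"fix": 0.2, "laptop": 0.3}, 1: {"bread": 0.5}}
query = ["fix", "laptop"]

score_a = hybrid_log_score(query, doc_a, coll, trans, {0: 0.9, 1: 0.1}, topic_words)
score_b = hybrid_log_score(query, doc_b, coll, trans, {0: 0.1, 1: 0.9}, topic_words)
print(score_a > score_b)
```

In this toy setting the query-likelihood component contributes nothing because of the lexical gap, so the ranking is driven entirely by the translation and topic components, which is the situation the hybrid approach is meant to handle.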
