Retrieving similar discussion forum threads: a structure based approach

Online forums are becoming a popular way of finding useful information on the web. Search over existing discussion threads has so far been limited to keyword-based search, which demands minimal effort from users. However, a small number of keywords often cannot capture all the relevant context of a complex query. Example-based search, which retrieves similar discussion threads given one exemplary thread, is an alternative approach that lets the user provide richer context and can vastly improve forum search results. In this paper, we address the problem of finding threads similar to a given thread. To this end, we propose a novel methodology for estimating similarity between discussion threads. Our method exploits the thread structure to decompose each thread into a set of weighted, overlapping components. It then estimates pairwise thread similarity by quantifying how well the information in each thread is contained in the other, using lexical similarities between their underlying components. We compare our proposed methods against state-of-the-art thread retrieval mechanisms on real datasets and show that our techniques outperform them by large margins on popular retrieval evaluation measures such as NDCG, MAP, Precision@k and MRR. In particular, consistent improvements of up to 10% are observed on all evaluation measures.
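
The abstract does not spell out the decomposition or the containment measure, so the following is only a minimal sketch of the general idea under loudly stated assumptions: a thread is taken to be a linear list of post strings, the components are hypothetically chosen as single posts plus adjacent post pairs (so that they overlap), the weights 1.0 and 0.5 are arbitrary placeholders, and Jaccard overlap of token sets stands in for the lexical similarity. The function names `decompose`, `containment`, and `thread_similarity` are illustrative and do not come from the paper.

```python
from typing import List, Set, Tuple

# A component is (weight, set of distinct lowercase tokens).
Component = Tuple[float, Set[str]]


def tokenize(text: str) -> Set[str]:
    """Whitespace tokenization; a real system would also stem and drop stopwords."""
    return set(text.lower().split())


def decompose(posts: List[str]) -> List[Component]:
    """Assumed decomposition: single posts plus adjacent post pairs.

    The pair components overlap with the single-post components.
    Weights (1.0 and 0.5) are placeholders, normalized to sum to 1.
    """
    singles = [(1.0, tokenize(p)) for p in posts]
    pairs = [(0.5, tokenize(a) | tokenize(b)) for a, b in zip(posts, posts[1:])]
    components = singles + pairs
    if not components:
        return []
    total = sum(w for w, _ in components)
    return [(w / total, toks) for w, toks in components]


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two token sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(src: List[Component], dst: List[Component]) -> float:
    """Weighted degree to which each component of src is matched by its best component in dst."""
    return sum(
        w * max((jaccard(toks, other) for _, other in dst), default=0.0)
        for w, toks in src
    )


def thread_similarity(posts_a: List[str], posts_b: List[str]) -> float:
    """Symmetric score: the mean of the two directed containment scores."""
    a, b = decompose(posts_a), decompose(posts_b)
    return 0.5 * (containment(a, b) + containment(b, a))
```

Under these assumptions, a thread compared with itself scores 1, two threads with no shared tokens score 0, and ranking a collection by `thread_similarity` against a query thread gives a simple form of example-based retrieval.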
