Learning effective ranking functions for newsgroup search

Web communities are virtual broadcasting spaces on the Web where people can freely discuss any topic. While such communities function as discussion boards, they have even greater value as large repositories of archived information. To unlock the value of this resource, we need an effective means of searching archived discussion threads. Unfortunately, the techniques that have proven successful for searching document collections and the Web are not well suited to searching archived community discussions. In this paper, we explore the problem of creating an effective ranking function that predicts the messages most relevant to a query in community search. We extract a set of predictive features from the thread trees of newsgroup messages, together with features of message authors and of the lexical distribution within a message thread. Our final results indicate that, when linear regression is used with this feature set, our search system achieves a 28.5% performance improvement over our baseline system.
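As a rough illustration of the approach described above, the following sketch fits a linear regression over per-message features and ranks candidate messages by predicted relevance. It is a minimal sketch, not the system evaluated in the paper: the three feature names (thread depth, author post count, query term overlap), the toy training values, and the small ridge term added for numerical stability are all assumptions made for the example.

```python
from typing import Dict, List, Sequence


def fit_linear(X: Sequence[Sequence[float]], y: Sequence[float],
               ridge: float = 1e-6) -> List[float]:
    """Fit least-squares weights (bias first) by solving the normal equations.

    Solves (X^T X + ridge * I) w = X^T y with Gaussian elimination.
    The ridge term is an assumption added here for numerical stability.
    """
    rows = [[1.0] + list(r) for r in X]  # prepend a bias column
    n = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) + (ridge if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    b = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(n)]

    # Forward elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]

    # Back substitution.
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (b[i] - sum(A[i][j] * w[j] for j in range(i + 1, n))) / A[i][i]
    return w


def score(w: Sequence[float], features: Sequence[float]) -> float:
    """Predicted relevance: bias plus weighted sum of the features."""
    return w[0] + sum(wi * fi for wi, fi in zip(w[1:], features))


def rank(w: Sequence[float], candidates: Dict[str, Sequence[float]]) -> List[str]:
    """Return message ids sorted by predicted relevance, best first."""
    return sorted(candidates, key=lambda mid: score(w, candidates[mid]),
                  reverse=True)


if __name__ == "__main__":
    # Hypothetical features: [thread_depth, author_post_count, query_term_overlap]
    train_X = [[0, 120, 0.9], [3, 5, 0.2], [1, 40, 0.6], [2, 10, 0.1], [0, 2, 0.4]]
    train_y = [1.0, 0.0, 1.0, 0.0, 0.0]  # toy relevance judgments
    weights = fit_linear(train_X, train_y)
    messages = {"msg-a": [0, 80, 0.8], "msg-b": [4, 3, 0.1]}
    print(rank(weights, messages))
```

In practice the relevance labels would come from human judgments on query and message pairs, and the learned weights would be compared against a baseline ranking on held-out queries.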
