Utilizing sub-topical structure of documents for information retrieval

Text segmentation in natural language processing typically refers to the process of decomposing a document into its constituent subtopics. Our work centers on the application of text segmentation techniques to information retrieval (IR) tasks. Examples include scoring a document by combining the retrieval scores of its constituent segments, exploiting the proximity of query terms in documents for ad-hoc search, and question answering (QA), where retrieved passages from multiple documents are aggregated and presented to a searcher as a single document. Feedback in the ad-hoc IR task is shown to benefit from using sentences extracted from the pseudo-relevant documents, rather than individual terms, for query expansion. Retrieval effectiveness for the patent prior art search task is enhanced by applying text segmentation to the patent queries. Another aspect of our work involves augmenting text segmentation techniques to produce segments that are more readable and contain fewer unresolved anaphora. This is particularly useful for QA and snippet generation tasks, where the objective is, on the one hand, to aggregate relevant and novel information from multiple documents that satisfies the user's information need and, on the other, to ensure that the automatically generated content presented to the user is easily readable without reference to the original source document.
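
The idea of scoring a document by combining the retrieval scores of its segments can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the method used in this work: segments are fixed-size groups of sentences rather than the output of a real text segmenter, the segment score is a length-normalised count of query term occurrences rather than a proper retrieval model, and the function names are hypothetical.

```python
from collections import Counter


def split_into_segments(sentences, size=2):
    """Group consecutive sentences into fixed-size segments.

    Assumption: a stand-in for a real subtopic segmenter.
    """
    return [sentences[i:i + size] for i in range(0, len(sentences), size)]


def segment_score(query_terms, segment):
    """Length-normalised count of query term occurrences in one segment.

    Assumption: a toy scoring function, not a full retrieval model.
    """
    tokens = " ".join(segment).lower().split()
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    return sum(counts[t] for t in query_terms) / len(tokens)


def document_score(query, sentences, size=2, combine=max):
    """Score a document by combining the scores of its segments."""
    query_terms = query.lower().split()
    segments = split_into_segments(sentences, size)
    scores = [segment_score(query_terms, seg) for seg in segments]
    return combine(scores) if scores else 0.0


if __name__ == "__main__":
    doc = [
        "patent search uses long queries",
        "segments help retrieval",
        "the weather is nice",
        "cats sleep all day",
    ]
    # Best-segment score: the on-topic segment dominates.
    print(document_score("patent retrieval", doc, size=2, combine=max))
    # Treating the whole document as one segment dilutes the score.
    print(document_score("patent retrieval", doc, size=len(doc)))
```

With `combine=max`, a document is ranked by its best-matching segment, so a long document with one focused on-topic passage is not penalised for its off-topic remainder. Passing a different `combine` function, such as an average, changes how segment evidence is aggregated.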
