Retrieval using document structure and annotations

Successful retrieval of information from text collections requires effective use of the information present in a collection. The structure of documents and the relationships among elements, both within a document and across documents, carry important information about the meaning of those elements. For example, the words in the title of a web page may contain important clues about that page's content, and the text of a link to the page may also be an important indicator of its content. Researchers have long recognized that structure can be an important indicator of relevance. Yet the majority of prior work is limited to experiments on small test collections and is evaluated on a single retrieval task. These limitations hamper the generality of the conclusions. The recent construction of large and diverse test collections gives us the opportunity to reconsider the general task of retrieval in collections with structure. This dissertation draws on three retrieval tasks to identify important properties of retrieval systems that support the use of structure and annotations: known-item finding of web pages, retrieval of elements from XML articles, and retrieval of answer-bearing sentences as a component of a question-answering system. The retrieval model, an adaptation of the Inference Network model, clarifies the query language and simplifies the process of smoothing using multiple representations. The experiments in this dissertation show state-of-the-art results for these tasks and also provide novel insights into the shape of the parameter space when using mixtures of language models. Our experiments with question answering further show how semantic predicates automatically annotated on a collection can be used to improve a system's ability to retrieve answer-bearing sentences.
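
To make the phrase "mixtures of language models" over multiple representations concrete, the following is a minimal sketch, not the dissertation's actual model. It assumes each document exposes named representations (for example title and body), estimates a maximum-likelihood unigram model for each, and linearly interpolates them with a collection model for smoothing. The function names, the representation names, and the weights are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def mle(tokens):
    """Maximum-likelihood unigram model: term -> relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()} if total else {}


def mixture_log_likelihood(query, representations, collection,
                           weights, collection_weight):
    """Log-likelihood of a query under a linear mixture of language models.

    representations: dict of representation name -> token list (hypothetical
        names such as "title" or "body").
    weights: dict of representation name -> mixture weight.
    collection_weight: weight of the collection model used for smoothing.
    All weights together are assumed to sum to 1.
    """
    total_weight = collection_weight + sum(weights.values())
    if not math.isclose(total_weight, 1.0):
        raise ValueError("mixture weights must sum to 1")

    rep_models = {name: mle(toks) for name, toks in representations.items()}
    coll_model = mle(collection)

    score = 0.0
    for term in query:
        # Collection model contributes even when the document lacks the term.
        p = collection_weight * coll_model.get(term, 0.0)
        for name, w in weights.items():
            p += w * rep_models.get(name, {}).get(term, 0.0)
        if p == 0.0:
            # Term unseen in the document and in the collection.
            return float("-inf")
        score += math.log(p)
    return score


# Toy data, invented for this example only.
collection = ["apache", "server", "home", "page", "news", "weather"]
doc_a = {"title": ["apache", "server"],
         "body": ["download", "the", "apache", "server"]}
doc_b = {"title": ["weather", "news"],
         "body": ["today", "the", "weather", "news"]}
weights = {"title": 0.5, "body": 0.3}

score_a = mixture_log_likelihood(["apache", "server"], doc_a, collection, weights, 0.2)
score_b = mixture_log_likelihood(["apache", "server"], doc_b, collection, weights, 0.2)
print(score_a > score_b)
```

In this sketch the mixture weights are the parameters whose space would be explored; raising the title weight rewards documents whose titles match the query, while the collection weight keeps unmatched terms from zeroing out the score.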
