Topic indexing and retrieval for open domain factoid question answering

Factoid Question Answering is an exciting area of Natural Language Engineering that has the potential to replace one major use of search engines today. In this dissertation, I introduce a new method of handling factoid questions whose answers are proper names. The method, Topic Indexing and Retrieval, addresses two issues that prevent current factoid QA systems from realising this potential: they cannot satisfy users' demand for almost immediate answers, and they cannot produce answers based on evidence distributed across a corpus. The first issue arises because the architecture common to QA systems does not scale easily to heavy use, since so much of the work is done online: text retrieved by information retrieval (IR) undergoes expensive and time-consuming answer extraction while the user awaits an answer. If QA systems are to become as heavily used as popular web search engines, this processing bottleneck must be overcome. The second issue, how to make use of evidence distributed across a corpus, is relevant when no single passage in the corpus provides sufficient evidence for an answer to a given question. QA systems commonly look for a text span that contains sufficient evidence to both locate and justify an answer, but this fails for questions that require evidence from more than one passage in the corpus. The Topic Indexing and Retrieval method developed in this thesis addresses both issues for factoid questions with proper name answers by restructuring the corpus so that answers can be retrieved directly using off-the-shelf IR. The method has been evaluated on 377 TREC questions with proper name answers and on 41 questions that require multiple pieces of evidence from different parts of the TREC AQUAINT corpus. In the first evaluation, scores of 0.340 in Accuracy and 0.395 in Mean Reciprocal Rank (MRR) show that Topic Indexing and Retrieval performs well for this type of question.
The second evaluation compares, on the set of 41 multi-evidence questions, a question-factoring baseline that can be used with the standard QA architecture against my Topic Indexing and Retrieval method. The superior performance of the latter (MRR of 0.454 against 0.341) demonstrates its value in answering such questions.
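
For readers unfamiliar with the two reported metrics, the following minimal sketch shows how Accuracy and MRR are conventionally computed from ranked answer lists. It assumes the standard definitions (Accuracy is the fraction of questions whose top-ranked answer is correct; MRR is the mean over questions of 1/rank of the first correct answer, or 0 if none is returned). The function names and the toy data are hypothetical and are not taken from the dissertation.

```python
from typing import AbstractSet, List, Sequence, Tuple


def reciprocal_rank(ranked: Sequence[str], gold: AbstractSet[str]) -> float:
    """Return 1/rank of the first correct answer, or 0.0 if none is correct."""
    for rank, candidate in enumerate(ranked, start=1):
        if candidate in gold:
            return 1.0 / rank
    return 0.0


def accuracy_and_mrr(
    runs: Sequence[Tuple[Sequence[str], AbstractSet[str]]]
) -> Tuple[float, float]:
    """Compute (Accuracy, MRR) over (ranked answers, gold answers) pairs."""
    if not runs:
        raise ValueError("need at least one question")
    # Accuracy counts a question only when the rank-1 answer is correct.
    top1_hits = sum(
        1 for ranked, gold in runs if len(ranked) > 0 and ranked[0] in gold
    )
    rr_total = sum(reciprocal_rank(ranked, gold) for ranked, gold in runs)
    return top1_hits / len(runs), rr_total / len(runs)


# Toy data, invented purely for illustration.
toy_runs: List[Tuple[List[str], AbstractSet[str]]] = [
    (["Paris", "Lyon"], {"Paris"}),        # correct at rank 1
    (["Mozart", "Beethoven"], {"Beethoven"}),  # correct at rank 2
    (["Nile", "Amazon"], {"Danube"}),      # no correct answer returned
]

accuracy, mrr = accuracy_and_mrr(toy_runs)
print(f"{accuracy:.3f} {mrr:.3f}")  # prints "0.333 0.500"
```

Under these definitions MRR can never be lower than Accuracy, because every top-ranked hit contributes a full 1.0 to both sums while lower-ranked hits contribute only to MRR. This is consistent with the 0.340 Accuracy and 0.395 MRR reported for the first evaluation.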
