Instance-based question answering

In recent years, question answering (QA) has grown from simple passage retrieval and information extraction into complex approaches that incorporate deep question and document analysis, reasoning, planning, and sophisticated uses of knowledge resources. Most existing QA systems combine rule-based, knowledge-based, and statistical components and are highly optimized for a particular style of question in a given language. Typical question answering approaches depend on specific ontologies, resources, processing tools, and document sources, and very often rely on expert knowledge and rule-based components. Furthermore, such systems are very difficult to re-train and optimize for different domains and languages, requiring considerable time and human effort.

We present a fully statistical, data-driven, instance-based approach to question answering (IBQA) that learns how to answer new questions from similar training questions and their known correct answers. We represent training questions as points in a multi-dimensional space and cluster them according to different granularity, scatter, and similarity metrics. From each individual cluster we automatically learn an answering strategy for finding answers to questions. When a new question is covered by several clusters, multiple answering strategies are employed simultaneously. The resulting answer confidence combines elements such as each strategy's estimated probability of success, the cluster's similarity to the new question, cluster size, and cluster granularity. The IBQA approach obtains good performance on factoid and definitional questions, comparable to the performance of top systems participating in official question answering evaluations.

Each answering strategy is cluster-specific and consists of an expected answer model, a query content model, and an answer extraction model. The expected answer model is derived from all training questions in its cluster and takes the form of a distribution over all possible answer types. The query content model for document retrieval is constructed using content from queries that are successful on training questions in that cluster. Finally, we train cluster-specific answer extractors on training data and use them to find answers to new questions.

The IBQA approach is not resource-intensive, but can easily be extended to incorporate knowledge resources or rule-based components. Since it does not rely on hand-written rules, expert knowledge, or manually tuned parameters, it is less dependent on a particular language or domain, allowing for fast re-training with minimal human effort. Under limited data, our implementation of an IBQA system achieves good performance, improves with additional training instances, and is easily trainable and adaptable to new types of data. The IBQA approach provides a principled, robust, and easy-to-implement base system that serves as a well-performing platform for further domain-specific adaptation.
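
As a rough illustration of two ideas described above (an expected answer model as a distribution over the answer types seen in a cluster, and an answer confidence that combines a strategy's estimated success probability with cluster similarity and cluster size), the following Python sketch may help. It is not the authors' implementation: the bag-of-words cosine similarity, the simple product used to combine the factors, the omission of cluster granularity, the toy clusters, and all names (`Cluster`, `expected_answer_model`, `cluster_similarity`, `answer_confidence`) are assumptions made for this example.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Cluster:
    """Hypothetical cluster of training questions (illustrative only)."""
    questions: list[str]      # training questions assigned to this cluster
    answer_types: list[str]   # known answer type of each training question
    success_prob: float       # assumed estimate of the strategy's success rate


def bag_of_words(text: str) -> Counter:
    """Represent a question as term counts (an assumed, very simple feature space)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def expected_answer_model(cluster: Cluster) -> dict[str, float]:
    """Relative frequency of each answer type among the cluster's training questions."""
    counts = Counter(cluster.answer_types)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()} if total else {}


def cluster_similarity(cluster: Cluster, question: str) -> float:
    """Similarity between a new question and the cluster's summed term counts."""
    centroid = Counter()
    for q in cluster.questions:
        centroid.update(bag_of_words(q))
    return cosine(centroid, bag_of_words(question))


def answer_confidence(cluster: Cluster, question: str, total_training: int) -> float:
    """Assumed combination: success probability x similarity x relative cluster size."""
    size_weight = len(cluster.questions) / total_training
    return cluster.success_prob * cluster_similarity(cluster, question) * size_weight


# Toy data, invented for this example.
when_cluster = Cluster(["when was mozart born", "when did the war end"], ["DATE", "DATE"], 0.6)
who_cluster = Cluster(["who wrote hamlet", "who invented the telephone"], ["PERSON", "PERSON"], 0.5)
question = "when was einstein born"

for name, cluster in [("when", when_cluster), ("who", who_cluster)]:
    print(name, expected_answer_model(cluster), answer_confidence(cluster, question, 4))
```

In this sketch a new question can be scored against every cluster, so several strategies contribute candidate answers at once, each weighted by its own confidence. A real system would presumably learn how to combine these factors rather than multiply them.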
