Machine Learning for Question Answering from Tabular Data

Question Answering (QA) systems automatically answer natural language questions in a human-like manner. One practical approach to open-domain QA consists of extracting facts from free text offline and using a lookup mechanism to answer users' questions online. This approach is related to natural language interfaces to databases (NLIDBs), which were studied extensively from the 1970s to the 1990s. NLIDB systems employed a range of techniques, from simple pattern-matching rules to formal logical calculi such as the lambda calculus, but most were restricted to specific domains. In this paper we describe a machine learning approach to querying tabular data for QA that is not restricted to specific domains. Our approach consists of two steps: for an incoming question, we first use a classifier to identify the appropriate tables and columns in a structured database, and then employ free-text retrieval to look up answers. The system uses part-of-speech tagging, named-entity normalization and a statistical classifier trained on data from the TREC QA task. On the TREC QA data, our system is shown to significantly outperform an existing rule-based table lookup method.
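
The two-step approach described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the system described in the paper: the toy tables, the training questions, the naive Bayes classifier standing in for the statistical classifier, and the token-overlap scoring standing in for free-text retrieval. Part-of-speech tagging and named-entity normalization are omitted.

```python
from collections import Counter
import math
import re

# Hypothetical toy database: table name -> list of rows (column -> value).
TABLES = {
    "birthdates": [
        {"person": "Albert Einstein", "date": "1879-03-14"},
        {"person": "Marie Curie", "date": "1867-11-07"},
    ],
    "capitals": [
        {"country": "France", "city": "Paris"},
        {"country": "Japan", "city": "Tokyo"},
    ],
}

# Hypothetical training questions, each labelled with (table, answer column).
TRAINING = [
    ("when was mozart born", ("birthdates", "date")),
    ("what year was newton born", ("birthdates", "date")),
    ("what is the capital of italy", ("capitals", "city")),
    ("which city is the capital of spain", ("capitals", "city")),
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def train(examples):
    """Collect per-label word counts for a tiny naive Bayes classifier."""
    word_counts, label_counts, vocab = {}, Counter(), set()
    for question, label in examples:
        label_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for tok in tokenize(question):
            counts[tok] += 1
            vocab.add(tok)
    return word_counts, label_counts, vocab


def classify(question, model):
    """Step 1: pick the most probable (table, column) label for the question."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best, best_score = None, -math.inf
    for label, n in label_counts.items():
        counts = word_counts[label]
        denom = sum(counts.values()) + len(vocab)  # add-one smoothing
        score = math.log(n / total)
        for tok in tokenize(question):
            if tok in vocab:  # words unseen in training are ignored
                score += math.log((counts[tok] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best


def lookup(question, table, column):
    """Step 2: return the answer cell of the row whose other cells best match the question."""
    q = set(tokenize(question))

    def overlap(row):
        cells = " ".join(v for k, v in row.items() if k != column)
        return len(q & set(tokenize(cells)))

    best_row = max(TABLES[table], key=overlap)
    return best_row[column] if overlap(best_row) > 0 else None


def answer(question, model):
    """Run both steps: select a table and column, then look up the answer."""
    table, column = classify(question, model)
    return lookup(question, table, column)


model = train(TRAINING)
print(answer("When was Marie Curie born?", model))
print(answer("What is the capital of Japan?", model))
```

In this sketch the classifier never sees the table contents. It only maps question wording to a table and an answer column, and the entity mentioned in the question is matched against the rows in the second step. A question whose entity is absent from the selected table yields `None`.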
