Mining the web for answers to natural language questions

The Web is becoming one of the largest repositories of information and knowledge. Many large-scale search engines (Google, Fast, Northern Light, etc.) have emerged to help users find information. In this paper, we study how to use these existing search engines effectively to mine the Web and discover the "correct" answers to factual natural language questions. We propose a probabilistic algorithm called QASM (Question Answering using Statistical Models) that learns the best query paraphrase of a natural language question. We validate our approach for both local and Web search engines using questions from the TREC evaluation. We also show how this algorithm can be combined with another algorithm (AnSel) to produce precise answers to natural language questions.
