High-performance, open-domain question answering from large text collections

Vast amounts of information, covering virtually every topic of interest, are now electronically accessible in the internal text databases of commercial institutions, in encyclopedias and newswire services, and on the unstructured, continuously growing World Wide Web. Situated at the frontier of information retrieval and natural language processing, open-domain question answering is a challenging task that involves extracting brief, relevant answer strings from large text collections in response to users' questions. High-precision question answering requires novel, robust models for capturing the semantics of natural language questions, finding relevant text snippets, and selecting the most relevant answer when several candidates have been identified. The relational representation developed in the thesis encodes lexical, relational and semantic information in an integrated model that applies to both questions and candidate answers. The representation impacts all stages of question answering: question processing, when the category of the expected answers is detected; passage retrieval, when relevant passages are identified in the text collection; and answer extraction, when the actual answers are found. The theoretical contributions of the thesis are reflected in a fully implemented architecture, whose performance was evaluated within the Question Answering track of the DARPA-sponsored Text REtrieval Conference (TREC). The theoretical concepts developed in the thesis are instrumental to the extraction of correct answers in response to a test set of 893 fact-seeking questions from a 3-gigabyte text collection. Experimental results also show important qualitative improvements over the output of Web search engines, and reveal some of the challenges and desired features of next-generation text search technologies.
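
The three stages named above (question processing, passage retrieval, answer extraction) can be illustrated with a toy sketch. It is not the relational representation or the architecture of the thesis: plain keyword overlap and two crude regular expressions stand in for them, and every name in it (STOPWORDS, expected_answer_category, retrieve_passages, extract_answer, answer) is hypothetical and chosen only for illustration.

```python
import re

# Assumption: a tiny hand-picked stopword list, only for this illustration.
STOPWORDS = {"the", "a", "an", "of", "in", "is", "was",
             "who", "when", "where", "what", "did"}

# Assumption: crude surface patterns standing in for real answer-type detection.
ANSWER_PATTERNS = {
    "DATE": r"\b(?:1[0-9]{3}|20[0-9]{2})\b",       # a four-digit year
    "PERSON": r"\b[A-Z][a-z]+ [A-Z][a-z]+\b",      # two capitalized words
}


def keywords(text):
    """Lowercased alphanumeric tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOPWORDS]


def expected_answer_category(question):
    """Question processing: guess the expected answer category from the question word."""
    first_word = question.lower().split()[0]
    return {"who": "PERSON", "when": "DATE"}.get(first_word, "OTHER")


def retrieve_passages(question, passages, top_k=2):
    """Passage retrieval: rank passages by keyword overlap with the question."""
    query = set(keywords(question))
    scored = [(len(query & set(keywords(p))), p) for p in passages]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [p for score, p in ranked if score > 0][:top_k]


def extract_answer(category, passages):
    """Answer extraction: first string matching the category pattern, else None."""
    pattern = ANSWER_PATTERNS.get(category)
    if pattern is None:
        return None
    for passage in passages:
        match = re.search(pattern, passage)
        if match:
            return match.group(0)
    return None


def answer(question, passages):
    """Run the three stages in sequence."""
    category = expected_answer_category(question)
    candidates = retrieve_passages(question, passages)
    return extract_answer(category, candidates)


if __name__ == "__main__":
    collection = [
        "The Eiffel Tower was completed in 1889 in Paris.",
        "Paris is the capital of France.",
        "Gustave Eiffel designed the tower.",
    ]
    print(answer("When was the Eiffel Tower completed?", collection))  # prints 1889
```

In this sketch the stages share only an answer-category label and a ranked passage list, which is a deliberately simplified interface; the thesis instead describes a single representation applied to both questions and candidate answers.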