Multi-document statistical fact extraction and fusion

This dissertation presents original techniques for statistical fact extraction and fusion from multiple documents. Fact extraction, or relationship extraction, is a process in which natural language text is scanned to find instances of a predetermined class of facts (e.g. birthday(x,y)). Statistical fact extractors are trained from examples within a framework in which a set of examples and a target model are used to annotate an automatically collected corpus. This annotation then provides training data for classifiers (Phrase Conditional Likelihood and Naive Bayes) or sequence models (Conditional Random Fields). Fact extractors are used in two information retrieval tasks. In question answering, the set of candidate answers is narrowed using fine-grained proper noun ontological facts (is-a(X, Y)) extracted from a corpus by rote classifiers, leading to higher performance. Extracted facts are also used for name-referent disambiguation, or cross-document coreference, where one personal name may refer to multiple people in the world. The distinguishing biographic facts for each person, such as birthday(x,y) and occupation(x,y), are automatically extracted from plain text, and these facts are used along with other statistical methods to distinguish between mentions of each referent. This dissertation also presents novel techniques for fusion, which integrate facts extracted from multiple sources. For the task of biographic fact extraction, fusion of factual information extracted from multiple documents improves the precision of the resulting information. Further improvements result from cascaded fact extraction, where certain facts are extracted and fused and then used to extract additional information. The technique of cascaded fact extraction and fusion is also applied to time-bounded facts, where a cascade of fact extractors produces a timeline of corporate management succession.
Collectively, this research demonstrates the utility of multi-document fact extraction and fusion. It shows that facts can serve as a building block for deeper text processing, such as finding coreferent names in a series of documents, finding the answers to questions, and constructing a timeline for time-variable facts. The key aspects of the process are training with minimal supervision, high-performance statistical fact extraction, fusion across multiple sources of information, and cascaded extraction.