An unsupervised method for the extraction of propositional information from text

Recent developments in question-answering systems have demonstrated that approaches based on propositional analysis of source text, in conjunction with formal inference systems, can produce substantive improvements in performance over surface-form approaches [Voorhees, E. M. (2002) in Eleventh Text Retrieval Conference, eds. Voorhees, E. M. & Buckland, L. P., http://trec.nist.gov/pubs/trec11/t11_proceedings.html]. However, such systems are hampered by the need to create broad-coverage knowledge bases by hand, which makes them difficult to adapt to new domains and potentially fragile if critical information is omitted. To demonstrate how this problem might be addressed, the Syntagmatic Paradigmatic model, a memory-based account of sentence processing, is used to autonomously extract propositional knowledge from unannotated text. The model assumes that people store a large number of sentence instances. To interpret a new sentence, the model retrieves similar sentences from memory and aligns them with the new sentence by using String Edit Theory. The resulting set of alignments can be considered an extensional interpretation of the sentence. Extracting propositional information in this way not only permits the model to answer questions for which the relevant facts are explicitly stated in the text but also allows it to take advantage of “inference by coincidence,” where implicit inference occurs as an emergent property of the mechanism. To illustrate the potential of this approach, the model is tested for its ability to determine the winners of tennis matches as reported on the Association of Tennis Professionals web site.
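
The retrieve-and-align step described above can be illustrated with a minimal sketch. It is not the model's actual implementation: it swaps in a plain unit-cost, word-level edit distance for the model's String Edit Theory machinery, and the function names (`align`, `interpret`), the parameter `k`, and the example sentences with their placeholder names are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Set, Tuple

Pair = Tuple[Optional[str], Optional[str]]


def align(a: List[str], b: List[str]) -> Tuple[int, List[Pair]]:
    """Unit-cost word-level edit distance between a and b, plus one optimal alignment.

    Illustrative stand-in for String Edit Theory; the cost scheme is an assumption.
    """
    n, m = len(a), len(b)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i  # delete every word of a
    for j in range(1, m + 1):
        dist[0][j] = j  # insert every word of b
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            change = dist[i - 1][j - 1] + (a[i - 1] != b[j - 1])
            dist[i][j] = min(change, dist[i - 1][j] + 1, dist[i][j - 1] + 1)

    # Trace back from the bottom-right cell to recover aligned word pairs.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            pairs.append((a[i - 1], b[j - 1]))  # match or substitution
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            pairs.append((a[i - 1], None))  # word of a left unaligned
            i -= 1
        else:
            pairs.append((None, b[j - 1]))  # word of b left unaligned
            j -= 1
    pairs.reverse()
    return dist[n][m], pairs


def interpret(new: List[str], memory: List[List[str]], k: int = 2) -> Dict[str, Set[str]]:
    """Align `new` with its k nearest stored sentences.

    Returns, for each word of `new`, the set of different words that occupy
    the same slot in the retrieved sentences.
    """
    scored = sorted((align(new, stored) for stored in memory), key=lambda r: r[0])
    slots: Dict[str, Set[str]] = {}
    for _, pairs in scored[:k]:
        for x, y in pairs:
            if x is not None and y is not None and x != y:
                slots.setdefault(x, set()).add(y)
    return slots


# Hypothetical sentence memory; names and sentences are invented for this sketch.
memory = [s.split() for s in (
    "alice defeated bob",
    "carol defeated dave",
    "the weather was sunny in the afternoon",
)]
result = interpret("erin defeated frank".split(), memory, k=2)
print(result)
```

In this toy setting, the words that end up sharing a slot with `erin` are the subjects of the retrieved "defeated" sentences, which gives a rough sense of how a set of alignments can serve as an extensional interpretation of a sentence.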
