Fact distribution in Information Extraction

Several recent Information Extraction (IE) systems have been restricted to identifying facts that are described within a single sentence. It is not clear what effect this restriction has on the difficulty of the extraction task, or how the performance of systems that consider only single sentences should be compared with that of systems that consider multiple sentences. This paper compares three IE evaluation corpora from the Message Understanding Conferences and finds that a significant proportion of the facts mentioned therein are not described within a single sentence. Systems evaluated only on facts described within single sentences are therefore tested against a limited portion of the relevant information in the text, which makes their performance difficult to compare with that of other systems. Further analysis demonstrates that anaphora resolution and world knowledge are required to combine information described across multiple sentences. This result has implications for the development and evaluation of IE systems.
