Information extraction as a filtering task

Information extraction is usually approached as an annotation task: input texts run through the analysis steps of an extraction process, in which different semantic concepts are annotated and matched against the slots of templates. We argue that such an approach lacks efficient control over the input of the analysis steps. In this paper, we therefore propose and evaluate a model and a formal approach that consistently put the filtering view at the center: before spending annotation effort, keep only those portions of the input texts that may contain relevant information for filling a template, and discard the others. We model all dependencies between the semantic concepts sought with a truth maintenance system, which then efficiently infers the portions of text to be annotated in each analysis step. The filtering view enables an information extraction system (1) to annotate only relevant portions of input texts and (2) to easily trade run-time efficiency against recall. We provide our approach as an open-source extension of Apache UIMA, and we show its potential in a number of experiments.
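
The filtering view can be illustrated with a minimal sketch. It is not the truth maintenance system described above and does not use Apache UIMA. It assumes a purely conjunctive template whose slots must all be filled within one sentence, and it uses hypothetical regex-based annotators (`ANNOTATORS`) and a hypothetical function `extract`. Under these assumptions, a sentence that lacks any required concept can be discarded before the later analysis steps run on it.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical toy annotators (assumption): each returns the matches of one
# semantic concept within a single sentence.
ANNOTATORS: Dict[str, Callable[[str], List[str]]] = {
    "Time": lambda s: re.findall(r"\b(?:19|20)\d{2}\b", s),
    "Money": lambda s: re.findall(r"\$\d+(?:\.\d+)?", s),
    "Organization": lambda s: re.findall(r"\b[A-Z][a-z]+ (?:Inc|Corp)\b", s),
}


def extract(
    sentences: List[str], schedule: List[str]
) -> Tuple[List[Dict[str, List[str]]], int]:
    """Run the annotators in `schedule` order, but only on sentences that
    still contain every concept required so far (conjunctive template)."""
    relevant = list(range(len(sentences)))
    found: Dict[int, Dict[str, List[str]]] = {i: {} for i in relevant}
    calls = 0  # number of annotator invocations, as a rough cost measure
    for concept in schedule:
        still_relevant = []
        for i in relevant:
            calls += 1
            matches = ANNOTATORS[concept](sentences[i])
            if matches:
                found[i][concept] = matches
                still_relevant.append(i)
        # Sentences without this concept cannot fill the template: drop them.
        relevant = still_relevant
    return [found[i] for i in relevant], calls


if __name__ == "__main__":
    texts = [
        "Acme Inc earned $5 in 2010.",
        "The weather was nice in 2011.",
        "Nothing to report here.",
    ]
    templates, cost = extract(texts, ["Money", "Time", "Organization"])
    print(templates)
    print("annotator calls:", cost, "of", len(texts) * 3, "without filtering")
```

In this sketch, the order of the schedule affects only the number of annotator calls, not the filled templates, because a sentence survives only if every concept is found in it. Enlarging the unit that is kept (for example, paragraphs instead of sentences) would retain more candidate text at a higher annotation cost, which is one simple way to picture the trade-off between efficiency and recall.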
