Declarative Cleaning of Inconsistencies in Information Extraction

The population of a predefined relational schema from textual content, commonly known as Information Extraction (IE), is a pervasive task in contemporary computational challenges associated with Big Data. Since the textual content varies widely in nature and structure (from machine logs to informal natural language), it is notoriously difficult to write IE programs that unambiguously extract the sought information. For example, during extraction, an IE program could annotate a substring as both an address and a person name. When this happens, the extracted information is said to be inconsistent, and some way of removing inconsistencies is crucial to compute the final output. Industrial-strength IE systems like GATE and IBM SystemT therefore provide a built-in collection of cleaning operations to remove inconsistencies from extracted relations. These operations, however, are collected in an ad hoc fashion through use cases. Ideally, we would like to allow IE developers to declare their own policies. But existing cleaning operations are defined in an algorithmic way, and hence it is not clear how to extend the built-in operations without requiring low-level coding of internal or external functions. We embark on the establishment of a framework for declarative cleaning of inconsistencies in IE through principles of database theory. Specifically, building upon the formalism of document spanners for IE, we adopt the concept of prioritized repairs, which has been recently proposed as an extension of the traditional database repairs to incorporate priorities among conflicting facts. We show that our framework captures the popular cleaning policies, as well as the POSIX semantics for extraction through regular expressions. We explore the problem of determining whether a cleaning declaration is unambiguous (i.e., always results in a single repair) and whether it increases the expressive power of the extraction language. We give both positive and negative results, some of which are general and some of which apply to policies used in practice.
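
To make the idea of cleaning by prioritized repairs concrete, the following is a minimal Python sketch and not the paper's formal framework. It assumes, purely for illustration, that two extracted spans conflict when they overlap and that a span has priority over any conflicting span it strictly contains (a containment-style policy). The names `Span`, `conflicts`, `preferred`, and `greedy_repair` are hypothetical.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # annotation type, e.g. "Person" or "Address"


def conflicts(a: Span, b: Span) -> bool:
    """Assumed conflict rule: two distinct annotations overlap."""
    return a != b and a.start < b.end and b.start < a.end


def preferred(a: Span, b: Span) -> bool:
    """Assumed priority rule: a strictly contains the conflicting span b."""
    same_interval = (a.start, a.end) == (b.start, b.end)
    return (conflicts(a, b) and not same_interval
            and a.start <= b.start and b.end <= a.end)


def greedy_repair(spans: list[Span]) -> list[Span]:
    """Build one consistent subset by repeatedly keeping an undominated span.

    At each step, pick a span that no remaining span has priority over,
    keep it, and discard everything that conflicts with it.
    """
    remaining = sorted(set(spans))
    repair = []
    while remaining:
        # Strict containment is acyclic, so an undominated span always exists.
        top = next(s for s in remaining
                   if not any(preferred(t, s) for t in remaining))
        repair.append(top)
        remaining = [s for s in remaining
                     if s != top and not conflicts(s, top)]
    return sorted(repair)


extracted = [
    Span(0, 10, "Person"),    # e.g. "John Smith"
    Span(0, 20, "Address"),   # e.g. "John Smith Street 12"
    Span(25, 30, "Person"),   # unrelated, non-overlapping mention
]
cleaned = greedy_repair(extracted)
```

Here the shorter `Person` span is dropped in favor of the `Address` span that contains it, while the non-overlapping span survives. If two spans covered the same interval with different labels, neither would have priority under these assumed rules, and the result would depend on the order in which spans are picked; this is the kind of ambiguity the abstract refers to.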
