Split-Correctness in Information Extraction

Programs that extract structured information from text, known as information extractors, often operate separately on document segments produced by a generic splitting operation, such as splitting into sentences, paragraphs, k-grams, or HTTP requests. Automatically detecting this behavior of extractors, which we refer to as split-correctness, would allow text analysis systems to devise query plans that evaluate segments in parallel, accelerating the processing of large documents. Other applications include incremental evaluation on dynamic content, where re-evaluation of information extractors can be restricted to revised segments, and debugging, where developers of information extractors are informed when semantic components may cross segment boundaries. We propose a new formal framework for split-correctness within the formalism of document spanners. Our preliminary analysis studies the complexity of split-correctness over regular spanners. We also discuss variants of split-correctness, for instance in the presence of black-box extractors with "split constraints".
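
To make the idea concrete, the following is a minimal Python sketch, not the formal framework of the paper. It assumes a hypothetical sentence splitter (`split_sentences`) and models an extractor as a plain regular expression that returns the spans of its non-overlapping matches, which is a simplification of a regular spanner. The helper `agrees_on` only compares the two evaluation strategies on one given document; split-correctness itself requires agreement on every document, so this check can refute the property but cannot establish it.

```python
import re

def split_sentences(doc):
    """Hypothetical splitter: spans (start, end) of period-terminated segments."""
    return [(m.start(), m.end()) for m in re.finditer(r"[^.]+\.?", doc)]

def extract(pattern, text):
    """Simplified extractor: spans of the non-overlapping regex matches in text."""
    return {(m.start(), m.end()) for m in re.finditer(pattern, text)}

def extract_via_split(pattern, doc, splitter):
    """Run the extractor on each segment and shift spans back to document offsets."""
    spans = set()
    for seg_start, seg_end in splitter(doc):
        for a, b in extract(pattern, doc[seg_start:seg_end]):
            spans.add((seg_start + a, seg_start + b))
    return spans

def agrees_on(pattern, doc, splitter):
    """True if whole-document and per-segment evaluation coincide on this document."""
    return extract(pattern, doc) == extract_via_split(pattern, doc, splitter)

doc = "Call 555-1234. Then call 555-9999."
phone = r"\d{3}-\d{4}"       # matches never cross a sentence boundary here
crossing = r"\.\s+Then"      # matches only across a sentence boundary

print(agrees_on(phone, doc, split_sentences))
print(agrees_on(crossing, doc, split_sentences))
```

In this sketch the phone-number pattern yields the same spans either way, so its segments could be processed independently, whereas the second pattern finds a match only on the whole document, illustrating the kind of boundary crossing that a split-correctness check is meant to detect.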
