INDREX: in-database distributional relation extraction

Relation extraction transforms the textual representation of a relationship into the relational model of a data warehouse. Early systems, such as SystemT by IBM or the open-source system GATE, solve this task with handcrafted rule sets that the system executes document by document. The user must therefore follow a highly interactive and iterative process of reading a document, expressing rules, testing these rules on the next document, and refining them. Until now, these systems have leveraged neither the full potential of built-in declarative query languages nor the indexing and query optimization techniques of a modern RDBMS, which would enable interactive rule refinement across documents and on the entire corpus. We propose the INDREX system, which for the first time enables a user to describe corpus-wide extraction tasks in a declarative language and to run interactive rule refinement queries. To enable this functionality, we extend a standard PostgreSQL installation with a set of white-box user-defined functions that enable corpus-wide transformations from sentences into relationships. We store the text corpus and rules in the same RDBMS that already holds domain-specific structured data. As a result, (1) the user can leverage this data to further adapt rules to the target domain, (2) the user does not need an additional system for rule extraction, and (3) the INDREX system can leverage the full power of the built-in indexing and query optimization techniques of the underlying RDBMS. In a preliminary study, we report on the feasibility of this disruptive approach and show multiple queries in INDREX on the Reuters Corpus, Volume 1.
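
The idea of expressing an extraction rule as a declarative query over the whole corpus, and then refining it with domain data held in the same database, can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the tables `sentence`, `span` and `company`, their columns, the sample sentences, and the function names are hypothetical and are not the INDREX schema or its user-defined functions. Python's built-in `sqlite3` module with an in-memory database stands in for PostgreSQL.

```python
import sqlite3

# Hypothetical schema: sentences, annotated spans, and a domain table.
SCHEMA = """
CREATE TABLE sentence (doc_id INTEGER, sent_id INTEGER, body TEXT);
CREATE TABLE span (doc_id INTEGER, sent_id INTEGER,
                   begin_pos INTEGER, end_pos INTEGER,
                   label TEXT, surface TEXT);
CREATE TABLE company (name TEXT PRIMARY KEY);
"""

# Rule: ORG, then a given verb, then ORG, all within the same sentence.
RULE = """
SELECT a.surface, v.surface, b.surface
FROM span a
JOIN span v ON v.doc_id = a.doc_id AND v.sent_id = a.sent_id
JOIN span b ON b.doc_id = a.doc_id AND b.sent_id = a.sent_id
{extra_join}
WHERE a.label = 'ORG' AND v.label = 'VERB' AND b.label = 'ORG'
  AND v.surface = ?
  AND a.end_pos <= v.begin_pos AND v.end_pos <= b.begin_pos
ORDER BY a.doc_id, a.sent_id, a.begin_pos
"""


def load_demo_corpus(conn):
    """Create the hypothetical tables and insert three made-up sentences."""
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO sentence VALUES (?, ?, ?)", [
        (1, 1, "Acme acquired Globex."),
        (2, 1, "Initech acquired Hooli last year."),
        (3, 1, "Globex praised Acme."),
    ])
    conn.executemany("INSERT INTO span VALUES (?, ?, ?, ?, ?, ?)", [
        (1, 1, 0, 4, "ORG", "Acme"),
        (1, 1, 5, 13, "VERB", "acquired"),
        (1, 1, 14, 20, "ORG", "Globex"),
        (2, 1, 0, 7, "ORG", "Initech"),
        (2, 1, 8, 16, "VERB", "acquired"),
        (2, 1, 17, 22, "ORG", "Hooli"),
        (3, 1, 0, 6, "ORG", "Globex"),
        (3, 1, 7, 14, "VERB", "praised"),
        (3, 1, 15, 19, "ORG", "Acme"),
    ])
    # Domain data stored next to the text, used to refine the rule.
    conn.executemany("INSERT INTO company VALUES (?)",
                     [("Acme",), ("Globex",)])


def extract(conn, verb, known_subjects_only=False):
    """Run the rule over every document in one query.

    With known_subjects_only=True, the subject must also appear in the
    domain table `company`, which narrows the rule to the target domain.
    """
    extra = "JOIN company c ON c.name = a.surface" if known_subjects_only else ""
    sql = RULE.format(extra_join=extra)
    return conn.execute(sql, (verb,)).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_demo_corpus(conn)
    print(extract(conn, "acquired"))
    print(extract(conn, "acquired", known_subjects_only=True))
```

The first call evaluates the rule across all documents at once rather than document by document; the second call shows one way a rule might be refined by joining against structured data in the same database. In a real RDBMS, indexes on the span positions and labels would let the query optimizer plan these joins.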
