Joint unsupervised structure discovery and information extraction

In this paper we present JUDIE (Joint Unsupervised Structure Discovery and Information Extraction), a new method for automatically extracting semi-structured data records that appear as continuous text (e.g., bibliographic citations, postal addresses, and classified ads) with no explicit delimiters between them. While state-of-the-art Information Extraction methods require the user to supply the structure of the data records manually as a training step, JUDIE detects the structure of each individual record being extracted without any user assistance. This is accomplished by a novel Structure Discovery algorithm that, given a sequence of labels representing attributes assigned to potential values, groups these labels into individual records by looking for frequent patterns of label repetition in the sequence. We also show how to integrate this algorithm into the information extraction process by means of successive refinement steps that alternate between information extraction and structure discovery. Through an extensive experimental evaluation with different datasets in distinct domains, we compare JUDIE with state-of-the-art information extraction methods and conclude that, even without any user intervention, it achieves high-quality results on the tasks of discovering the structure of the records and extracting information from them.
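
To make the idea of grouping labels by repetition concrete, the following is a minimal sketch and not the Structure Discovery algorithm of JUDIE itself, whose details are not given in this abstract. It assumes a much simpler heuristic: a new record starts whenever a label already seen in the current record appears again, and the most common resulting label pattern is then reported. The function names `split_on_repeats` and `most_frequent_pattern` and the sample labels are hypothetical, chosen only for illustration.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def split_on_repeats(labels: Sequence[str]) -> List[List[str]]:
    """Group a flat label sequence into records.

    Simplifying assumption (not taken from the paper): a record never
    contains the same label twice, so a repeated label marks the start
    of the next record.
    """
    records: List[List[str]] = []
    current: List[str] = []
    for label in labels:
        if label in current:
            # The label repeats, so close the current record.
            records.append(current)
            current = []
        current.append(label)
    if current:
        records.append(current)
    return records


def most_frequent_pattern(records: List[List[str]]) -> Tuple[Tuple[str, ...], int]:
    """Return the most common label pattern and how often it occurs.

    Expects a non-empty list of records.
    """
    counts = Counter(tuple(record) for record in records)
    return counts.most_common(1)[0]


# Hypothetical label sequence for three unseparated citations.
labels = [
    "author", "title", "year",
    "author", "title", "year",
    "author", "title", "venue", "year",
]
records = split_on_repeats(labels)
pattern, frequency = most_frequent_pattern(records)
```

Under these assumptions the ten labels split into three records, and the pattern `("author", "title", "year")` occurs twice. A real system would also have to cope with mislabeled or missing values, which is where alternating extraction and structure discovery, as described above, would come into play.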
