Declarative Information Extraction in a Probabilistic Database System

Full-text documents represent a large fraction of the world’s data. Although not structured per se, they often contain snippets of structured information: e.g., names, addresses, and document titles. Information Extraction (IE) techniques identify such structured information in text. In recent years, database research has pursued IE on two fronts: declarative languages and systems for managing IE tasks, and IE as an uncertain data source for Probabilistic Databases. It is natural to consider merging these two directions, but efforts to do so have had to compromise on the statistical robustness of IE algorithms in order to fit with early Probabilistic Database models. In this paper, we bridge the gap between these ideas by implementing a state-of-the-art statistical IE approach, Conditional Random Fields (CRFs), in the setting of Probabilistic Databases that treat statistical models as first-class data objects. Using standard relational tables to capture CRF parameters, and inverted-file representations of text, we show that the Viterbi algorithm for CRF inference can be specified declaratively in recursive SQL, in a manner that can both choose likely segmentations and provide detailed marginal distributions for label assignment. Given this implementation, we propose query processing optimizations that effectively combine probabilistic inference and relational operators such as selections and joins. In an experimental study with two data sets, we demonstrate the efficiency of our in-database Viterbi implementation in PostgreSQL relative to an open-source CRF library, and show the performance benefits of our optimizations.
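
To make the Viterbi recurrence concrete, the following is a minimal Python sketch of max-score decoding for a linear-chain model. It is not the recursive SQL formulation described above: the function name `viterbi`, the dictionaries `emit` and `trans` (standing in for relational tables of CRF weights), and all weights and tokens are assumptions invented for illustration. Missing table entries are treated as weight 0.0.

```python
def viterbi(tokens, labels, emit, trans):
    """Return (best_score, best_labels) for a linear-chain model.

    emit[(token, label)] and trans[(prev_label, label)] hold log-space
    weights, mimicking rows of two hypothetical factor tables.
    """
    if not tokens:
        return 0.0, []

    # scores[i][y]: best score of any labeling of tokens[:i+1] ending in y.
    scores = [{y: emit.get((tokens[0], y), 0.0) for y in labels}]
    back = []  # back[i][y]: best previous label when tokens[i+1] gets label y

    for tok in tokens[1:]:
        prev_scores = scores[-1]
        cur, ptrs = {}, {}
        for y in labels:
            # Pick the predecessor label that maximizes the running score.
            best_prev = max(
                labels, key=lambda p: prev_scores[p] + trans.get((p, y), 0.0)
            )
            cur[y] = (
                prev_scores[best_prev]
                + trans.get((best_prev, y), 0.0)
                + emit.get((tok, y), 0.0)
            )
            ptrs[y] = best_prev
        scores.append(cur)
        back.append(ptrs)

    # Follow the back-pointers from the best final label.
    last = max(labels, key=lambda y: scores[-1][y])
    path = [last]
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    path.reverse()
    return scores[-1][last], path


# Toy data; every weight here is made up for the example.
labels = ["NAME", "CITY"]
emit = {
    ("Daisy", "NAME"): 2.0,
    ("Wang", "NAME"): 1.0,
    ("Wang", "CITY"): 0.5,
    ("Berkeley", "CITY"): 2.0,
}
trans = {
    ("NAME", "NAME"): 0.5,
    ("NAME", "CITY"): 0.2,
    ("CITY", "CITY"): 0.5,
    ("CITY", "NAME"): -1.0,
}
best_score, best_path = viterbi(["Daisy", "Wang", "Berkeley"], labels, emit, trans)
print(best_path)
```

Each step of the loop depends only on the previous position's scores, which is the property that lets the same computation be phrased as a recursive query over a table of per-position, per-label scores.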
