Uncertainty management in rule-based information extraction systems

Rule-based information extraction is a process by which structured objects are extracted from text based on user-defined rules. The compositional nature of rule-based information extraction also allows rules to be expressed over previously extracted objects. Such extraction is inherently uncertain, due to the varying precision associated with the rules used in a specific extraction task. Quantifying this uncertainty is crucial for querying the extracted objects in probabilistic databases, and for improving the recall of extraction tasks that use compositional rules. In this paper, we provide a probabilistic framework for handling the uncertainty in rule-based information extraction. Specifically, for each extraction task, we build a parametric exponential model of uncertainty that captures the interaction between the different rules, as well as the compositional nature of the rules; the exponential form of our model follows from maximum-entropy considerations. We also give model-decomposition techniques that make the learning algorithms scalable to large numbers of rules and constraints. Experiments over multiple real-world extraction tasks confirm that our approach yields accurate probability estimates with only a small performance overhead. Moreover, our framework supports incremental pay-as-you-go improvements in the accuracy of probability estimates as new rules, data, or constraints are added.
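
To make the maximum-entropy idea concrete, the following is a minimal sketch, not the model from the paper. It assumes a toy setting with two hypothetical rules, `r1` and `r2`, whose precisions (0.9 and 0.6) are made-up values, and a binary label `y` marking whether a candidate is a correct extraction. Each precision is encoded as a zero-expectation constraint, and the exponential-form distribution is fitted by plain gradient ascent on the max-entropy dual. All function names and parameters are illustrative assumptions.

```python
import math
from itertools import product

# Hypothetical setup: two rules r1, r2 may fire on a candidate, and y marks
# whether the candidate is a correct extraction. All names are illustrative.
OUTCOMES = list(product([0, 1], repeat=3))  # each outcome is (r1, r2, y)


def precision_feature(rule_index, precision):
    """Feature whose zero expectation encodes P(y=1 | rule fired) = precision."""
    return lambda x: x[rule_index] * (x[2] - precision)


def distribution(theta, features):
    """Exponential model: p(x) proportional to exp(sum_i theta_i * f_i(x))."""
    weights = [
        math.exp(sum(t * f(x) for t, f in zip(theta, features)))
        for x in OUTCOMES
    ]
    z = sum(weights)
    return [w / z for w in weights]


def fit_maxent(features, lr=0.5, steps=5000):
    """Gradient ascent on the max-entropy dual; every target expectation is 0."""
    theta = [0.0] * len(features)
    for _ in range(steps):
        probs = distribution(theta, features)
        expected = [
            sum(p * f(x) for p, x in zip(probs, OUTCOMES)) for f in features
        ]
        # Dual gradient is (target - expected) = -expected, since targets are 0.
        theta = [t - lr * e for t, e in zip(theta, expected)]
    return theta


def conditional_correct(probs, condition):
    """P(y = 1 | condition(x)) under the fitted distribution."""
    mass = sum(p for p, x in zip(probs, OUTCOMES) if condition(x))
    hit = sum(p for p, x in zip(probs, OUTCOMES) if condition(x) and x[2] == 1)
    return hit / mass


# Assumed (made-up) rule precisions: 0.9 for r1 and 0.6 for r2.
features = [precision_feature(0, 0.9), precision_feature(1, 0.6)]
theta = fit_maxent(features)
probs = distribution(theta, features)

p_r1 = conditional_correct(probs, lambda x: x[0] == 1)
p_r2 = conditional_correct(probs, lambda x: x[1] == 1)
p_both = conditional_correct(probs, lambda x: x[0] == 1 and x[1] == 1)
print(f"P(correct | r1 fired)      = {p_r1:.3f}")
print(f"P(correct | r2 fired)      = {p_r2:.3f}")
print(f"P(correct | r1 and r2 fired) = {p_both:.3f}")
```

The fitted distribution reproduces the two assumed precisions and, as a by-product, assigns a probability to candidates matched by both rules, which is the kind of rule interaction the abstract refers to. Enumerating all outcomes is only feasible for a handful of rules; that cost is what the model-decomposition techniques mentioned above are meant to address.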
