Building a generic debugger for information extraction pipelines

Complex information extraction (IE) pipelines are becoming an integral component of most text processing frameworks. We introduce a first system that helps IE users analyze extraction pipeline semantics and operator transformations interactively while debugging. Interactive analysis keeps the debugging effort proportional to the need and lets users focus on the portions of the pipeline under the greatest suspicion. We present a generic debugger for post-execution analysis of any IE pipeline composed of arbitrary types of operators. To this end, we propose an effective provenance model for IE pipelines that captures a variety of operator types, ranging from operators with full specifications to those with none. We have evaluated the proposed algorithms and provenance model on large-scale real-world extraction pipelines.
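To make the idea of a provenance model spanning fully specified and unspecified operators concrete, the following is a minimal illustrative sketch, not the model proposed in the paper. It assumes a linear pipeline in which a specified operator reports exactly which input records each output depends on, while a black-box operator is conservatively treated as if every output depends on every input. All names (`Operator`, `run_with_provenance`, `trace`, the toy operators) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set


@dataclass
class Operator:
    """Hypothetical pipeline operator.

    spec: returns (output, input_indices) pairs, i.e. exact dependencies.
    fn:   returns outputs only; used when no specification is available.
    """
    name: str
    fn: Optional[Callable[[list], list]] = None
    spec: Optional[Callable[[list], list]] = None


def run_with_provenance(operators: List[Operator], source: list):
    """Run operators in order and record, per stage, which input
    indices each output record depends on."""
    records = list(source)
    lineage: List[List[Set[int]]] = []
    for op in operators:
        if op.spec is not None:
            # Specified operator: fine-grained dependencies.
            pairs = op.spec(records)
            outputs = [out for out, _ in pairs]
            deps = [set(d) for _, d in pairs]
        else:
            # Black box (assumption): every output may depend on every input.
            outputs = op.fn(records)
            deps = [set(range(len(records))) for _ in outputs]
        lineage.append(deps)
        records = outputs
    return records, lineage


def trace(lineage: List[List[Set[int]]], output_index: int) -> Set[int]:
    """Walk the lineage backwards from one final output to the
    indices of the source records it may depend on."""
    frontier = {output_index}
    for deps in reversed(lineage):
        frontier = set().union(*(deps[i] for i in frontier))
    return frontier


def extract_employment(sentences):
    """Toy specified operator: one (person, org) tuple per matching sentence."""
    pairs = []
    for i, s in enumerate(sentences):
        if " works at " in s:
            person, org = s.rstrip(".").split(" works at ")
            pairs.append(((person, org), {i}))
    return pairs


def dedupe(records):
    """Toy operator treated as a black box by the pipeline."""
    return sorted(set(records))


source = [
    "Alice works at Acme.",
    "Bob lives in Paris.",
    "Carol works at Initech.",
]
pipeline = [
    Operator("extract", spec=extract_employment),
    Operator("dedupe", fn=dedupe),
]
results, lineage = run_with_provenance(pipeline, source)
print(results)
print(sorted(trace(lineage, 0)))  # [0, 2]
```

In this sketch the specified extractor alone would trace the first result back to sentence 0 only; placing the black-box step after it widens the answer to sentences 0 and 2. This illustrates why a debugger benefits from whatever operator specifications are available: the less is known about an operator, the coarser the provenance it can report.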
