Efficiently incorporating user feedback into information extraction and integration programs

Many applications increasingly employ information extraction and integration (IE/II) programs to infer structure from unstructured data. Automatic IE/II is inherently imprecise, so such programs often make mistakes and can benefit significantly from user feedback. Today, however, there is no good way for users to provide such feedback or for programs to process it automatically. On finding an IE/II mistake, users typically must alert the development team (e.g., via email or a Web form) and then wait for the team to manually examine the program internals to locate and fix the mistake, which is a slow, error-prone, and frustrating process. In this paper we propose a solution that lets users provide feedback directly and lets IE/II programs process that feedback automatically. In our solution, a developer U uses hlog, a declarative IE/II language, to write an IE/II program P. Next, U writes declarative user feedback rules that specify which parts of P's data (e.g., input, intermediate, or output data) users can edit, and via which user interfaces. The augmented program P is then executed, after which it enters a loop of waiting for and incorporating user feedback. Given user feedback F on a portion of P's data, we show how to automatically propagate F to the rest of P and to seamlessly combine F with prior user feedback. We describe the syntax and semantics of hlog, a baseline execution strategy, and several optimization techniques. Finally, we describe experiments with real-world data that demonstrate the promise of our solution.
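The execute-then-incorporate-feedback loop described above can be illustrated with a minimal Python sketch. It is not hlog and does not reproduce the paper's syntax, semantics, or optimizations. The stage functions (`extract_mentions`, `integrate`), the feedback log format, and the sample documents are all hypothetical, chosen only to show user edits on intermediate data being replayed together with earlier edits and propagated downstream by plain re-execution.

```python
import re

def extract_mentions(docs):
    """Hypothetical extraction stage: find capitalized two-word phrases."""
    return {
        (doc_id, match)
        for doc_id, text in docs.items()
        for match in re.findall(r"[A-Z][a-z]+ [A-Z][a-z]+", text)
    }

def integrate(mentions):
    """Hypothetical integration stage: group document ids by extracted name."""
    result = {}
    for doc_id, name in sorted(mentions):
        result.setdefault(name, []).append(doc_id)
    return result

def apply_feedback(rows, feedback_log):
    """Replay every recorded edit, in order, on the intermediate rows."""
    rows = set(rows)
    for op, row in feedback_log:
        if op == "delete":
            rows.discard(row)
        elif op == "insert":
            rows.add(row)
    return rows

def run(docs, feedback_log):
    """Baseline strategy (an assumption): re-run everything, then re-apply
    all feedback so new edits are combined with prior ones."""
    mentions = apply_feedback(extract_mentions(docs), feedback_log)
    return integrate(mentions)

docs = {
    1: "Talk by Alice Smith in Madison Wisconsin.",
    2: "Alice Smith joins.",
}
feedback_log = []

initial = run(docs, feedback_log)

# A user flags a place name that was wrongly extracted as a person.
feedback_log.append(("delete", (1, "Madison Wisconsin")))
after_first = run(docs, feedback_log)

# A later user adds a missed mention; the earlier deletion still holds.
feedback_log.append(("insert", (2, "Bob Jones")))
after_second = run(docs, feedback_log)
```

Re-running the whole pipeline after each edit is the simplest way to keep downstream data consistent with the feedback. An optimized strategy would aim to avoid recomputing the parts an edit does not affect.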
