Toward best-effort information extraction

Current approaches to develop information extraction (IE) programs have largely focused on producing precise IE results. As such, they suffer from three major limitations. First, it is often difficult to execute partially specified IE programs and obtain meaningful results, thereby producing a long "debug loop". Second, it often takes a long time before we can obtain the first meaningful result (by finishing and running a precise IE program), thereby rendering these approaches impractical for time-sensitive IE applications. Finally, by trying to write precise IE programs we may also waste a significant amount of effort, because an approximate result -- one that can be produced quickly -- may already be satisfactory in many IE settings. To address these limitations, we propose iFlex, an IE approach that relaxes the precise IE requirement to enable best-effort IE. In iFlex, a developer U uses a declarative language to quickly write an initial approximate IE program P with a possible-worlds semantics. Then iFlex evaluates P using an approximate query processor to quickly extract an approximate result. Next, U examines the result, and further refines P if necessary, to obtain increasingly more precise results. To refine P, U can enlist a next-effort assistant, which suggests refinements based on the data and the current version of P. Extensive experiments on real-world domains demonstrate the utility of the iFlex approach.

[1]  Raghu Ramakrishnan,et al.  Building Structured Web Community Portals: A Top-Down, Compositional, and Incremental Approach , 2007, VLDB.

[2]  David A. Ferrucci,et al.  UIMA: an architectural approach to unstructured information processing in the corporate research environment , 2004, Natural Language Engineering.

[3]  Surajit Chaudhuri,et al.  Optimized stratified sampling for approximate query processing , 2007, TODS.

[4]  Jeffrey F. Naughton,et al.  Declarative Information Extraction Using Datalog with Embedded Extraction Predicates , 2007, VLDB.

[5]  AnHai Doan,et al.  Efficient data integration: automation, collaboration, and relaxation , 2007 .

[6]  Jennifer Widom,et al.  Working Models for Uncertain Data , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[7]  J. Birch,et al.  Interactive Data Analysis , 1978 .

[8]  Ronen Feldman,et al.  High-Performance Unsupervised Relation Extraction from Large Corpora , 2006, Sixth International Conference on Data Mining (ICDM'06).

[9]  Sunita Sarawagi,et al.  Scalable Information Extraction and Integration. , 2006 .

[10]  Dan Suciu,et al.  Answering Queries from Statistics and Probabilistic Views , 2005, VLDB.

[11]  Raghu Ramakrishnan,et al.  Managing information extraction: state of the art and research directions , 2006, SIGMOD Conference.

[12]  David W. Embley,et al.  Automatic Creation and Simplified Querying of Semantic Web Content: An Approach Based on Information-Extraction Ontologies , 2006, ASWC.

[13]  Dan Olteanu,et al.  Fast and Simple Relational Processing of Uncertain Data , 2007, 2008 IEEE 24th International Conference on Data Engineering.

[14]  Georg Gottlob,et al.  The Lixto data extraction project: back and forth between theory and practice , 2004, PODS.

[15]  Peter J. Haas,et al.  Interactive data Analysis: The Control Project , 1999, Computer.

[16]  David W. Embley,et al.  Enriching OWL with Instance Recognition Semantics for Automated Semantic Annotation , 2007, ER Workshops.

[17]  Jennifer Widom,et al.  ULDBs: databases with uncertainty and lineage , 2006, VLDB.

[18]  Alon Y. Halevy,et al.  Pay-as-you-go user feedback for dataspace systems , 2008, SIGMOD Conference.

[19]  Rahul Gupta,et al.  Creating probabilistic databases from information extraction models , 2006, VLDB.