Document Interrogation: Architecture, Information Extraction and Approximate Answers

We present an architecture for structuring and querying the contents of a set of documents that belong to an organization. The structure is a database that is semi-automatically populated using information extraction techniques. We provide an ontology-based language for interrogating the contents of the documents. Processing a query in this language can yield approximate answers and trigger a mechanism that improves those answers by performing additional information extraction on the textual sources. Individual database items carry quality metadata, which can be used to evaluate the quality of answers. The interaction between information extraction and query processing is a pivotal aspect of this research.
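
The interplay between query processing and extraction described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Item` record, the `answer` function, the confidence threshold, the mean-confidence quality measure, and the `reextract` callback are all hypothetical names and choices introduced here for illustration. The sketch assumes each database item carries a single numeric confidence as its quality metadata, and that a query is a lookup by ontology concept.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Item:
    """A database item with quality metadata (hypothetical schema)."""
    concept: str       # ontology concept the value instantiates
    value: str         # extracted value
    source: str        # document the value was extracted from
    confidence: float  # assumed quality metadata, in [0, 1]


def answer(
    db: List[Item],
    concept: str,
    threshold: float,
    reextract: Callable[[Item], Optional[Item]],
) -> Tuple[List[Item], float]:
    """Answer a concept query and estimate the quality of the answer.

    Items whose confidence is below `threshold` trigger an additional
    extraction pass over their source; a better candidate replaces the
    original item in the returned answer. `db` itself is not modified.
    """
    result: List[Item] = []
    for item in db:
        if item.concept != concept:
            continue
        if item.confidence < threshold:
            candidate = reextract(item)
            if candidate is not None and candidate.confidence > item.confidence:
                item = candidate
        result.append(item)
    # Assumed quality measure: mean confidence of the returned items.
    quality = sum(i.confidence for i in result) / len(result) if result else 0.0
    return result, quality


# Toy data: two budget figures of differing reliability and one unrelated item.
db = [
    Item("Project.budget", "120000", "report_a.txt", 0.9),
    Item("Project.budget", "12000", "memo_b.txt", 0.4),
    Item("Project.leader", "Ana", "report_a.txt", 0.8),
]


def deeper_extractor(item: Item) -> Optional[Item]:
    """Stand-in for a costlier extraction pass over the item's source text."""
    if item.source == "memo_b.txt":
        return Item(item.concept, "125000", item.source, 0.7)
    return None


items, quality = answer(db, "Project.budget", 0.6, deeper_extractor)
```

Here the low-confidence value from `memo_b.txt` is replaced by the re-extracted one, so the answer remains approximate but its estimated quality rises. A real system would presumably use richer quality metadata and a proper query language rather than a concept lookup.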
