A Conceptual-Modeling Approach to Extracting Data from the Web

Electronically available data on the Web is growing at an ever-increasing pace. Much of this data is unstructured, which makes searching hard and traditional database querying impossible. Many Web documents, however, contain an abundance of recognizable constants that together describe the essence of a document’s content. For these kinds of data-rich documents (e.g., advertisements, movie reviews, weather reports, travel information, sports summaries, financial statements, obituaries, and many others), we can apply a conceptual-modeling approach to extract and structure data. The approach is based on an ontology (a conceptual model instance) that describes the data of interest, including relationships, lexical appearance, and context keywords. By parsing the ontology, we can automatically produce a database scheme and recognizers for constants and keywords. We can then invoke routines that recognize and extract data from unstructured documents and structure the data according to the generated database scheme. Experiments show that it is possible to achieve good recall and precision ratios for documents that are rich in recognizable constants and narrow in ontological breadth.
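The pipeline described above (parse an ontology, derive a database scheme and recognizers, then extract and structure data) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy ontology for car advertisements, the field names, the regular expressions, the sample advertisement, and the helper functions generate_scheme, build_recognizers, and extract are all hypothetical. Regular expressions stand in for the constant and keyword recognizers; the actual system may work quite differently.

```python
import re

# Hypothetical toy ontology (an assumption, not the paper's): each field
# gives the lexical appearance of its constants as a regular expression,
# plus optional context keywords that must follow the constant.
ONTOLOGY = {
    "entity": "CarAd",
    "fields": {
        "year": {"pattern": r"\b(?:19|20)\d{2}\b", "keywords": []},
        "price": {"pattern": r"\$\d{1,3}(?:,\d{3})*", "keywords": []},
        "mileage": {
            "pattern": r"\d{1,3}(?:,\d{3})*",
            "keywords": ["miles", "mileage"],
        },
    },
}


def generate_scheme(ontology):
    """Derive a one-table relational scheme from the ontology's fields."""
    columns = ", ".join(f"{name} TEXT" for name in ontology["fields"])
    return f"CREATE TABLE {ontology['entity']} ({columns});"


def build_recognizers(ontology):
    """Compile one recognizer per field from its pattern and keywords."""
    recognizers = {}
    for name, spec in ontology["fields"].items():
        regex = f"(?P<value>{spec['pattern']})"
        if spec["keywords"]:
            # Accept the constant only when a context keyword follows it.
            context = "|".join(re.escape(k) for k in spec["keywords"])
            regex += rf"\s*(?:{context})"
        recognizers[name] = re.compile(regex, re.IGNORECASE)
    return recognizers


def extract(text, recognizers):
    """Return the first recognized constant for each field, or None."""
    record = {}
    for name, recognizer in recognizers.items():
        match = recognizer.search(text)
        record[name] = match.group("value") if match else None
    return record


# Hypothetical unstructured document, invented for this sketch.
ad = "1995 Honda Accord, 62,000 miles, runs great. $4,500 obo."
print(generate_scheme(ONTOLOGY))
print(extract(ad, build_recognizers(ONTOLOGY)))
```

The sketch keeps the ontology as the single source for both the scheme and the recognizers, so adding a field changes both together. It handles only one record per document and ignores relationships between object sets, which the approach described above also models.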
