Conceptual-Model-Based Data Extraction from Multiple-Record Web Pages

Abstract

Electronically available data on the Web is growing at an ever-increasing pace. Much of this data is unstructured, which makes searching hard and traditional database querying impossible. Many Web documents, however, contain an abundance of recognizable constants that together describe the essence of a document's content. For these kinds of data-rich, multiple-record documents (e.g., advertisements, movie reviews, weather reports, travel information, sports summaries, financial statements, and obituaries), we can apply a conceptual-modeling approach to extract and structure data automatically. The approach is based on an ontology (a conceptual-model instance) that describes the data of interest, including relationships, lexical appearance, and context keywords. By parsing the ontology, we can automatically produce a database scheme and recognizers for constants and keywords, and then invoke routines that recognize and extract data from unstructured documents and structure it according to the generated database scheme. Experiments show that good recall and precision ratios are achievable for documents that are rich in recognizable constants and narrow in ontological breadth. Our approach is less labor-intensive than approaches that generate wrappers manually or semiautomatically, and it is generally insensitive to changes in Web-page format.
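The pipeline described in the abstract (ontology in, database scheme and recognizers out, then extraction) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the toy car-advertisement ontology, its regular expressions, the function names, and the use of blank lines as record boundaries. Context keywords and relationships between object sets are omitted to keep the example short.

```python
import re

# Hypothetical toy ontology: each object set maps to a regular expression
# describing the lexical appearance of its constants (assumed, not the paper's).
ONTOLOGY = {
    "Year": r"\b(?:19|20)\d{2}\b",
    "Price": r"\$\d{1,3}(?:,\d{3})*",
    "Phone": r"\b\d{3}-\d{4}\b",
}


def scheme(ontology, table="Ad"):
    """Derive a one-table database scheme from the ontology's object sets."""
    columns = ", ".join(f"{name} TEXT" for name in ontology)
    return f"CREATE TABLE {table} ({columns})"


def compile_recognizers(ontology):
    """Turn each lexical description into a compiled constant recognizer."""
    return {name: re.compile(pattern) for name, pattern in ontology.items()}


def extract(record, recognizers):
    """Return the first constant recognized for each object set, or None."""
    row = {}
    for name, recognizer in recognizers.items():
        match = recognizer.search(record)
        row[name] = match.group(0) if match else None
    return row


def extract_all(document, recognizers, separator="\n\n"):
    """Split a multiple-record document and extract one row per record.

    Assumes records are separated by blank lines; real pages need a
    record-boundary discovery step instead.
    """
    records = [r for r in document.split(separator) if r.strip()]
    return [extract(r, recognizers) for r in records]


if __name__ == "__main__":
    page = (
        "1995 Honda Civic, asking $4,500. Call 555-1234.\n\n"
        "1998 Ford Taurus, low miles, $7,200. Call 555-9876."
    )
    print(scheme(ONTOLOGY))
    for row in extract_all(page, compile_recognizers(ONTOLOGY)):
        print(row)
```

Because the recognizers depend only on the ontology and not on page markup, the same sketch would apply unchanged to a differently formatted page in the same domain, which is the property the abstract contrasts with wrapper-based approaches.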
