Using Object-Grammars for Internet Data Warehousing

Abstract

The increasing amount of information available in the web demands sophisticated querying methods and knowledge discovery techniques. In this study, we introduce our model WIND for a data warehouse over a domain-specific portion of the Internet. The aim of WIND is to provide a partially materialized structured view onto a thematic section of the web, on which database querying can be applied and mining techniques can be developed.

WIND organizes web documents into local repositories with functionalities ranging from OODBMSs to file systems. This allows for a combination of attribute and content-oriented query processing. Special interest is paid to the format specifications of documents, where the notion of format is extended to cover characteristics and constraints that hold on the subject domain. To support conversion between (semi-)structured documents and database objects, we consider a format converter generation technique based on the notion of object-grammars.

Keywords: data warehouse, web, mining, information retrieval, format conversion, grammars
