Warehousing complex data from the web

Data warehousing and Online Analytical Processing (OLAP) technologies are increasingly applied to complex data, most of which originate from the web. However, integrating such data into a decision-support process requires representing them in a form that OLAP and/or data mining techniques can process. In this paper, we present a complex data warehousing methodology that uses the eXtensible Markup Language (XML) as a pivot language. Our approach comprises three steps: integrating complex data into an operational data store (ODS) as XML documents; modelling them dimensionally and storing them in an XML data warehouse; and analysing them with combined OLAP and data mining techniques. We also address the crucial issue of performance in XML warehouses.
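
To make the idea of XML as a pivot language concrete, the following minimal sketch wraps heterogeneous items as XML documents and then aggregates a measure along one dimension, in the spirit of an OLAP roll-up. It is an illustration under stated assumptions, not the methodology's actual schema or implementation: the element names (`complex_object`, `attribute`), the helper functions, and the sample data are all hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def to_xml_document(obj_id, kind, attributes):
    """Wrap one heterogeneous item as an XML document.

    The <complex_object>/<attribute> layout is a hypothetical pivot format
    chosen for this sketch, not the schema used in the paper.
    """
    root = ET.Element("complex_object", id=str(obj_id), type=kind)
    for name, value in attributes.items():
        ET.SubElement(root, "attribute", name=name).text = str(value)
    return ET.tostring(root, encoding="unicode")


def roll_up(documents, dimension, measure):
    """Sum a numeric measure per value of a dimension across XML documents."""
    totals = defaultdict(float)
    for doc in documents:
        root = ET.fromstring(doc)
        values = {a.get("name"): a.text for a in root.iter("attribute")}
        # Skip documents that lack either the dimension or the measure.
        if dimension in values and measure in values:
            totals[values[dimension]] += float(values[measure])
    return dict(totals)


# Hypothetical sample data standing in for documents held in the ODS.
ods = [
    to_xml_document(1, "image", {"source": "web", "size_kb": 120}),
    to_xml_document(2, "text", {"source": "web", "size_kb": 8}),
    to_xml_document(3, "image", {"source": "local", "size_kb": 300}),
]

totals = roll_up(ods, "source", "size_kb")
print(totals)
```

Because every item shares the same XML envelope, the aggregation code does not need to know whether a document describes an image or a text, which is the practical benefit of a pivot representation.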
