Integrating Data Warehouses with Web Data: A Survey

This paper surveys the most relevant research on combining Data Warehouse (DW) and Web data. It studies the XML technologies currently used to integrate, store, query and retrieve Web data, and their application to DWs. The paper reviews different distributed DW architectures and the use of XML languages as an integration tool in these systems. It also introduces the problem of dealing with semi-structured data in a DW. It studies Web data repositories, the design of multidimensional databases for XML data sources and the XML extensions of On-Line Analytical Processing techniques. The paper addresses the application of information retrieval technology in a DW to exploit text-rich document collections. The authors hope that the paper will help to discover the main limitations and opportunities offered by the combination of the DW and Web fields, as well as to identify open research lines.
