Search Driven Analysis of Heterogenous XML Data

Analytical processing on XML repositories is usually enabled by designing complex data transformations that shred the documents into a common data warehousing schema. This can be very time-consuming and costly, especially if the underlying XML data varies widely in structure and only a subset of attributes constitutes meaningful dimensions and facts. Today, there is no tool to explore an XML data set, discover interesting attributes, dimensions, and facts, and rapidly prototype an OLAP solution. In this paper, we propose a system, called SEDA, that enables users to start with simple keyword-style querying and interactively refine the query based on result summaries. SEDA then maps query results onto a set of known, or newly created, facts and dimensions, and derives a star schema and its instantiation to be fed into an off-the-shelf OLAP tool for further analysis.
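
The last step described above, mapping query results onto facts and dimensions and instantiating a star schema, can be illustrated with a minimal Python sketch. It is not SEDA's implementation: the sample documents, the tag names, the helper `find_text`, and the function `build_star_schema` are all hypothetical, and the sketch assumes the user has already chosen which tags play the role of each dimension and of the measure.

```python
import xml.etree.ElementTree as ET

# Hypothetical, structurally heterogeneous documents (not from the paper).
DOCS = [
    "<sale><country>US</country><year>2006</year><amount>120.5</amount></sale>",
    "<order><region><country>DE</country></region><year>2006</year><total>80</total></order>",
    "<sale><country>US</country><year>2007</year><amount>40</amount></sale>",
]


def find_text(root, tags):
    """Return the text of the first element (in document order) whose tag is in `tags`."""
    for elem in root.iter():
        if elem.tag in tags and elem.text and elem.text.strip():
            return elem.text.strip()
    return None


def build_star_schema(docs, dimensions, measure_tags):
    """Build dimension tables (member -> surrogate key) and a fact table.

    `dimensions` maps a dimension name to the set of tags that may carry it;
    `measure_tags` is the set of tags that may carry the numeric fact.
    Both mappings are assumed to come from the user's interactive refinement.
    """
    dim_tables = {name: {} for name in dimensions}
    fact_rows = []
    for doc in docs:
        root = ET.fromstring(doc)
        value = find_text(root, measure_tags)
        if value is None:
            continue  # no fact in this document, so it contributes no row
        row = {}
        for name, tags in dimensions.items():
            member = find_text(root, tags)  # None acts as an "unknown" member
            table = dim_tables[name]
            # Reuse the existing surrogate key or assign the next one.
            row[name + "_id"] = table.setdefault(member, len(table) + 1)
        row["measure"] = float(value)
        fact_rows.append(row)
    return dim_tables, fact_rows


if __name__ == "__main__":
    dims, facts = build_star_schema(
        DOCS,
        {"country": {"country"}, "year": {"year"}},
        {"amount", "total"},
    )
    print(dims)
    print(facts)
```

The resulting dimension and fact tables are plain dictionaries and lists; in practice they would be loaded into relational tables for an OLAP tool to consume.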
