A Quantitative Summary of XML Structures

Statistical summaries in relational databases mainly focus on the distribution of data values and have proven useful for applications such as query evaluation and data storage. As XML has become widely used, e.g., for online data exchange, the need for corresponding statistical summaries of XML has become evident. While relational techniques may be applicable to the data values in XML documents, novel techniques are required for summarizing the structures of XML documents. In this paper, we propose metrics for major structural properties of XML documents, in particular nestings of entities and one-to-many relationships. Our technique differs from existing ones in that it generates a quantitative summary of an XML structure. Using our approach, we illustrate that some popular real-world and synthetic XML benchmark datasets are indeed highly skewed, hardly hierarchical, and contain few recursions. We hope this preliminary finding sheds light on improving the design of XML benchmarks and experiments.
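
To make the notion of a quantitative structural summary concrete, the following is a minimal sketch and not the metrics proposed in the paper, which are not defined in this abstract. It assumes three simple stand-in statistics: maximum nesting depth, the largest number of same-tag siblings under one parent (a rough proxy for one-to-many relationships), and the number of elements whose tag also appears among their ancestors (a rough proxy for recursion). The function name `structure_stats` and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def structure_stats(xml_text):
    """Return simple structural statistics for an XML document.

    These statistics are illustrative assumptions, not the paper's metrics.
    """
    root = ET.fromstring(xml_text)
    max_depth = 0
    max_fanout = 0
    recursive_elements = 0

    # Iterative depth-first traversal; each entry carries the element,
    # its depth (root = 1), and the tags on the path from the root to it.
    stack = [(root, 1, (root.tag,))]
    while stack:
        node, depth, path = stack.pop()
        max_depth = max(max_depth, depth)

        # Count children per tag: a tag repeated under one parent
        # suggests a one-to-many relationship.
        per_tag = Counter(child.tag for child in node)
        if per_tag:
            max_fanout = max(max_fanout, max(per_tag.values()))

        for child in node:
            # An element is counted as recursive if its tag already
            # occurs on the path from the root to its parent.
            if child.tag in path:
                recursive_elements += 1
            stack.append((child, depth + 1, path + (child.tag,)))

    return {
        "max_depth": max_depth,
        "max_fanout": max_fanout,
        "recursive_elements": recursive_elements,
    }


# Hypothetical sample documents, used only for illustration.
flat_doc = (
    "<site><people>"
    "<person><name/></person><person><name/></person><person><name/></person>"
    "</people><regions><item/></regions></site>"
)
nested_doc = "<a><b><a/></b></a>"

flat_stats = structure_stats(flat_doc)
nested_stats = structure_stats(nested_doc)
```

Statistics of this kind only hint at the properties named above; a summary in the sense of the paper would aggregate such measurements over a whole dataset.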
