XBeGene: Scalable XML Documents Generator by Example Based on Real Data

XML datasets of various sizes and properties are needed to evaluate the correctness and efficiency of XML-based algorithms and applications. While several downloadable datasets can be found online, these are predefined by system experts and might not be suitable for evaluating every algorithm. Tools for generating synthetic XML documents offer an alternative solution, promoting flexibility and adaptability in generating synthetic document collections. Nonetheless, the usefulness of existing XML generators remains rather limited due to the restricted levels of expressiveness allowed to users. In this paper, we develop a novel XML By example Generator (XBeGene) for producing synthetic XML data which closely reflect the user’s requirements. Inspired by the query-by-example paradigm in information retrieval, our generator system i) allows the user to provide her own sample XML documents as input, ii) analyzes the structure, occurrence frequencies, and content distributions for each XML element in the user input documents, and iii) produces synthetic XML documents which closely conform, in both structural and content features, to the user’s input data. The size of each synthetic document, as well as that of the entire document collection, is also specified by the user. Clustering experiments demonstrate high correlation levels between the specified user requirements and the characteristics of the generated XML data, while timing results confirm our approach’s scalability to large-scale document collections.
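The three steps above (take sample documents, analyze per-element structure and content, produce similar documents) can be illustrated with a minimal Python sketch. This is not the XBeGene algorithm, whose details the abstract does not give; it is a simple resampling scheme under stated assumptions. The names `learn_profile`, `generate`, and `max_depth` are hypothetical, the sample documents are invented for illustration, and user-specified size control is left out.

```python
import random
import xml.etree.ElementTree as ET
from collections import defaultdict


def learn_profile(samples):
    """For each tag, record the child-tag sequences and text values seen in the samples."""
    profile = {
        "roots": [],                      # root tags observed across the samples
        "children": defaultdict(list),    # tag -> list of observed child-tag sequences
        "texts": defaultdict(list),       # tag -> list of observed non-empty text values
    }
    for xml_text in samples:
        root = ET.fromstring(xml_text)
        profile["roots"].append(root.tag)
        for elem in root.iter():
            profile["children"][elem.tag].append([child.tag for child in elem])
            text = (elem.text or "").strip()
            if text:
                profile["texts"][elem.tag].append(text)
    return profile


def generate(profile, rng, max_depth=10):
    """Build one synthetic document by resampling the observed structure and content."""
    def build(tag, depth):
        elem = ET.Element(tag)
        if profile["texts"][tag]:
            # Draw content from the empirical distribution of this tag's text values.
            elem.text = rng.choice(profile["texts"][tag])
        if depth < max_depth:
            # Draw one observed child layout; repeated layouts are proportionally likelier.
            for child_tag in rng.choice(profile["children"][tag]):
                elem.append(build(child_tag, depth + 1))
        return elem

    return build(rng.choice(profile["roots"]), 0)


# Hypothetical sample input, standing in for the user's own documents.
samples = [
    "<library><book><title>A</title><author>X</author></book></library>",
    "<library><book><title>B</title></book>"
    "<book><title>C</title><author>Y</author><author>Z</author></book></library>",
]
profile = learn_profile(samples)
doc = generate(profile, random.Random(0))
print(ET.tostring(doc, encoding="unicode"))
```

Because each element's children are drawn from layouts actually observed for that tag, the output only contains parent-child combinations present in the samples. The `max_depth` guard is there because recursive element types could otherwise expand without bound.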
