Declarative generation of synthetic XML data

Synthetic data can be extremely useful for testing and evaluating algorithms, tools, and systems. Most synthetic data generators available today are by-products of individual benchmarking efforts. Typically, they are complex programs in which the specifications of both the structure and the contents of the data are hard‐coded, which makes them difficult to customize for producing synthetic data tailored to specific needs. In this article, we describe ToXgene, a declarative tool for generating realistic XML data for benchmarking and testing purposes. We present our template specification language, which augments XML Schema with probabilistic models that guide the data‐generation process. We discuss the architecture of our current implementation, and we argue for ToXgene's usefulness by discussing experimental results and describing two projects that use our tool. Copyright © 2006 John Wiley & Sons, Ltd.
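To make the idea of a declarative template guided by probabilistic models concrete, the following is a minimal Python sketch. It does not use ToXgene's actual template syntax (which is based on XML Schema). The `TEMPLATE` dictionary, the `occurs` and `text` keys, and the functions `generate` and `generate_document` are all hypothetical names introduced for illustration only. The sketch assumes that occurrence counts are drawn uniformly from a declared range and that leaf content is drawn from simple samplers.

```python
import random
import xml.etree.ElementTree as ET

# Hypothetical template (NOT ToXgene syntax): each node names an element,
# children declare an occurrence range, and leaves declare a content sampler.
TEMPLATE = {
    "name": "catalog",
    "children": [
        {
            "name": "book",
            "occurs": (2, 5),  # number of books, uniform over [2, 5]
            "children": [
                {"name": "title", "occurs": (1, 1),
                 "text": lambda rng: rng.choice(
                     ["XML Basics", "Data Models", "Query Engines"])},
                {"name": "price", "occurs": (1, 1),
                 "text": lambda rng: f"{rng.uniform(5.0, 80.0):.2f}"},
                {"name": "author", "occurs": (1, 3),
                 "text": lambda rng: rng.choice(
                     ["A. Author", "B. Writer", "C. Scribe"])},
            ],
        }
    ],
}


def generate(spec, rng):
    """Build one element from a template node, sampling counts and content."""
    elem = ET.Element(spec["name"])
    if "text" in spec:
        elem.text = spec["text"](rng)
    for child in spec.get("children", []):
        low, high = child.get("occurs", (1, 1))
        # Sample how many times this child occurs, then generate each copy.
        for _ in range(rng.randint(low, high)):
            elem.append(generate(child, rng))
    return elem


def generate_document(template, seed):
    """Return the generated document as a string; the seed makes runs repeatable."""
    rng = random.Random(seed)
    return ET.tostring(generate(template, rng), encoding="unicode")


if __name__ == "__main__":
    print(generate_document(TEMPLATE, seed=42))
```

In this sketch the structure and the distributions live in data, not in the generator code. Changing the shape of the output therefore means editing the template, which is the kind of customization that is hard with hard-coded generators.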
