Synthetic data generation: theory, techniques and applications

The need for synthetically generated data is growing rapidly as enterprise applications increase in size. Situations requiring such data include regression testing of database applications, data mining applications, and supplying "realistic but not real" data for third-party application development. The common approach to meeting this need today is to manually create special-purpose data generators for specific data sets. This dissertation describes a general-purpose synthetic data generation framework that significantly speeds up the process of describing and generating synthetic data. The framework includes a language called SDDL, which is capable of describing complex data sets, and a generation engine called SDG, which supports parallel data generation. Related theory in the areas of the relational model, E-R diagrams, randomness and data obfuscation is explored. Finally, the power and flexibility of the SDG/SDDL framework are demonstrated by applying it to a collection of applications.