A Bayes-true data generator for evaluation of supervised and unsupervised learning methods

Benchmarking pattern recognition, machine learning and data mining methods commonly relies on real-world data sets. However, real-world data has disadvantages. On the one hand, collecting it can be difficult or impossible for various reasons; on the other hand, real-world variables are hard to control even in the problem domain, and in the feature domain, where most statistical learning methods operate, exercising control is even more difficult and hence rarely attempted. This is at odds with scientific experimentation guidelines, which mandate the use of variables that are as directly controllable and as directly observable as possible. Synthetic data therefore has certain advantages over real-world data sets. In this paper we propose a method that produces synthetic data with guaranteed global and class-specific statistical properties. The method is based on overlapping class densities placed on the corners of a regular k-simplex. The generator can be used for algorithm testing and fair performance evaluation of statistical learning methods. Because of the generator's strong properties, researchers can reproduce each other's experiments by knowing the parameters used, instead of transmitting large data sets.
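The idea of placing class densities on the corners of a regular k-simplex can be illustrated with a minimal sketch. It is not the paper's generator: the abstract does not specify the form of the class densities, so isotropic Gaussians are assumed here, and the function names `simplex_vertices` and `generate` and the parameters `edge`, `sigma` and `seed` are hypothetical. The sketch shows only the geometry (k+1 equidistant class centres) and how a small parameter set, together with the same code, is enough to regenerate a data set.

```python
import math
import random


def simplex_vertices(k, edge=1.0):
    """Return the k+1 vertices of a regular k-simplex embedded in R^(k+1).

    The standard basis vectors are pairwise sqrt(2) apart, so scaling them
    by edge / sqrt(2) gives a pairwise distance of `edge`.
    """
    scale = edge / math.sqrt(2.0)
    return [[scale if i == j else 0.0 for j in range(k + 1)]
            for i in range(k + 1)]


def generate(k, n_per_class, sigma, edge=1.0, seed=0):
    """Draw n_per_class points per class around each simplex vertex.

    Assumption of this sketch: each class density is an isotropic Gaussian
    with standard deviation `sigma`; a larger sigma relative to `edge`
    means more overlap between classes.
    """
    rng = random.Random(seed)  # fixed seed makes the data set reproducible
    points, labels = [], []
    for label, centre in enumerate(simplex_vertices(k, edge)):
        for _ in range(n_per_class):
            points.append([rng.gauss(c, sigma) for c in centre])
            labels.append(label)
    return points, labels


if __name__ == "__main__":
    X, y = generate(k=2, n_per_class=3, sigma=0.3, seed=42)
    for point, label in zip(X, y):
        print(label, [round(v, 3) for v in point])
```

Because every class centre is the same distance from every other, no pair of classes is favoured by the geometry, and the tuple (k, n_per_class, sigma, edge, seed) is all that has to be exchanged to regenerate the same points in this sketch.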
