Data generator based on RBF network

In many problems the available data are scarce and expensive. We propose a generator of semi-artificial data with properties similar to those of the original data, which enables the development and testing of different data mining algorithms and the optimization of their parameters. The generated data allow large-scale experimentation and simulations without danger of overfitting. The proposed generator is based on RBF networks, which learn sets of Gaussian kernels. The learned Gaussian kernels can be used in a generative mode to generate data from the same distributions. To assess the quality of the generated data, we developed several workflows and used them to evaluate the statistical properties of the generated data, as well as their structural similarity and predictive similarity, using supervised and unsupervised learning techniques. To determine the usability of the proposed generator, we conducted a large-scale evaluation using 51 UCI data sets. The results show a considerable similarity between the original and generated data and indicate that the method can be useful in several development and simulation scenarios.
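The generative use of Gaussian kernels described above can be illustrated with a minimal sketch. It is not the paper's implementation: instead of training an RBF network, the hypothetical helper `fit_kernels` simply fits one axis-aligned Gaussian kernel per class (per-feature mean and standard deviation), and the hypothetical helper `generate` picks kernels in proportion to their class share and samples each feature independently. All names, the toy data, and the one-kernel-per-class simplification are assumptions made for illustration.

```python
import random
import statistics


def fit_kernels(rows, labels):
    """Stand-in for RBF training (assumption): one Gaussian kernel per class."""
    kernels = []
    for label in sorted(set(labels)):
        members = [r for r, l in zip(rows, labels) if l == label]
        columns = list(zip(*members))  # transpose to per-feature columns
        kernels.append({
            "label": label,
            "center": [statistics.fmean(c) for c in columns],
            "width": [statistics.pstdev(c) for c in columns],
            "weight": len(members) / len(rows),  # class share in the data
        })
    return kernels


def generate(kernels, n, rng):
    """Sample n labelled instances from the kernels in generative mode."""
    weights = [k["weight"] for k in kernels]
    synthetic = []
    for k in rng.choices(kernels, weights=weights, k=n):
        # Each feature is drawn independently around the kernel center.
        values = [rng.gauss(m, s) for m, s in zip(k["center"], k["width"])]
        synthetic.append((values, k["label"]))
    return synthetic


# Toy "original" data: two classes centred near 0 and near 5 (assumed).
rng = random.Random(0)
rows = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
rows += [[rng.gauss(5, 1), rng.gauss(5, 1)] for _ in range(50)]
labels = [0] * 50 + [1] * 50

kernels = fit_kernels(rows, labels)
synthetic = generate(kernels, 1000, rng)
```

A trained RBF network would typically place several kernels per class, so this sketch only captures the sampling step, not the quality of the learned kernels.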
