Preliminary approach on synthetic data sets generation based on class separability measure

Classifier performance is usually evaluated on real-world problems, most of them drawn from public repositories. However, the inherent properties of these data, and how they affect classifier behavior, are largely unknown. In addition, the high cost or difficulty of experiments hinders data collection, leading to complex data sets characterized by few instances, missing values, and imprecise data. Generating synthetic data sets addresses both issues: it lets us build problems at low cost and with predefined characteristics, which is useful for testing the limitations of a system in a controlled framework. This paper proposes generating synthetic data sets based on data complexity. We rely on the length of the class boundary to build the data sets, obtaining a preliminary set of benchmarks for assessing classifier accuracy. The study can be developed further to identify regions of competence for classifiers.
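The abstract does not state how the length of the class boundary is computed. As an illustration only, the sketch below assumes one common definition from the data complexity literature: build a minimum spanning tree over all instances and report the fraction of instances incident to an edge that joins two different classes. The function name `boundary_length`, the Euclidean distance, and the use of Prim's algorithm are assumptions of this sketch, not details taken from the paper.

```python
import math


def boundary_length(points, labels):
    """Fraction of points incident to an MST edge joining two classes.

    Hypothetical helper: `points` is a list of coordinate tuples and
    `labels` the matching list of class labels.
    """
    n = len(points)
    if n < 2:
        return 0.0

    in_tree = [False] * n
    best_dist = [math.inf] * n   # cheapest known link into the tree
    parent = [-1] * n            # tree node providing that link
    best_dist[0] = 0.0
    on_boundary = set()

    # Prim's algorithm on the complete Euclidean graph, O(n^2).
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]),
                key=best_dist.__getitem__)
        in_tree[u] = True
        # The edge (parent[u], u) has just been added to the tree.
        if parent[u] != -1 and labels[u] != labels[parent[u]]:
            on_boundary.update((u, parent[u]))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    parent[v] = u

    return len(on_boundary) / n
```

Under this definition, values near 0 indicate well-separated classes and values near 1 indicate heavily interleaved classes, so a generator could in principle accept or reject candidate data sets according to a target value.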
