The safe use of synthetic data in classification

When is it safe to use synthetic data in supervised classification? In supervised classification, classifiers are designed fully automatically by learning from a set of training samples labeled with their true classes, and such large, representative training sets are difficult and costly to acquire. One way to alleviate this problem is to enlarge training sets by generating artificial, synthetic samples. This immediately raises many questions, perhaps the first being "Why should we trust artificially generated data to represent the real distributions accurately?" Another is "When will training on synthetic data work as well as, or better than, training on real data?" We distinguish among sample space (the set of all real samples), parameter or generator space (the samples that can be generated synthetically), and feature space (samples described by numerical feature values). Synthetic data are produced in parameter space by varying the parameters that control their generation, and we are interested in how generator space and feature space relate to one another. Specifically, we explore the feasibility of varying the generating parameters of typefaces in Knuth's Metafont system to see whether previously unseen fonts can also be recognized. More generally, we attempt to formalize a reliable methodology for the generation and use of synthetic data in supervised classification. We have systematically designed and carried out a family of experiments in which widely used pure typefaces are supplemented with synthetic typefaces interpolated between them in generator (parameter) space in the Metafont system. We also vary image quality widely using a parameterized image defect generator.
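The following is a minimal Python sketch of the two generation steps the abstract describes: interpolating between typefaces in parameter (generator) space, and then degrading image quality with a parameterized defect model. Everything here is a hypothetical stand-in: `font_a`, `font_b`, the three-number parameter vectors, and the pixel-flip `degrade` function are illustrations only, not the actual Metafont parameters or the defect generator used in the experiments.

```python
# Toy sketch of parameter-space interpolation plus parameterized
# degradation; all names and parameter choices here are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical generator-space coordinates for two pure typefaces,
# e.g. (stem width, x-height, slant); real Metafont parameter sets
# are much richer than three numbers.
font_a = np.array([0.8, 5.0, 0.00])
font_b = np.array([1.4, 5.6, 0.25])

def interpolate(p, q, steps):
    """Linearly interpolate between two parameter vectors (endpoints
    included); each intermediate vector defines a synthetic typeface."""
    return [(1 - t) * p + t * q for t in np.linspace(0.0, 1.0, steps)]

def degrade(image, noise_prob):
    """Stand-in for a parameterized image defect generator: flip each
    binary pixel independently with probability noise_prob."""
    flips = rng.random(image.shape) < noise_prob
    return np.where(flips, 1 - image, image)

# Build a small synthetic training set: render each interpolated
# typeface (rendering is elided; a real pipeline would invoke
# Metafont) and degrade copies at several image-quality settings.
for params in interpolate(font_a, font_b, steps=5):
    glyph = rng.integers(0, 2, size=(16, 16))  # placeholder for a rendered glyph
    for noise_prob in (0.0, 0.02, 0.05):
        sample = degrade(glyph, noise_prob)
        # each (sample, true character class) pair would be added to
        # the training set here
```

Linear interpolation is only the simplest concretization of "interpolated in generator or parameter space"; the point of interpolating between existing pure typefaces, rather than sampling parameters freely, is that the resulting synthetic fonts remain plausible members of the family spanned by the real ones.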
