Using synthetic data safely in classification

When is it safe to use synthetic training data in supervised classification? Trainable classifiers require large, representative training sets of samples labeled with their true class, and acquiring such sets is difficult and costly. One way to alleviate this problem is to enlarge training sets by generating artificial, synthetic samples. This immediately raises several questions, perhaps the first being: why should we trust artificially generated data to represent the real distributions accurately? Another is: when will training on synthetic data work as well as, or better than, training on real data?

We distinguish between sample space (the set of real samples), parameter space (the set of samples that can be generated synthetically, indexed by their generating parameters), and feature space (samples represented as finite vectors of numerical values).

In this paper, we discuss a series of experiments in which we produced synthetic data in parameter space, that is, by convex interpolation among the generating parameters of real samples, and showed that we could amplify real data to produce a classifier as accurate as one trained on real data. Specifically, we explored the feasibility of varying the generating parameters of Knuth's Metafont system to see whether previously unseen fonts could also be recognized, and we varied the parameters of an image quality model. We found that training on interpolated data is for the most part safe, in the sense that it never produced more classification errors than training on real data alone. Furthermore, the classifier trained on interpolated data often improved per-class accuracy.
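
To make the parameter-space amplification concrete, here is a minimal sketch in Python of generating synthetic samples by convex interpolation among generating parameters. The `render` callable is a hypothetical placeholder for whatever maps a parameter vector to a sample (e.g., an invocation of Metafont for font parameters, or of an image-degradation model); the function names and signatures are illustrative assumptions, not the paper's actual implementation.

```python
"""Sketch: amplifying a training set by convex interpolation in parameter space."""
import numpy as np


def interpolate_parameters(p_a: np.ndarray, p_b: np.ndarray, lam: float) -> np.ndarray:
    """Convex combination of two generating-parameter vectors.

    p_a, p_b: parameter vectors of two real samples from the same class
    lam: interpolation weight in [0, 1]
    """
    assert 0.0 <= lam <= 1.0
    return lam * p_a + (1.0 - lam) * p_b


def amplify(params: list, n_synthetic: int, render, rng=None) -> list:
    """Enlarge a training set by rendering samples at interpolated parameters.

    params: generating-parameter vectors recovered for real samples of one class
    render: hypothetical callable mapping a parameter vector to a sample
            (assumption; stands in for Metafont or an image quality model)
    """
    rng = rng or np.random.default_rng()
    synthetic = []
    for _ in range(n_synthetic):
        # Pick two distinct real samples and interpolate between their parameters.
        i, j = rng.choice(len(params), size=2, replace=False)
        lam = rng.uniform()
        synthetic.append(render(interpolate_parameters(params[i], params[j], lam)))
    return synthetic
```

Because each synthetic parameter vector is a convex combination of real ones, the generated samples stay inside the convex hull of the observed parameters, which is one intuition for why interpolation proved safe in these experiments while unconstrained extrapolation might not be.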