Selecting training instances for supervised classification

Several experimental studies have tested the relative merits of various supervised machine learning models. Comparisons have been made along dimensions that include model complexity, prediction accuracy, training set size, and training time. Only limited work has been done to study the effect of training set exemplar typicality on model performance. We present experimental results obtained in testing C4.5, SX-WEB, a backpropagation neural network, and linear discriminant analysis using a real-valued and a mixed form of a medical data set. We generated training sets of highly typical, widely varied, and atypical exemplars for both data sets. We tested the classification accuracy of each model using the generated training sets. Test set accuracy levels ranged between 76% and 86% when each model was trained with typical or varied training sets. The accuracy levels for C4.5, the backpropagation neural network, and discriminant analysis dropped significantly when atypical training sets were used. In contrast, with the exception of one test, SX-WEB was unaffected by training set choice. In a comparison of the correctness of each model, SX-WEB showed the best overall performance. We conclude this paper with directions for future research.