Effects of sample size on classifier design: quadratic and neural network classifiers

Classifier design is one of the important steps in the development of computer-aided diagnosis (CAD) programs. In this study, we performed simulation studies to evaluate the dependence of classifier performance on the design sample size, the feature space dimensionality, and the classifier complexity. The performance of a classifier was quantified by the area (Az) under the receiver operating characteristic (ROC) curve. Two types of non-linear classifiers, quadratic discriminants and backpropagation neural networks, were examined, and their performances were compared to those of linear discriminant classifiers under similar input conditions. A feature space with multivariate normal distributions for the two classes of feature vectors was assumed. A finite sample (Nt) of the normal and abnormal classes was randomly drawn from the populations. A modified cross-validation resampling scheme was used to design the classifiers. The available sample set was randomly partitioned into a training set and a test set; a classifier was trained with the design samples, and its performance was evaluated both by the resubstitution technique and by testing with the independent test set. For a finite design sample size, the classifier performance was found to be biased optimistically by resubstitution and pessimistically by testing with the independent set. When the design sample set is sufficiently large, the Az-versus-1/Nt relationship is approximately linear. The range of Nt in which the linear approximation holds depends on the classifier, the dimensionality of the feature space, and the feature distributions. We analyzed the Az-versus-1/Nt relationship under a variety of input conditions. The study provides useful information for the design of classifiers in the development of CAD algorithms and in other classification problems.
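The resubstitution and independent-test estimates described above can be illustrated with a minimal simulation sketch. It is not the study's actual protocol; everything specific in it is an assumption made for illustration: the two classes are drawn from identity-covariance normal distributions that differ only in the mean of the first feature (`delta`), the classifier is a plain quadratic discriminant, the test set is a fresh draw from the populations rather than the held-out part of a partitioned sample, and Nt is taken to be the total number of design samples. The function names (`auc`, `fit_quadratic`, `mean_az`) and all parameter values are hypothetical.

```python
import numpy as np


def auc(neg_scores, pos_scores):
    """Az estimated as the Mann-Whitney statistic (ties count one half)."""
    diff = pos_scores[:, None] - neg_scores[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size


def fit_quadratic(x_neg, x_pos):
    """Fit a Gaussian to each class; score with the log-likelihood ratio."""
    fits = []
    for x in (x_neg, x_pos):
        cov = np.cov(x, rowvar=False)
        fits.append((x.mean(axis=0), np.linalg.inv(cov), np.linalg.slogdet(cov)[1]))

    def score(x):
        logp = []
        for mean, inv_cov, logdet in fits:
            d = x - mean
            # Gaussian log-density up to a constant shared by both classes
            logp.append(-0.5 * np.einsum("ij,jk,ik->i", d, inv_cov, d) - 0.5 * logdet)
        return logp[1] - logp[0]

    return score


def mean_az(n_per_class, dim=4, delta=1.5, n_trials=50, n_test=500, seed=0):
    """Average resubstitution and independent-test Az over repeated draws."""
    rng = np.random.default_rng(seed)
    shift = np.zeros(dim)
    shift[0] = delta  # assumption: classes differ only in the first feature's mean
    resub, test = [], []
    for _ in range(n_trials):
        tr_neg = rng.standard_normal((n_per_class, dim))
        tr_pos = rng.standard_normal((n_per_class, dim)) + shift
        te_neg = rng.standard_normal((n_test, dim))
        te_pos = rng.standard_normal((n_test, dim)) + shift
        score = fit_quadratic(tr_neg, tr_pos)
        resub.append(auc(score(tr_neg), score(tr_pos)))  # design samples reused
        test.append(auc(score(te_neg), score(te_pos)))   # independent samples
    return float(np.mean(resub)), float(np.mean(test))


sizes = np.array([20, 40, 80, 160])  # design samples per class
results = np.array([mean_az(n) for n in sizes])
inv_nt = 1.0 / (2 * sizes)           # assumption: Nt is the total design sample size
for name, column in (("resubstitution", 0), ("independent test", 1)):
    slope, intercept = np.polyfit(inv_nt, results[:, column], 1)
    print(f"{name}: slope={slope:.2f}, Az extrapolated to 1/Nt -> 0: {intercept:.3f}")
```

In a sketch like this, the straight-line fit of mean Az against 1/Nt is only meaningful over the range of Nt where the relationship is close to linear; which sizes fall in that range would have to be checked for each classifier, dimensionality, and pair of distributions.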