The dangers of creating false classifications due to noise in electronic nose and similar multivariate analyses

Abstract Randomly generated data with error limits of 1–10%, along with experimental data, were employed to demonstrate the dangers of over-fitting, which creates artificial differentiation. Analysis of variance (ANOVA), principal components analysis (PCA), and discriminant function analysis (DFA) were used for the data analysis. In cases where the ratio of samples to variables (features) falls below six, single-class systems containing only random noise and random groupings can be misclassified into more than a single group when discriminant techniques are employed. The smaller the group size, the more erroneous classifications are made. Larger sample sizes minimize the random noise and allow the true differences to show. A minimum number of variables (features) should be employed when developing classification models to avoid over-fitting the data. The ratio of data points to variables should be at least six to avoid over-fitting classification errors, and the model should be validated using data points not used in generating it.
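The effect described in the abstract can be illustrated with a minimal sketch. It is not the analysis used in this work: a simple nearest-centroid classifier stands in for DFA. The Gaussian noise model, the 5% noise level, the sample and feature counts, and all function names (`make_noise`, `fit_centroids`, `resubstitution_vs_holdout`) are assumptions chosen for illustration. Every sample is drawn from one single class, groups are assigned arbitrarily, and accuracy is measured both on the data used to build the model and on fresh data.

```python
import random


def make_noise(n_samples, n_features, rng, rel_error=0.05):
    """Single-class data: every feature is 1.0 plus Gaussian noise (assumed 5%)."""
    return [[1.0 + rng.gauss(0.0, rel_error) for _ in range(n_features)]
            for _ in range(n_samples)]


def fit_centroids(rows, labels):
    """Mean feature vector of each (arbitrary) group."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return {label: [sum(col) / len(members) for col in zip(*members)]
            for label, members in groups.items()}


def predict(row, centroids):
    """Assign a row to the group with the nearest centroid (squared Euclidean)."""
    return min(centroids,
               key=lambda g: sum((a - b) ** 2 for a, b in zip(row, centroids[g])))


def accuracy(rows, labels, centroids):
    """Fraction of rows assigned to their labelled group."""
    hits = sum(predict(row, centroids) == label
               for row, label in zip(rows, labels))
    return hits / len(labels)


def resubstitution_vs_holdout(n_samples, n_features, n_groups=2,
                              trials=200, seed=0):
    """Average accuracy on the training data and on fresh data.

    The group labels carry no information: all samples come from the
    same noise distribution, so any apparent separation is over-fitting.
    """
    rng = random.Random(seed)
    labels = [i % n_groups for i in range(n_samples)]  # arbitrary grouping
    train_total = holdout_total = 0.0
    for _ in range(trials):
        train = make_noise(n_samples, n_features, rng)
        holdout = make_noise(n_samples, n_features, rng)
        centroids = fit_centroids(train, labels)
        train_total += accuracy(train, labels, centroids)
        holdout_total += accuracy(holdout, labels, centroids)
    return train_total / trials, holdout_total / trials


# Compare several sample-to-variable ratios (illustrative values only).
for n_samples, n_features in [(12, 48), (12, 12), (12, 2), (60, 2)]:
    train_acc, holdout_acc = resubstitution_vs_holdout(n_samples, n_features)
    print(f"samples/variables = {n_samples / n_features:5.2f}  "
          f"training accuracy = {train_acc:.2f}  "
          f"hold-out accuracy = {holdout_acc:.2f}")
```

Because the groups are arbitrary, any training accuracy above chance (0.5 for two groups) reflects fitted noise rather than a real difference, while the hold-out accuracy should stay near chance. The threshold at which this becomes a problem depends on the classifier, so this sketch is not expected to reproduce the ratio of six reported here for DFA.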