THE CHALLENGES OF VALIDATING MULTIVARIATE METHODS FOR PATTERN RECOGNITION

Multivariate methods for pattern recognition are increasingly used for data mining of complex biological processes, for example in metabolomics and food science. The flexibility of methods such as Support Vector Machines, Self Organising Maps and Partial Least Squares Discriminant Analysis allows highly sophisticated models to be built, but it brings a corresponding risk of overfitting. Validation is therefore important, and it is essential to distinguish it from optimisation. Data with high variable to sample ratios pose particular challenges. Variable (or feature) selection can be problematic: if done incorrectly, it can inadvertently produce over-optimistic results. Iterative, computationally intensive methods are often needed to generate training sets repeatedly and so even out the effect of outliers and of mislabelled or atypical samples that could unduly influence the training or test sets. Finally, performance criteria can be hard to define, as indicators of success depend in part on what is known about the data in advance, so the primary aim of a method need not be to reduce apparent error rates in test sets. Many available methods appear over-attractive because they give an over-optimistic rather than a realistic view of performance, especially on internal test sets that may not contain the features of future data.
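
The variable selection pitfall described above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not a method taken from this paper: the data are pure random noise with arbitrary class labels, the classifier is a simple nearest-centroid rule, variables are ranked by the absolute difference in class means, and all function names (select_features, predict, loo_accuracy) and data dimensions are hypothetical. Because the data contain no real class structure, any leave-one-out accuracy well above chance is an artefact of the procedure.

```python
import random


def class_means(X, y, j):
    """Mean of variable j within class 0 and within class 1."""
    a = [row[j] for row, lab in zip(X, y) if lab == 0]
    b = [row[j] for row, lab in zip(X, y) if lab == 1]
    return sum(a) / len(a), sum(b) / len(b)


def select_features(X, y, k):
    """Keep the k variables with the largest absolute class-mean difference."""
    scores = []
    for j in range(len(X[0])):
        m0, m1 = class_means(X, y, j)
        scores.append((abs(m0 - m1), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]


def predict(X_train, y_train, x_test, feats):
    """Nearest-centroid prediction using only the selected variables."""
    d0 = d1 = 0.0
    for j in feats:
        m0, m1 = class_means(X_train, y_train, j)
        d0 += (x_test[j] - m0) ** 2
        d1 += (x_test[j] - m1) ** 2
    return 0 if d0 <= d1 else 1


def loo_accuracy(X, y, k, select_inside):
    """Leave-one-out accuracy.

    select_inside=False: variables are chosen once from ALL samples,
    so the left-out sample has already influenced the selection.
    select_inside=True: variables are re-chosen from each training set.
    """
    feats_all = select_features(X, y, k)
    correct = 0
    for i in range(len(X)):
        X_tr = X[:i] + X[i + 1:]
        y_tr = y[:i] + y[i + 1:]
        feats = select_features(X_tr, y_tr, k) if select_inside else feats_all
        if predict(X_tr, y_tr, X[i], feats) == y[i]:
            correct += 1
    return correct / len(X)


# Assumed toy dimensions: many more variables than samples, no real signal.
rng = random.Random(0)
n_samples, n_features, k = 30, 500, 20
X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
y = [i % 2 for i in range(n_samples)]  # arbitrary labels

biased = loo_accuracy(X, y, k, select_inside=False)
honest = loo_accuracy(X, y, k, select_inside=True)
print(f"selection on all samples:       {biased:.2f}")
print(f"selection inside each LOO fold: {honest:.2f}")
```

In this sketch the only difference between the two estimates is whether the left-out sample took part in variable selection. Selecting on all samples first would be expected to give an apparently high accuracy on noise, whereas repeating the selection inside each fold should give a value near chance.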