An Efficient Data Partitioning to Improve Classification Performance While Keeping Parameters Interpretable

Supervised machine learning methods typically require splitting data into separate chunks for training, validating, and finally testing classifiers. To find the best parameters of a classifier, training and validation are usually carried out with cross-validation. The classifier with optimized parameters is then applied to a separate test set to estimate its generalization performance. With limited data, setting aside test data creates a difficult trade-off: a larger test set gives more statistical power for estimating generalization performance, whereas a larger training set allows better parameters to be chosen and a better model to be fitted. We propose a novel approach, which we term “cross-validation and cross-testing”, that improves this trade-off by re-using test data without biasing classifier performance. The approach is validated using simulated data and electrophysiological recordings in humans and rodents. The results demonstrate that it has a higher probability of discovering significant results than the standard approach of cross-validation and testing, while maintaining the nominal alpha level. In contrast to nested cross-validation, which is maximally efficient in re-using data, the proposed approach additionally maintains the interpretability of individual parameters. Taken together, we suggest an addition to currently used machine learning approaches that may be particularly useful in cases where model weights do not require interpretation, but parameters do.
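The abstract does not spell out the procedure, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that a parameter is first chosen by cross-validation on one part of the data, and that the held-out part is then re-used fold by fold: each test fold is scored by a model trained on everything else, with the chosen parameter kept fixed. The toy data, the k-nearest-neighbour classifier, and all function names (`make_data`, `select_k`, `cross_test`, and so on) are hypothetical choices made for illustration.

```python
import random

def make_data(n, rng):
    """Two 2-D Gaussian classes with shifted means (toy data, not from the paper)."""
    data = []
    for i in range(n):
        label = i % 2
        shift = 1.0 if label else -1.0
        data.append(((rng.gauss(shift, 1.5), rng.gauss(shift, 1.5)), label))
    rng.shuffle(data)
    return data

def knn_predict(train, x, k):
    """Majority vote among the k nearest training points (squared Euclidean distance)."""
    nearest = sorted(train, key=lambda s: (s[0][0] - x[0]) ** 2 + (s[0][1] - x[1]) ** 2)[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0

def accuracy(train, test, k):
    """Fraction of test samples that k-NN, fitted on train, labels correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)

def folds(n, n_folds):
    """Split indices 0..n-1 into n_folds disjoint, near-equal folds."""
    return [list(range(i, n, n_folds)) for i in range(n_folds)]

def cv_accuracy(data, k, n_folds):
    """Mean held-out accuracy of k-NN over n_folds cross-validation folds."""
    scores = []
    for fold in folds(len(data), n_folds):
        held = set(fold)
        train = [s for i, s in enumerate(data) if i not in held]
        test = [data[i] for i in fold]
        scores.append(accuracy(train, test, k))
    return sum(scores) / len(scores)

def select_k(data, candidates, n_folds):
    """Parameter selection: pick the candidate k with the best cross-validated accuracy."""
    return max(candidates, key=lambda k: cv_accuracy(data, k, n_folds))

def cross_test(selection_set, test_set, k, n_folds):
    """Assumed cross-testing step: k stays fixed; each test fold is scored by a
    model trained on the selection set plus the remaining test folds."""
    hits = 0
    for fold in folds(len(test_set), n_folds):
        held = set(fold)
        extra = [s for i, s in enumerate(test_set) if i not in held]
        train = selection_set + extra
        hits += sum(knn_predict(train, test_set[i][0], k) == test_set[i][1] for i in fold)
    return hits / len(test_set)

rng = random.Random(0)
data = make_data(120, rng)
selection_set, test_set = data[:60], data[60:]
candidates = [1, 3, 5, 7]

# Parameter chosen without ever looking at the test set.
best_k = select_k(selection_set, candidates, n_folds=5)
# Standard approach: train on the selection set only, test once on the held-out set.
standard_acc = accuracy(selection_set, test_set, best_k)
# Sketch of cross-testing: same parameter, but more training data per test fold.
cross_acc = cross_test(selection_set, test_set, best_k, n_folds=5)
print(best_k, standard_acc, cross_acc)
```

In this sketch the parameter is selected once and remains a single, reportable value, which is the property the abstract contrasts with nested cross-validation, where a different parameter may be selected in each outer fold. Whether the re-use step above matches the published procedure in every detail is an assumption.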
