Multivariate classification of neuroimaging data with nested subclasses: Biased accuracy and implications for hypothesis testing

Biological data sets are typically characterized by high dimensionality and low effect sizes. A powerful method for detecting systematic differences between experimental conditions in such multivariate data sets is multivariate pattern analysis (MVPA), particularly pattern classification. However, in virtually all applications, data from the classes that correspond to the conditions of interest are not homogeneous but contain subclasses. Such subclasses can arise, for example, from individual subjects who contribute multiple data points, or from correlations between items within classes. We show here that in multivariate data with subclasses nested within their class structure, these subclasses introduce systematic information that improves classifiability beyond what is expected from the size of the class difference. We prove analytically that this subclass bias systematically inflates correct classification rates (CCRs) of linear classifiers, depending on the number of subclasses as well as on the portion of variance induced by the subclasses. In simulations, we demonstrate that subclass bias is highest when the between-class effect size is low and the subclass variance is high. This bias can be reduced by increasing the total number of subclasses. Alternatively, the subclass bias can be accounted for by using permutation tests that explicitly consider the subclass structure of the data. We illustrate our results in several experiments that recorded human EEG activity, demonstrating that parametric statistical tests as well as typical trial-wise permutation tests fail to determine the significance of classification outcomes correctly.
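The contrast between trial-wise and subclass-aware permutation can be illustrated with a small simulation. The following is a minimal sketch, not the authors' implementation: the function names (`simulate`, `cv_ccr`, `perm_null`), the nearest-class-mean classifier, the 5-fold trial-wise cross-validation, and all parameter values are assumptions chosen for illustration. The data contain no true class effect, only subclass offsets.

```python
import numpy as np

def simulate(rng, n_sub=4, n_trials=20, n_feat=10, effect=0.0, sub_sd=1.0):
    """Two classes, each containing n_sub nested subclasses of n_trials samples."""
    X, y, g = [], [], []
    for c in (0, 1):
        for s in range(n_sub):
            centre = rng.normal(0.0, sub_sd, n_feat)   # subclass-specific offset
            centre[0] += effect * (c - 0.5)            # class effect on feature 0
            X.append(centre + rng.normal(0.0, 1.0, (n_trials, n_feat)))
            y += [c] * n_trials
            g += [c * n_sub + s] * n_trials            # subclass identifier
    return np.vstack(X), np.array(y), np.array(g)

def cv_ccr(X, y, k=5, seed=0):
    """Trial-wise k-fold CCR of a nearest-class-mean (linear) classifier."""
    order = np.random.default_rng(seed).permutation(len(y))
    correct = 0
    for test in np.array_split(order, k):
        train = np.setdiff1d(order, test)
        m0 = X[train][y[train] == 0].mean(axis=0)
        m1 = X[train][y[train] == 1].mean(axis=0)
        w = m1 - m0                                    # linear discriminant direction
        b = w @ (m0 + m1) / 2                          # threshold at the midpoint
        correct += np.sum((X[test] @ w > b).astype(int) == y[test])
    return correct / len(y)

def perm_null(X, y, g, rng, n_perm=200, by_subclass=True):
    """Null distribution of the CCR under label permutation."""
    subs = np.unique(g)
    sub_label = np.array([y[g == s][0] for s in subs]) # one label per subclass
    null = np.empty(n_perm)
    for i in range(n_perm):
        if by_subclass:
            # Reassign labels to whole subclasses, keeping them intact.
            y_perm = rng.permutation(sub_label)[np.searchsorted(subs, g)]
        else:
            # Typical trial-wise permutation: ignores the subclass structure.
            y_perm = rng.permutation(y)
        null[i] = cv_ccr(X, y_perm)
    return null

rng = np.random.default_rng(1)
X, y, g = simulate(rng)                                # effect=0.0: no class difference
ccr = cv_ccr(X, y)
null_trial = perm_null(X, y, g, rng, by_subclass=False)
null_sub = perm_null(X, y, g, rng, by_subclass=True)
p_trial = (1 + np.sum(null_trial >= ccr)) / (1 + len(null_trial))
p_sub = (1 + np.sum(null_sub >= ccr)) / (1 + len(null_sub))
print(f"CCR={ccr:.3f}  null mean (trial-wise)={null_trial.mean():.3f}  "
      f"null mean (subclass-wise)={null_sub.mean():.3f}")
print(f"p (trial-wise)={p_trial:.3f}  p (subclass-wise)={p_sub:.3f}")
```

In this sketch the trial-wise null distribution is centred near 50%, whereas the subclass-wise null distribution is shifted upward because the permuted labels still align with the subclass offsets. With only a few subclasses, the number of distinct subclass-wise label assignments is small, which limits the resolution of the resulting p-value.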
