Generalization of the minimum covariance determinant algorithm for categorical and mixed data types

The minimum covariance determinant (MCD) algorithm is one of the most common techniques for detecting anomalous or outlying observations. The MCD algorithm depends on two features of multivariate data: the determinant of a covariance matrix (i.e., the product of its eigenvalues) and Mahalanobis distances (MD). Although the MCD algorithm is widely used and has many extensions, it is limited to quantitative data, and more specifically to data assumed to be continuous. One reason the MCD does not extend to other data types, such as categorical or ordinal data, is that there is no well-defined MD for data that are not continuous. To address the lack of MCD-like techniques for categorical or mixed data, we present a generalization of the MCD that relies on a multivariate technique called correspondence analysis (CA). Through CA we can define MD via singular vectors and compute the determinant from CA’s eigenvalues. Here we define and illustrate a generalized MCD on categorical data, and then show how it extends beyond categorical data to accommodate mixed data types (e.g., categorical, ordinal, and continuous). We illustrate this generalized MCD on data from two large-scale projects: the Ontario Neurodegenerative Disease Research Initiative (ONDRI) and the Alzheimer’s Disease Neuroimaging Initiative (ADNI), with genetics (categorical), clinical instruments and surveys (categorical or ordinal), and neuroimaging (continuous) data. We also make R code and toy data available to illustrate our generalized MCD.
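The two ingredients named above, MD from singular vectors and the determinant from eigenvalues, can be sketched for the ordinary continuous case. The Python snippet below is a minimal illustration under stated assumptions, not the authors' R code and not the CA-based generalization itself: it uses a plain column-centered SVD, and the function name `md_and_logdet_via_svd` and the simulated data are hypothetical.

```python
import numpy as np


def md_and_logdet_via_svd(X):
    """Squared Mahalanobis distances and log-determinant from one SVD.

    Hypothetical helper for continuous, full-column-rank data only.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    Xc = X - X.mean(axis=0)  # column-center the data

    # Thin SVD: Xc = U @ diag(d) @ Vt
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)

    # Eigenvalues of the sample covariance matrix Xc.T @ Xc / (n - 1)
    eigvals = d ** 2 / (n - 1)

    # Squared MD of each row is (n - 1) times the squared norm of
    # the matching row of left singular vectors.
    md2 = (n - 1) * np.sum(U ** 2, axis=1)

    # log(det) is the sum of the log eigenvalues (det = their product).
    logdet = np.sum(np.log(eigvals))
    return md2, logdet


# Simulated data (an assumption, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
md2, logdet = md_and_logdet_via_svd(X)

# Cross-check against the textbook definitions
S = np.cov(X, rowvar=False)
Xc = X - X.mean(axis=0)
md2_direct = np.einsum("ij,jk,ik->i", Xc, np.linalg.inv(S), Xc)
logdet_direct = np.linalg.slogdet(S)[1]
```

In this sketch the SVD route and the direct covariance-inverse route agree up to floating-point error, which is the property that allows MD to be defined through singular vectors when an ordinary covariance matrix is not the natural object, as with CA.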
