Statistical Properties of Multivariate Distance Matrix Regression for High-Dimensional Data Analysis

Multivariate distance matrix regression (MDMR) analysis is a statistical technique that allows researchers to relate P variables to an additional M factors collected on N individuals, where P ≫ N. The technique can be applied to a number of research settings involving high-dimensional data types such as DNA sequence data, gene expression microarray data, and imaging data. MDMR analysis involves computing the distance between all pairs of individuals with respect to P variables of interest and constructing an N × N matrix whose elements reflect these distances. Permutation tests can be used to test linear hypotheses that consider whether or not the M additional factors collected on the individuals can explain variation in the observed distances between and among the N individuals as reflected in the matrix. Despite its appeal and utility, properties of the statistics used in MDMR analysis have not been explored in detail. In this paper we consider the level accuracy and power of MDMR analysis assuming different distance measures and analysis settings. We also describe the utility of MDMR analysis in assessing hypotheses about the appropriate number of clusters arising from a cluster analysis.
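The procedure outlined above can be sketched in a few lines of NumPy. The abstract does not give the test statistic, so this sketch assumes the pseudo-F statistic commonly used in distance-based regression: the distance matrix is Gower-centered, projected with the hat matrix of the design matrix, and the ratio of explained to residual trace is compared against values obtained by permuting the rows of the design matrix. The function names (`gower_center`, `pseudo_f`, `mdmr_permutation_test`) and the default of 999 permutations are illustrative choices, not taken from the paper.

```python
import numpy as np


def gower_center(dist):
    """Return the Gower-centered matrix G = C A C, with A = -0.5 * D**2."""
    n = dist.shape[0]
    a = -0.5 * dist ** 2
    c = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return c @ a @ c


def pseudo_f(g, x):
    """Pseudo-F statistic for design matrix x (intercept column included)."""
    n, m = x.shape
    h = x @ np.linalg.pinv(x)            # hat (projection) matrix
    r = np.eye(n) - h                    # residual projection
    explained = np.trace(h @ g @ h)
    residual = np.trace(r @ g @ r)
    return (explained / (m - 1)) / (residual / (n - m))


def mdmr_permutation_test(dist, factors, n_perm=999, seed=0):
    """Permutation test of whether `factors` explain variation in `dist`.

    dist    : (N, N) symmetric matrix of pairwise distances
    factors : (N,) or (N, M) array of predictor variables
    Returns the observed pseudo-F and its permutation p-value.
    """
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    x = np.column_stack([np.ones(n), factors])  # prepend an intercept
    g = gower_center(dist)
    f_obs = pseudo_f(g, x)
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(n)        # shuffle individuals' predictor rows
        if pseudo_f(g, x[perm]) >= f_obs:
            exceed += 1
    # Count the observed statistic as one of the permutations.
    return f_obs, (exceed + 1) / (n_perm + 1)
```

Any distance measure can be supplied through `dist`; only the predictors are permuted, so the distance matrix is computed once. With a single outcome variable and Euclidean distance, this pseudo-F reduces to the usual regression F statistic.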
