Degrees of freedom for combining regression with factor analysis

In the AGEMAP genomics study, researchers sought to detect genes related to age in a variety of tissue types. After few age-related genes were found in some of the analyzed tissue types, the study was criticized for having low power. The low power may be due to important unmeasured variables, and indeed we find that a latent factor model appears to explain substantial variability not captured by the measured covariates. We propose including the estimated latent factors in a multiple regression model. The key difficulty in doing so is assigning appropriate degrees of freedom to the estimated factors so as to obtain unbiased error variance estimators and enable valid hypothesis testing. When the number of responses is large relative to the sample size, treating the estimated factors like observed covariates leads to a downward bias in the variance estimates. Many ad hoc solutions to this problem have been proposed in the literature without the support of a careful theoretical analysis. Using recent results from random matrix theory, we derive a simple, easy-to-use expression for the degrees of freedom. Our estimate gives a principled alternative to the ad hoc approaches in common use. Extensive simulation results show excellent agreement between the proposed estimator and its theoretical value. Applying our methodology to the AGEMAP genomics study, we found an order-of-magnitude increase in the number of significant genes. Although we focus on the AGEMAP study, the methods developed in this paper are widely applicable to other multivariate models and are thus of independent interest.
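
The downward bias described above can be illustrated with a small simulation. The sketch below is not the estimator derived in the paper. It assumes a deliberately simplified setting: pure Gaussian noise with unit variance, no measured covariates, and no true latent factors. The function names (`naive_variance_estimate`, `average_naive_estimate`) and the sizes `n`, `p`, `k` are hypothetical choices made for illustration. The top `k` singular components are removed as "estimated factors," and the residual variance is computed as if each factor cost one degree of freedom per response, the way an observed covariate would.

```python
import numpy as np


def naive_variance_estimate(Y, k):
    """Naive noise-variance estimate after removing k estimated factors.

    Y is an (n samples) x (p responses) matrix. The factors are taken to be
    the top-k singular components of Y, and the residual sum of squares is
    divided by (n - k) * p, i.e. the factors are charged one degree of
    freedom per response, as if they were observed covariates.
    """
    n, p = Y.shape
    s = np.linalg.svd(Y, compute_uv=False)  # singular values, descending
    rss = np.sum(s[k:] ** 2)  # squared residual left after dropping the top k
    return rss / ((n - k) * p)


def average_naive_estimate(n=20, p=2000, k=3, reps=20, seed=0):
    """Average the naive estimate over several pure-noise data sets.

    The true noise variance is 1 (illustrative assumption), so an unbiased
    estimator would average close to 1.
    """
    rng = np.random.default_rng(seed)
    estimates = [
        naive_variance_estimate(rng.standard_normal((n, p)), k)
        for _ in range(reps)
    ]
    return float(np.mean(estimates))


if __name__ == "__main__":
    # True variance is 1; the naive average comes out below it.
    print(average_naive_estimate())
```

With many more responses than samples, the leading singular values of a noise matrix are systematically larger than average, so removing them strips out more variance than `k` degrees of freedom per response would account for. That is the mechanism behind the underestimate, and it is why the estimated factors need a larger degrees-of-freedom charge than observed covariates.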
