Sparse probit linear mixed model

Linear mixed models (LMMs) are important tools in statistical genetics. When used for feature selection, they make it possible to find a sparse set of genetic traits that best predict a continuous phenotype of interest, while simultaneously correcting for confounding factors such as age, ethnicity, and population structure. Because they are formulated as linear regression models, LMMs have been restricted to continuous phenotypes. We introduce the sparse probit linear mixed model (Probit-LMM), which generalizes the LMM modeling paradigm to binary phenotypes. The technical challenge is that the model no longer possesses a closed-form likelihood function. In this paper, we present a scalable approximate inference algorithm that lets us fit the model to high-dimensional data sets. We show on three real-world examples from different domains that, in the setting of binary labels, our algorithm leads to better prediction accuracy and also selects features that show less correlation with the confounding factors.
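To make the missing closed-form likelihood concrete, the following is a minimal sketch, not the paper's inference algorithm. It assumes a standard probit mixed-model form, y_i = sign(x_i^T w + u_i + eps_i) with a correlated random effect u ~ N(0, K) and unit-variance noise eps; this form, the function names, and the toy data are all illustrative assumptions that the abstract does not spell out. Without the random effect the likelihood factorizes into a product of Gaussian CDFs; with a correlated K it becomes a multivariate Gaussian orthant probability, which the sketch can only estimate by naive Monte Carlo sampling.

```python
import math
import numpy as np


def probit_likelihood_independent(y, X, w):
    """Closed-form P(y | X, w) when there is no random effect (K = 0).

    Each label contributes Phi(y_i * x_i^T w), where Phi is the standard
    normal CDF, so the likelihood is a simple product.
    """
    margins = y * (X @ w)
    cdf = [0.5 * (1.0 + math.erf(m / math.sqrt(2.0))) for m in margins]
    return float(np.prod(cdf))


def probit_lmm_likelihood_mc(y, X, w, K, n_samples=20000, seed=0):
    """Naive Monte Carlo estimate of P(y | X, w) under the assumed model.

    Latent values are drawn as z ~ N(Xw, K + I); the estimate is the
    fraction of draws whose signs agree with every label in y.
    """
    rng = np.random.default_rng(seed)
    cov = K + np.eye(len(y))
    z = rng.multivariate_normal(X @ w, cov, size=n_samples)
    return float(np.mean(np.all(y * z > 0, axis=1)))


# Toy data (hypothetical): three samples, two features, labels in {-1, +1}.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
w = np.array([0.5, -0.3])
y = np.array([1, -1, 1])

exact_no_confounder = probit_likelihood_independent(y, X, w)
mc_no_confounder = probit_lmm_likelihood_mc(y, X, w, K=np.zeros((3, 3)))
mc_with_confounder = probit_lmm_likelihood_mc(y, X, w, K=0.5 * np.ones((3, 3)))
```

With K set to zero the Monte Carlo estimate should agree with the closed-form product up to sampling error. Naive sampling of this kind becomes impractical as the number of samples in the data set grows, which is one way to see why a scalable approximate inference scheme is needed.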
