The joint lasso: high-dimensional regression for group structured data

We consider high-dimensional regression over subgroups of observations. Our work is motivated by biomedical problems, where subsets of samples, representing for example disease subtypes, may differ with respect to underlying regression models. In the high-dimensional setting, estimating a different model for each subgroup is challenging due to limited sample sizes. Focusing on the case in which subgroup-specific models may be expected to be similar but not necessarily identical, we treat subgroups as related problem instances and jointly estimate subgroup-specific regression coefficients. This is done in a penalized framework that combines an $\ell_1$ term with an additional term penalizing differences between subgroup-specific coefficients. The resulting solutions are globally sparse but allow information-sharing between the subgroups. We present algorithms for estimation, together with empirical results on simulated data and on Alzheimer's disease, amyotrophic lateral sclerosis, and cancer datasets. These examples demonstrate the gains joint estimation can offer in prediction as well as in providing subgroup-specific sparsity patterns.
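To make the penalized framework concrete, the following is a minimal sketch of an objective of the kind described above. The abstract does not state the exact formulation, so the details here are assumptions: the per-subgroup squared-error loss scaled by $1/(2 n_k)$, the squared $\ell_2$ form of the difference penalty, the equal weighting of all subgroup pairs, and the names `joint_lasso_objective`, `lam`, and `gamma` are all illustrative rather than taken from the paper.

```python
import numpy as np


def joint_lasso_objective(X_list, y_list, betas, lam, gamma):
    """Evaluate an illustrative joint-lasso-style objective.

    X_list, y_list: per-subgroup design matrices and responses.
    betas: per-subgroup coefficient vectors (same length p for all subgroups).
    lam: weight on the l1 (sparsity) term.
    gamma: weight on the term penalizing differences between subgroups.
    The exact loss scaling and fusion norm are assumptions of this sketch.
    """
    n_groups = len(X_list)

    # Squared-error loss, computed separately within each subgroup.
    loss = 0.0
    for X, y, b in zip(X_list, y_list, betas):
        resid = y - X @ b
        loss += float(resid @ resid) / (2 * len(y))

    # l1 term: encourages sparsity in every subgroup-specific vector.
    sparsity = sum(float(np.abs(b).sum()) for b in betas)

    # Fusion term: squared l2 distance between each pair of subgroups
    # (assumed form; an l1 distance would be another possible choice).
    fusion = 0.0
    for k in range(n_groups):
        for j in range(k + 1, n_groups):
            diff = betas[k] - betas[j]
            fusion += float(diff @ diff)

    return loss + lam * sparsity + gamma * fusion
```

In this sketch, setting `gamma = 0` decouples the subgroups into separate lasso problems, while a very large `gamma` pushes all subgroup-specific coefficient vectors toward a single shared vector; intermediate values correspond to the "similar but not necessarily identical" regime the abstract describes.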
