Integrative analysis of multiple diverse omics datasets by sparse group multitask regression

A variety of high-throughput genome-wide assays enable the exploration of genetic risk factors underlying complex traits. Although these studies have had a remarkable impact on identifying susceptibility biomarkers, they suffer from issues such as limited sample size and low reproducibility. Combining individual studies from different genetic levels and platforms promises to improve the power and consistency of biomarker identification. In this paper, we propose a novel integrative method, sparse group multitask regression, for integrating diverse omics datasets, platforms, and populations to identify risk genes and risk factors of complex diseases. The method combines multitask learning with sparse group regularization in order to: (1) treat biomarker identification in each individual study as a task and combine the tasks by multitask learning; (2) group variables from all studies for identifying significant genes; (3) enforce a sparsity constraint on groups of variables to overcome the “small sample size but large number of variables” problem. We introduce two sparse group penalties into our multitask model, sparse group lasso and sparse group ridge, and provide an effective algorithm for each model. In addition, we propose a significance test for identifying potential risk genes. Two simulation studies are performed to evaluate the performance of our integrative method by comparing it with a conventional meta-analysis method. The results show that our sparse group multitask method significantly outperforms the meta-analysis method. In an application to our osteoporosis studies, 7 genes are identified as significant by our method and are found to have significant effects in three other independent studies used for validation. The most significant gene, SOD2, was identified in our previous osteoporosis study involving the same expression dataset. Several other genes, such as TREML2, HTR1E, and GLO1, are shown to be novel susceptibility genes for osteoporosis, as confirmed by other studies.
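The idea of combining per-study regressions under a sparse group lasso penalty can be illustrated with a minimal sketch. It is not the algorithm proposed in the paper: it assumes, for simplicity, that every study measures the same p variables, that each group is one variable's coefficients across all studies, and that the model is fitted by plain proximal gradient descent. The function names, the penalty weights `lam1` and `lam2`, and the synthetic data are all hypothetical.

```python
import numpy as np

def prox_sparse_group(B, step, lam1, lam2):
    """Proximal operator of step * (lam1 * ||B||_1 + lam2 * sum_j ||B[j, :]||_2).

    Rows of B are variables (one group per row); columns are tasks (studies).
    """
    # Element-wise soft-thresholding handles the lasso part.
    S = np.sign(B) * np.maximum(np.abs(B) - step * lam1, 0.0)
    # Row-wise shrinkage handles the group part; a whole row can become zero.
    norms = np.linalg.norm(S, axis=1, keepdims=True)
    scale = np.maximum(1.0 - step * lam2 / np.maximum(norms, 1e-12), 0.0)
    return S * scale

def fit_sparse_group_multitask(Xs, ys, lam1=0.05, lam2=0.1, n_iter=500):
    """Minimise sum_t ||y_t - X_t b_t||^2 / (2 n_t) plus the sparse group penalty.

    Assumption of this sketch: all studies share the same p columns.
    """
    p, T = Xs[0].shape[1], len(Xs)
    B = np.zeros((p, T))
    # Step size from the largest per-task Lipschitz constant of the smooth loss.
    L = max(np.linalg.norm(X, 2) ** 2 / X.shape[0] for X in Xs)
    step = 1.0 / L
    for _ in range(n_iter):
        # Gradient of the squared-error loss, one column per task.
        G = np.column_stack([
            X.T @ (X @ B[:, t] - y) / X.shape[0]
            for t, (X, y) in enumerate(zip(Xs, ys))
        ])
        B = prox_sparse_group(B - step * G, step, lam1, lam2)
    return B

# Synthetic example (made-up data): 3 studies, 20 variables, 2 true risk variables.
rng = np.random.default_rng(0)
n, p, T = 200, 20, 3
B_true = np.zeros((p, T))
B_true[:2, :] = 2.0
Xs = [rng.standard_normal((n, p)) for _ in range(T)]
ys = [X @ B_true[:, t] + 0.1 * rng.standard_normal(n) for t, X in enumerate(Xs)]
B_hat = fit_sparse_group_multitask(Xs, ys)
selected = np.flatnonzero(np.linalg.norm(B_hat, axis=1) > 0)
```

In this sketch a variable counts as selected when its row of `B_hat` is nonzero, so evidence for a variable is pooled across studies, while the element-wise penalty still allows its effect to vanish in an individual study.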

[1]  Stephen J. Wright,et al.  Sparse Reconstruction by Separable Approximation , 2008, IEEE Transactions on Signal Processing.

[2]  Xihong Lin,et al.  JOINT ANALYSIS OF SNP AND GENE EXPRESSION DATA IN GENETIC ASSOCIATION STUDIES OF COMPLEX DISEASES. , 2014, The annals of applied statistics.

[3]  Hong-Wen Deng,et al.  An integrative study ascertained SOD2 as a susceptibility gene for osteoporosis in Chinese , 2011, Journal of bone and mineral research : the official journal of the American Society for Bone and Mineral Research.

[4]  Junfeng Yang,et al.  Alternating Direction Algorithms for 1-Problems in Compressive Sensing , 2009, SIAM J. Sci. Comput..

[5]  Stephen P. Boyd,et al.  Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers , 2011, Found. Trends Mach. Learn..

[6]  Zoubin Ghahramani,et al.  Bayesian correlated clustering to integrate multiple datasets , 2012, Bioinform..

[7]  R. Tibshirani,et al.  A note on the group lasso and a sparse group lasso , 2010, 1001.0736.

[8]  Wotao Yin,et al.  Group sparse optimization by alternating direction method , 2013, Optics & Photonics - Optical Engineering + Applications.

[9]  R. Jiang,et al.  Integrating human omics data to prioritize candidate genes , 2013, BMC Medical Genomics.

[10]  Mark W. Lipsey,et al.  Practical Meta-Analysis , 2000 .

[11]  Peter Donnelly,et al.  HAPGEN2: simulation of multiple disease SNPs , 2011, Bioinform..

[12]  Alexey I. Nesvizhskii,et al.  Reconstructing targetable pathways in lung cancer by integrating diverse omics data , 2013, Nature Communications.

[13]  Jian Huang,et al.  A Selective Review of Group Selection in High-Dimensional Models. , 2012, Statistical science : a review journal of the Institute of Mathematical Statistics.

[14]  Ernie Esser,et al.  Applications of Lagrangian-Based Alternating Direction Methods and Connections to Split Bregman , 2009 .

[15]  Jian Huang,et al.  Integrative analysis and variable selection with multiple high-dimensional data sets. , 2011, Biostatistics.

[16]  Jian Huang,et al.  Integrative prescreening in analysis of multiple cancer genomic studies , 2012, BMC Bioinformatics.

[17]  Serkalem Demissie,et al.  Genome‐wide association of an integrated osteoporosis‐related phenotype: Is there evidence for pleiotropic genes? , 2012, Journal of bone and mineral research : the official journal of the American Society for Bone and Mineral Research.

[18]  Yan Guo,et al.  Genome-wide association and follow-up replication studies identified ADAMTS18 and TGFBR3 as bone mass candidate genes in different ethnic groups. , 2009, American journal of human genetics.

[19]  Shuiwang Ji,et al.  SLEP: Sparse Learning with Efficient Projections , 2011 .

[20]  M. Colonna,et al.  The TREM receptor family and signal integration , 2006, Nature Immunology.

[21]  Hao He,et al.  Network-based investigation of genetic modules associated with functional brain networks in schizophrenia , 2013, 2013 IEEE International Conference on Bioinformatics and Biomedicine.

[22]  Yun Li,et al.  METAL: fast and efficient meta-analysis of genomewide association scans , 2010, Bioinform..

[23]  Hui Jiang,et al.  An in vivo genome wide gene expression study of circulating monocytes suggested GBP1, STAT1 and CXCL10 as novel risk genes for the differentiation of peak bone mass. , 2009, Bone.

[24]  J. Ioannidis,et al.  Meta-analysis methods for genome-wide association studies and beyond , 2013, Nature Reviews Genetics.

[25]  Jian Huang,et al.  Sparse group penalized integrative analysis of multiple cancer prognosis datasets. , 2013, Genetics research.

[26]  Daoqiang Zhang,et al.  Multi-modal multi-task learning for joint prediction of multiple regression and classification variables in Alzheimer's disease , 2012, NeuroImage.

[27]  A. Owen,et al.  A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae) , 2003, Proceedings of the National Academy of Sciences of the United States of America.

[28]  Jennifer J Westendorf,et al.  Histone deacetylases in skeletal development and bone mass maintenance. , 2011, Gene.

[29]  Lin S. Chen,et al.  Insights into colon cancer etiology via a regularized approach to gene set analysis of GWAS data. , 2010, American journal of human genetics.

[30]  Olga G. Troyanskaya,et al.  A scalable method for integration and functional analysis of multiple microarray datasets , 2006, Bioinform..

[31]  Vince D. Calhoun,et al.  Group sparse canonical correlation analysis for genomic data integration , 2013, BMC Bioinformatics.

[32]  Mark W. Schmidt,et al.  GROUP SPARSITY VIA LINEAR-TIME PROJECTION , 2008 .

[33]  Bahram Parvin,et al.  Sparse multitask regression for identifying common mechanism of response to therapeutic targets , 2010, Bioinform..

[34]  T. Furey,et al.  Integrating genetic and gene expression evidence into genome-wide association analysis of gene sets. , 2011, Genome research.

[35]  Hiroshi Takayanagi,et al.  TREM2 and β-Catenin Regulate Bone Homeostasis by Controlling the Rate of Osteoclastogenesis , 2012, The Journal of Immunology.

[36]  Hui Shen,et al.  A Novel Pathophysiological Mechanism for Osteoporosis Suggested by an in Vivo Gene Expression Study of Circulating Monocytes* , 2005, Journal of Biological Chemistry.

[37]  Makoto Muroi,et al.  The identification of an osteoclastogenesis inhibitor through the inhibition of glyoxalase I , 2008, Proceedings of the National Academy of Sciences.

[38]  Yonina C. Eldar,et al.  C-HiLasso: A Collaborative Hierarchical Sparse Modeling Framework , 2010, IEEE Transactions on Signal Processing.

[39]  E. Candes,et al.  11-magic : Recovery of sparse signals via convex programming , 2005 .

[40]  C. Greenwood,et al.  Data Integration in Genetics and Genomics: Methods and Challenges , 2009, Human genomics and proteomics : HGP.

[41]  David Horn,et al.  Histone deacetylases. , 2008, Advances in experimental medicine and biology.

[42]  A. Chinnaiyan,et al.  Integrative analysis of the cancer transcriptome , 2005, Nature Genetics.

[43]  Xiangding Chen,et al.  Gene Expression Profiling in Monocytes and SNP Association Suggest the Importance of the Gene for Osteoporosis in Both Chinese and Caucasians , 2009, Journal of bone and mineral research : the official journal of the American Society for Bone and Mineral Research.

[44]  Noah Simon,et al.  A Sparse-Group Lasso , 2013 .

[45]  Chih-Wen Cheng,et al.  Multiscale Integration of -Omic, Imaging, and Clinical Data in Biomedical Informatics , 2012, IEEE Reviews in Biomedical Engineering.