Molecular pathway identification using biological network-regularized logistic models

Background: Selecting genes and pathways indicative of disease is a central problem in computational biology. This problem is especially challenging when parsing multi-dimensional genomic data. A number of tools, such as L1-norm based regularization and its extensions elastic net and fused lasso, have been introduced to deal with this challenge. However, these approaches tend to ignore the vast amount of a priori biological network information curated in the literature.

Results: We propose the use of graph Laplacian regularized logistic regression to integrate biological networks into disease classification and pathway association problems. Simulation studies demonstrate that the performance of the proposed algorithm is superior to elastic net and lasso analyses. Utility of this algorithm is also validated by its ability to reliably differentiate breast cancer subtypes using a large breast cancer dataset recently generated by the Cancer Genome Atlas (TCGA) consortium. Many of the protein-protein interaction modules identified by our approach are further supported by evidence published in the literature. Source code of the proposed algorithm is freely available at http://www.github.com/zhandong/Logit-Lapnet.

Conclusion: Logistic regression with graph Laplacian regularization is an effective algorithm for identifying key pathways and modules associated with disease subtypes. With the rapid expansion of our knowledge of biological regulatory networks, this approach will become more accurate and increasingly useful for mining transcriptomic, epigenomic, and other types of genome-wide association studies.
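To make the idea of graph Laplacian regularized logistic regression concrete, the following is a minimal sketch, not the authors' Logit-Lapnet implementation. The abstract does not state the exact objective, so the form used here is an assumption: mean logistic loss plus an L1 penalty (weight `lam1`) plus a quadratic network penalty `lam2 * b^T L b`, where `L = D - A` is the unnormalized Laplacian of a gene network with adjacency matrix `A`. The solver is a plain proximal gradient loop chosen for brevity, and all function and parameter names are hypothetical.

```python
import numpy as np

def graph_laplacian(adj):
    """Unnormalized Laplacian L = D - A of a symmetric adjacency matrix."""
    adj = np.asarray(adj, dtype=float)
    return np.diag(adj.sum(axis=1)) - adj

def soft_threshold(v, t):
    """Proximal operator of t * ||v||_1 (elementwise shrinkage toward zero)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def objective(X, y, L, b0, b, lam1, lam2):
    """Assumed objective: mean log-loss + lam1*||b||_1 + lam2 * b^T L b."""
    z = b0 + X @ b
    loss = np.mean(np.logaddexp(0.0, z) - y * z)  # stable log(1 + e^z) - y*z
    return loss + lam1 * np.abs(b).sum() + lam2 * (b @ L @ b)

def fit_laplacian_logit(X, y, L, lam1=0.05, lam2=0.1, n_iter=500):
    """Proximal gradient descent; y is 0/1 and the intercept b0 is unpenalized."""
    n, p = X.shape
    Xa = np.hstack([np.ones((n, 1)), X])
    # Upper bound on the Lipschitz constant of the smooth part's gradient.
    lip = np.linalg.norm(Xa, 2) ** 2 / (4 * n) + 2 * lam2 * np.linalg.eigvalsh(L)[-1]
    step = 1.0 / lip
    b0, b = 0.0, np.zeros(p)
    for _ in range(n_iter):
        z = b0 + X @ b
        resid = 0.5 * (1.0 + np.tanh(z / 2.0)) - y  # sigmoid(z) - y
        grad_b = X.T @ resid / n + 2 * lam2 * (L @ b)
        grad_b0 = resid.mean()
        b = soft_threshold(b - step * grad_b, step * lam1)
        b0 = b0 - step * grad_b0
    return b0, b

# Toy usage: genes 0-1 and 2-3 are linked; only the first pair drives the label.
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((80, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(float)
A_demo = np.array([[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]])
L_demo = graph_laplacian(A_demo)
b0_demo, b_demo = fit_laplacian_logit(X_demo, y_demo, L_demo)
print("intercept:", b0_demo, "coefficients:", b_demo)
```

Under this assumed penalty, the Laplacian term equals the sum of squared coefficient differences over network edges, so linked genes are pulled toward similar coefficients, while the L1 term keeps the selected set sparse. Setting `lam2 = 0` reduces the sketch to lasso-penalized logistic regression.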
