Analysis of cancer gene expression data with an assisted robust marker identification approach

Gene expression (GE) studies play a critical role in cancer research. Despite tremendous effort, analysis results are still often unsatisfactory because of weak signals and high data dimensionality. Analysis is often further challenged by long‐tailed distributions of the outcome variables. In recent multidimensional studies, data have been collected on GEs as well as their regulators (e.g., copy number alterations (CNAs), methylation, and microRNAs), which can provide additional information on the associations between GEs and cancer outcomes. In this study, we develop an assisted robust marker identification (ARMI) approach for analyzing cancer studies with measurements on GEs as well as regulators. The proposed approach borrows information from regulators and can be more effective than analyzing GE data alone. A robust objective function is adopted to accommodate long‐tailed distributions, and marker identification is realized using penalization. The proposed approach has an intuitive formulation and is computationally affordable. Simulation shows satisfactory performance under a variety of settings. Data from The Cancer Genome Atlas (TCGA) on melanoma and lung cancer are analyzed, leading to biologically plausible marker identification and superior prediction.
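
To make the combination of a robust objective and penalization concrete, the following is a minimal sketch and not the ARMI method itself. The abstract does not specify the loss or the penalty, so this example assumes a Huber loss with a lasso (L1) penalty, fitted by proximal gradient descent, and it omits the regulator-assisted component entirely. All function names, the simulated data, and the tuning values (`lam`, `delta`) are hypothetical choices for illustration.

```python
import numpy as np

def huber_grad(r, delta):
    # Derivative of the Huber loss w.r.t. the residual:
    # r when |r| <= delta, delta * sign(r) otherwise.
    return np.clip(r, -delta, delta)

def soft_threshold(z, t):
    # Proximal operator of t * ||.||_1; sets small coefficients exactly to zero.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def huber_lasso(X, y, lam, delta=1.0, n_iter=2000):
    # Minimizes (1/n) * sum huber(y - X beta) + lam * ||beta||_1
    # with proximal gradient steps (assumed stand-in for the paper's estimator).
    n, p = X.shape
    # Step size 1/L, where L = ||X||_2^2 / n bounds the gradient's Lipschitz constant.
    step = n / (np.linalg.norm(X, 2) ** 2)
    beta = np.zeros(p)
    for _ in range(n_iter):
        r = y - X @ beta
        grad = -X.T @ huber_grad(r, delta) / n
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

# Simulated high-dimensional data with long-tailed (t-distributed) errors.
rng = np.random.default_rng(0)
n, p = 100, 200
X = rng.standard_normal((n, p))
true_beta = np.zeros(p)
true_beta[:3] = [2.0, -2.0, 1.5]
y = X @ true_beta + rng.standard_t(df=2, size=n)

beta_hat = huber_lasso(X, y, lam=0.15)
selected = np.flatnonzero(beta_hat)  # indices of identified "markers"
print(selected)
```

The Huber loss grows linearly rather than quadratically for large residuals, which limits the influence of extreme outcome values, while the L1 penalty shrinks most coefficients to exactly zero so that the nonzero entries serve as the selected markers. In practice `lam` would be chosen by a data-driven procedure such as cross-validation.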
