Leveraging heterogeneity across multiple data sets increases accuracy of cell-mixture deconvolution and reduces biological and technical biases

In silico quantification of cell proportions from mixed-cell transcriptomics data (deconvolution) requires a reference expression matrix, called a basis matrix. We hypothesized that basis matrices created using only healthy samples from a single microarray platform would introduce biological and technical biases in deconvolution. We show the presence of such biases in two existing matrices, IRIS and LM22, irrespective of the deconvolution method used. Here, we present immunoStates, a basis matrix built using 6160 samples with different disease states across 42 microarray platforms. We found that immunoStates significantly reduced biological and technical biases. We further show that cellular proportion estimates obtained using immunoStates are consistently more correlated with measured proportions than those obtained using IRIS and LM22, across all methods. Importantly, we found that the choice of deconvolution method has virtually no effect once the basis matrix is chosen. Our results demonstrate the importance of incorporating biological and technical heterogeneity in a basis matrix for achieving consistently high accuracy.
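To make the role of the basis matrix concrete, the following minimal sketch models a mixed-sample expression profile as the basis matrix multiplied by a vector of cell-type fractions, and recovers those fractions by non-negative least squares solved with projected gradient descent. This is an illustrative formulation only, not necessarily any of the deconvolution methods evaluated here; the 4-gene, 3-cell-type matrix, the fractions, and the function name `deconvolve` are all hypothetical toy values chosen for the example.

```python
# Toy deconvolution: mixture ≈ basis × fractions, with fractions >= 0.
# All numbers below are made up for illustration (assumption, not real data).

def deconvolve(basis, mixture, n_iter=2000):
    """Estimate cell-type fractions by non-negative least squares.

    basis   -- list of rows (genes), each a list of per-cell-type expression
    mixture -- list of expression values for the mixed sample, one per gene
    Uses projected gradient descent; the result is rescaled to sum to 1.
    """
    n_genes, n_types = len(basis), len(basis[0])
    # 1 / (squared Frobenius norm) never exceeds 1 / (largest eigenvalue
    # of B^T B), so this step size is safe for gradient descent.
    step = 1.0 / sum(v * v for row in basis for v in row)
    fractions = [1.0 / n_types] * n_types  # start from a uniform guess
    for _ in range(n_iter):
        # Residual of the current fit, one value per gene.
        resid = [
            sum(basis[g][k] * fractions[k] for k in range(n_types)) - mixture[g]
            for g in range(n_genes)
        ]
        # Gradient of 0.5 * ||B f - m||^2 with respect to f.
        grad = [
            sum(basis[g][k] * resid[g] for g in range(n_genes))
            for k in range(n_types)
        ]
        # Gradient step followed by projection onto f >= 0.
        fractions = [max(0.0, fractions[k] - step * grad[k]) for k in range(n_types)]
    total = sum(fractions)
    return [v / total for v in fractions] if total > 0 else fractions


# Hypothetical basis matrix: rows are genes, columns are three cell types.
basis_matrix = [
    [10.0, 1.0, 1.0],
    [1.0, 10.0, 1.0],
    [1.0, 1.0, 10.0],
    [5.0, 5.0, 5.0],
]
true_fractions = [0.5, 0.3, 0.2]

# Simulate a noise-free mixed sample from the known fractions.
mixture_profile = [
    sum(row[k] * true_fractions[k] for k in range(3)) for row in basis_matrix
]

estimated = deconvolve(basis_matrix, mixture_profile)
print([round(v, 4) for v in estimated])
```

In this noise-free toy case the estimate matches the simulated fractions. If the basis matrix does not reflect the expression of the cell types actually present in the sample (for example, because of a different platform or disease state), the same solver will return biased fractions, which is the kind of bias the abstract attributes to the basis matrix rather than to the method.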
