Leveraging heterogeneity across multiple datasets increases cell-mixture deconvolution accuracy and reduces biological and technical biases

In silico quantification of cell proportions from mixed-cell transcriptomics data (deconvolution) requires a reference expression matrix, called a basis matrix. We hypothesize that matrices created using only healthy samples from a single microarray platform would introduce biological and technical biases in deconvolution. We show the presence of such biases in two existing matrices, IRIS and LM22, irrespective of the deconvolution method. Here, we present immunoStates, a basis matrix built using 6160 samples with different disease states across 42 microarray platforms. We find that immunoStates significantly reduces biological and technical biases. Importantly, we find that the choice of deconvolution method has minimal or virtually no effect once the basis matrix is chosen. We further show that cellular proportion estimates obtained using immunoStates are consistently more correlated with measured proportions than those obtained using IRIS and LM22, across all methods. Our results demonstrate the need for, and importance of, incorporating biological and technical heterogeneity in a basis matrix to achieve consistently high accuracy.

Cell type deconvolution from bulk expression data relies on a reference expression matrix. Here, the authors introduce a basis matrix built using data from both healthy and diseased samples profiled on 42 platforms, reducing biases introduced by single-platform matrices built using healthy samples.
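To make the role of the basis matrix concrete, the sketch below models a mixed-cell expression profile as a non-negative combination of cell-type reference profiles and recovers the proportions. It is an illustration only, not the authors' pipeline and not one of the methods compared in the paper: the basis values, the cell-type labels, and the function name `deconvolve` are hypothetical, and projected gradient descent on a least-squares objective is assumed here simply because it is one of the simplest ways to enforce non-negativity.

```python
# Toy deconvolution: mixture ~ basis @ proportions, with proportions >= 0.
# All numbers and names below are made up for illustration.

def deconvolve(basis, mixture, steps=2000):
    """Estimate cell-type proportions by projected gradient descent
    on 0.5 * ||basis @ p - mixture||^2 subject to p >= 0."""
    n_genes, n_types = len(basis), len(basis[0])
    # Precompute B^T B (Gram matrix) and B^T m.
    gram = [[sum(basis[g][i] * basis[g][j] for g in range(n_genes))
             for j in range(n_types)] for i in range(n_types)]
    rhs = [sum(basis[g][i] * mixture[g] for g in range(n_genes))
           for i in range(n_types)]
    # trace(gram) bounds the largest eigenvalue, so 1/trace is a safe step size.
    step = 1.0 / sum(gram[i][i] for i in range(n_types))
    p = [0.0] * n_types
    for _ in range(steps):
        grad = [sum(gram[i][j] * p[j] for j in range(n_types)) - rhs[i]
                for i in range(n_types)]
        # Gradient step followed by projection onto the non-negative orthant.
        p = [max(0.0, p[i] - step * grad[i]) for i in range(n_types)]
    total = sum(p)
    # Rescale so the estimates can be read as proportions.
    return [x / total for x in p] if total > 0 else p

# Hypothetical basis matrix: rows are genes, columns are cell types
# (say, T cells, B cells, monocytes).
basis = [
    [10.0, 1.0, 1.0],
    [1.0, 12.0, 1.0],
    [1.0, 1.0, 8.0],
    [5.0, 5.0, 5.0],
]
true_props = [0.5, 0.3, 0.2]
# Simulate a noise-free mixed-cell sample from the known proportions.
mixture = [sum(row[k] * true_props[k] for k in range(3)) for row in basis]

estimated = deconvolve(basis, mixture)
print([round(x, 3) for x in estimated])
```

In this noise-free toy case the estimate matches the simulated proportions. With real data, the reference profiles in the basis matrix may not match the cells in the sample (for example, a different platform or disease state), which is the kind of mismatch the abstract describes as a source of bias.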
