Improving deconvolution methods in biology through open innovation competitions: an application to the connectivity map

Do machine learning methods improve on standard deconvolution techniques for gene expression data? This paper uses a new dataset, combined with an open innovation competition, to evaluate a wide range of gene-expression deconvolution approaches developed by 294 competitors from 20 countries. The objective of the competition was to separate the expression of individual genes from composite measurements of gene pairs. Outcomes were evaluated against direct measurements of single genes from the same samples. The winning algorithm, based on random forest regression, outperformed the other methods in both accuracy and reproducibility. More traditional Gaussian-mixture methods also performed well and tended to be faster. The best deep learning approach yielded results slightly inferior to those of the random forest and Gaussian-mixture methods. We anticipate that researchers in the field will find the dataset and algorithms developed in this study to be a powerful tool for benchmarking their deconvolution methods and a useful resource for other applications.
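
To make the task concrete, the following is a minimal sketch of the Gaussian-mixture style of deconvolution mentioned above. It is not any competitor's code. It assumes, purely for illustration, that the composite measurement for a gene pair is a set of log-scale intensities in which one gene contributes about twice as many observations as the other, so the heavier mixture component can be assigned to that gene. The function names (`fit_two_gaussians`, `deconvolve_pair`), the 2:1 ratio, and the simulated values are all hypothetical.

```python
import numpy as np


def fit_two_gaussians(x, n_iter=200, tol=1e-8):
    """Fit a two-component 1-D Gaussian mixture with plain EM."""
    x = np.asarray(x, dtype=float)
    # Start the two means at the lower and upper quartiles of the data.
    mu = np.percentile(x, [25, 75])
    sigma = np.full(2, x.std() + 1e-6)
    weight = np.array([0.5, 0.5])
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: weighted density of every point under each component.
        z = (x[:, None] - mu) / sigma
        dens = weight * np.exp(-0.5 * z ** 2) / (sigma * np.sqrt(2 * np.pi))
        total = dens.sum(axis=1, keepdims=True) + 1e-300
        resp = dens / total
        # M-step: update weights, means, and standard deviations.
        nk = resp.sum(axis=0)
        weight = nk / x.size
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        sigma = np.sqrt(var) + 1e-6
        # Stop once the log-likelihood no longer changes.
        ll = float(np.log(total).sum())
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return mu, sigma, weight


def deconvolve_pair(intensities):
    """Return (majority-gene estimate, minority-gene estimate).

    Assumption: the gene contributing more observations is the one
    whose mixture component has the larger fitted weight.
    """
    mu, _, weight = fit_two_gaussians(intensities)
    major = int(np.argmax(weight))
    return float(mu[major]), float(mu[1 - major])


# Simulated composite measurement (hypothetical values, 2:1 mixing).
rng = np.random.default_rng(0)
gene_a = rng.normal(8.0, 0.3, size=200)   # majority gene
gene_b = rng.normal(10.5, 0.3, size=100)  # minority gene
composite = np.concatenate([gene_a, gene_b])

est_a, est_b = deconvolve_pair(composite)
print(f"gene A estimate: {est_a:.2f}, gene B estimate: {est_b:.2f}")
```

In a benchmark of the kind described above, estimates such as `est_a` and `est_b` would be compared with direct single-gene measurements from the same samples. A regression-based approach would instead learn a mapping from features of the composite distribution to those direct measurements.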
