DeconRNASeq: a statistical framework for deconvolution of heterogeneous tissue samples based on mRNA-Seq data

SUMMARY: For heterogeneous tissues, measurements of gene expression through mRNA-Seq data are confounded by the relative proportions of the cell types involved. In this note, we introduce DeconRNASeq, an R package for deconvolution of heterogeneous tissues based on mRNA-Seq data. It adopts a globally optimized non-negative decomposition algorithm, solved through quadratic programming, to estimate the mixing proportions of distinct tissue types in next-generation sequencing data. We demonstrated the feasibility and validity of DeconRNASeq across a range of mixing levels and sources using mRNA-Seq data mixed in silico at known concentrations. We validated our computational approach on various benchmark datasets, obtaining high correlation between the predicted cell proportions and the real tissue fractions. Our study provides a rigorous, quantitative and high-resolution tool as a prerequisite for using mRNA-Seq data. The modular package design allows easy deployment of custom analytical pipelines for data from other high-throughput platforms.

AVAILABILITY: DeconRNASeq is written in R, and is freely available at http://bioconductor.org/packages.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
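The non-negative decomposition described in the summary can be illustrated with a small sketch. It is not the DeconRNASeq implementation: the package is written in R and its solver is not shown here. The sketch assumes the common formulation in which a mixed sample m is modeled as S p, where S holds one signature expression column per tissue type and p holds the proportions, with p >= 0 and the entries of p summing to 1. Instead of a dedicated quadratic programming solver, it uses projected gradient descent, a simple method that solves the same constrained least-squares problem. The function names, the toy signature matrix and the iteration count are all hypothetical.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {p : p >= 0, sum(p) == 1}."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for i, ui in enumerate(u, start=1):
        cumulative += ui
        t = (cumulative - 1.0) / i
        if ui - t > 0:
            theta = t  # keep the threshold from the last index that qualifies
    return [max(x - theta, 0.0) for x in v]


def deconvolve(signatures, mixture, iterations=2000):
    """Estimate mixing proportions p minimizing ||S p - m||^2 on the simplex.

    signatures: list of rows (one per gene), each with one value per tissue type.
    mixture:    list with one observed expression value per gene.
    """
    n_types = len(signatures[0])
    # Safe step size: the largest eigenvalue of 2 * S^T S is at most
    # twice the sum of squared entries of S (its trace).
    step = 1.0 / (2.0 * sum(x * x for row in signatures for x in row))
    p = [1.0 / n_types] * n_types  # start from equal proportions
    for _ in range(iterations):
        residual = [
            sum(s * pj for s, pj in zip(row, p)) - m
            for row, m in zip(signatures, mixture)
        ]
        gradient = [
            2.0 * sum(row[j] * r for row, r in zip(signatures, residual))
            for j in range(n_types)
        ]
        p = project_to_simplex([pj - step * g for pj, g in zip(p, gradient)])
    return p


# Toy data (assumed, not from the paper): 4 genes, 3 tissue types.
SIGNATURES = [
    [10.0, 1.0, 1.0],
    [1.0, 10.0, 1.0],
    [1.0, 1.0, 10.0],
    [5.0, 5.0, 5.0],
]
TRUE_PROPORTIONS = [0.2, 0.5, 0.3]
# In silico mixture at known proportions, mirroring the validation idea.
MIXTURE = [
    sum(s * p for s, p in zip(row, TRUE_PROPORTIONS)) for row in SIGNATURES
]

if __name__ == "__main__":
    estimate = deconvolve(SIGNATURES, MIXTURE)
    print([round(x, 3) for x in estimate])
```

Because the toy mixture is generated from the signatures without noise, the estimate should recover the known proportions closely; with real mRNA-Seq data the fit is approximate and depends on how well the signature genes separate the tissue types.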
