Beta-Poisson model for single-cell RNA-seq data analyses

MOTIVATION: Single-cell RNA-sequencing technology allows detection of gene expression at the single-cell level. One typical feature of the data is a bimodality in the cellular distribution even for highly expressed genes, primarily caused by a proportion of non-expressing cells. The standard and the over-dispersed gamma-Poisson models that are commonly used in bulk-cell RNA-sequencing are not able to capture this property.

RESULTS: We introduce a beta-Poisson mixture model that can capture the bimodality of the single-cell gene expression distribution. We further integrate the model into the generalized linear model framework in order to perform differential expression analyses. The whole analytical procedure is called BPSC. The results from several real single-cell RNA-seq datasets indicate that ∼90% of the transcripts are well characterized by the beta-Poisson model; the model-fit from BPSC is better than the fit of the standard gamma-Poisson model in >80% of the transcripts. Moreover, in differential expression analyses of simulated and real datasets, BPSC performs well against edgeR, a conventional method widely used in bulk-cell RNA-sequencing data, and against scde and MAST, two recent methods specifically designed for single-cell RNA-seq data.

AVAILABILITY AND IMPLEMENTATION: An R package BPSC for model fitting and differential expression analyses of single-cell RNA-seq data is available under GPL-3 license at https://github.com/nghiavtr/BPSC

CONTACT: yudi.pawitan@ki.se or mattias.rantalainen@ki.se

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
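To make the beta-Poisson idea concrete, the following is a minimal simulation sketch, not the BPSC package's implementation or parameterization. It assumes a generic three-parameter form in which each cell draws an expression fraction p from Beta(alpha, beta) and its count from Poisson(lam * p). The function names and the parameter values (alpha = beta = 0.5, lam = 50) are hypothetical, chosen only to illustrate how such a mixture can yield many zero-count cells alongside highly expressing ones.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method.

    Adequate for the moderate rates used in this sketch (lam <= 50).
    """
    threshold = math.exp(-lam)
    k = 0
    prod = rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def sample_beta_poisson(alpha, beta, lam, n, rng):
    """Draw n counts: p ~ Beta(alpha, beta), then count ~ Poisson(lam * p)."""
    return [sample_poisson(lam * rng.betavariate(alpha, beta), rng)
            for _ in range(n)]


# Hypothetical parameter values, for illustration only.
rng = random.Random(0)
counts = sample_beta_poisson(alpha=0.5, beta=0.5, lam=50.0, n=5000, rng=rng)

mean_count = sum(counts) / len(counts)
zero_fraction = sum(1 for c in counts if c == 0) / len(counts)

# Zero probability of a plain Poisson model with the same mean, for comparison.
poisson_zero_prob = math.exp(-mean_count)

print(f"mean count: {mean_count:.2f}")
print(f"fraction of zero-count cells (beta-Poisson sample): {zero_fraction:.3f}")
print(f"zero probability under Poisson with same mean: {poisson_zero_prob:.2e}")
```

With alpha and beta both below 1, the Beta distribution places most of its mass near 0 and 1, so the simulated gene is highly expressed on average while a noticeable share of cells still show zero counts. A single Poisson distribution with the same mean assigns almost no probability to zero.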
