Missing Value Imputation for RNA-Sequencing Data Using Statistical Models: A Comparative Study

RNA-seq technology has been widely used as an alternative to traditional microarrays in transcript analysis. Gene expression profiling by sequencing, which generates RNA-seq data sets, may sometimes yield missing read counts. These missing values can adversely affect downstream analyses. Most methods for analysing RNA-seq data sets require a complete data matrix. In the past few years, researchers have put a great deal of effort into evaluating different imputation algorithms for microarray gene expression data sets. However, there are few such works for RNA-seq data sets, and a comparative study of the performance of missing value imputation for RNA-seq data is essential. In this paper, we propose the use of several parametric models, namely regression imputation, the Bayesian generalized linear model, the Poisson mixture model, the EM approach, Bayesian Poisson regression, Bayesian quasi-Poisson regression, and the bootstrap versions of the latter two, for single imputation of missing values in RNA-seq count data sets. The approaches are also applied to identifying differentially expressed genes in the presence of missing values. Multiple imputation, proposed by Rubin (1978), is also used to impute missing RNA-seq counts. This approach allows appropriate assessment of imputation uncertainty for missing values. The performance of the single and multiple imputation approaches is investigated using simulation studies. Also, some real data sets are analyzed using the proposed approaches.
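As an illustration of how multiple imputation lets imputation uncertainty be assessed, the following is a minimal sketch of Rubin's combining rules. It assumes that some quantity of interest (for example, a log fold change for one gene) has already been estimated on each of m imputed data sets. The function name `pool_rubin` and the example numbers are hypothetical and are not taken from the paper.

```python
import numpy as np


def pool_rubin(estimates, variances):
    """Pool m point estimates and their variances with Rubin's rules.

    estimates: one point estimate per imputed data set
    variances: the squared standard error of each estimate
    Returns (pooled estimate, within, between, total variance).
    """
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = estimates.size
    if m < 2 or variances.size != m:
        raise ValueError("need at least two imputations with matching variances")

    pooled = estimates.mean()            # average of the m estimates
    within = variances.mean()            # average within-imputation variance
    between = estimates.var(ddof=1)      # variance between imputations
    total = within + (1.0 + 1.0 / m) * between
    return pooled, within, between, total


# Hypothetical estimates from three imputed data sets (illustrative values only).
pooled, within, between, total = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The between-imputation term is what a single imputation cannot provide: when the imputed data sets disagree, the total variance grows accordingly.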

[1] Ming Ouyang, et al. DNA microarray data imputation and significance analysis of differential expression, 2005, Bioinform.

[2] J. Schafer, et al. Missing data: our view of the state of the art, 2002, Psychological methods.

[3] Gilles Celeux, et al. Clustering high-throughput sequencing data with Poisson mixture models, 2011.

[4] Russ B. Altman, et al. Missing value estimation methods for DNA microarrays, 2001, Bioinform.

[5] Kristian Kleinke, et al. countimp 1.0 - A multiple imputation package for incomplete count data (technical report), 2011.

[6] J. Hilbe. Negative Binomial Regression: Preface, 2007.

[7] Iqbal Gondal, et al. Collateral missing value imputation: a new robust missing value estimation algorithm for microarray data, 2005, Bioinform.

[8] Steven P Lund, et al. Detecting Differential Expression in RNA-sequence Data Using Quasi-likelihood with Shrunken Dispersion Estimates, 2012, Statistical Applications in Genetics and Molecular Biology.

[9] Russell V. Lenth, et al. Statistical Analysis With Missing Data (2nd ed.) (Book), 2004.

[10] Hong Yan, et al. Missing value imputation for gene expression data: computational techniques to recover missing data from available information, 2011, Briefings Bioinform.

[11] G. Schwarz. Estimating the Dimension of a Model, 1978.

[12] Terence P. Speed, et al. Comparison of Methods for Image Analysis on cDNA Microarray Data, 2002.

[13] M. Emre Celebi, et al. Partitional Clustering Algorithms, 2014.

[14] Joseph L Schafer, et al. Analysis of Incomplete Multivariate Data, 1997.

[15] P. McCullagh, et al. Generalized Linear Models, 1984.

[16] Xiaofeng Song, et al. Sequential local least squares imputation estimating missing value of microarray data, 2008, Comput. Biol. Medicine.

[17] Roderick J. A. Little, et al. Statistical Analysis with Missing Data, 1988.

[18] Joseph K. Pickrell, et al. Understanding mechanisms underlying human gene expression variation with RNA sequencing, 2010, Nature.

[19] Xuegong Zhang, et al. DEGseq: an R package for identifying differentially expressed genes from RNA-seq data, 2010, Bioinform.

[20] Mou'ath Hourani, et al. Microarray missing values imputation methods: Critical analysis review, 2009, Comput. Sci. Inf. Syst.

[21] Fritz Scheuren, et al. Hot Deck Imputation Procedure Applied to Double Sampling Design, 1986.

[22] Nicolas Delhomme, et al. easyRNASeq: a bioconductor package for processing RNA-Seq data, 2012, Bioinform.

[23] Mohd Saberi Mohamad, et al. A Review on Missing Value Imputation Algorithms for Microarray Gene Expression Data, 2014.

[24] D. Rubin. Inference and Missing Data, 1975.

[25] Roger A. Sugden, et al. Multiple Imputation for Nonresponse in Surveys, 1988.

[26] D. Rubin, et al. Multiple Imputations in Sample Surveys - A Phenomenological Bayesian Approach to Nonresponse, 2002.

[27] R. Guigó, et al. Transcriptome genetics using second generation sequencing in a Caucasian population, 2010, Nature.

[28] Gérard Govaert, et al. Assessing a Mixture Model for Clustering with the Integrated Completed Likelihood, 2000, IEEE Trans. Pattern Anal. Mach. Intell.

[29] Tero Aittokallio, et al. Dealing with missing values in large-scale studies: microarray data imputation and beyond, 2010, Briefings Bioinform.

[30] Shin Ishii, et al. A Bayesian missing value estimation method for gene expression profile data, 2003, Bioinform.

[31] Peng Liu, et al. Model-based clustering for RNA-seq data, 2014, Bioinform.

[32] J. Gower. A General Coefficient of Similarity and Some of Its Properties, 1971.

[33] Lígia P. Brás, et al. Improving cluster-based missing value estimation of DNA microarray data, 2007, Biomolecular engineering.

[34] Allan R. Wilks, et al. The new S language: a programming environment for data analysis and graphics, 1988.

[35] Gene H. Golub, et al. Missing value estimation for DNA microarray gene expression data: local least squares imputation, 2005, Bioinform.

[36] Sandrine Dudoit, et al. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments, 2010, BMC Bioinformatics.

[37] M. Stephens, et al. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays, 2008, Genome research.