RNA-seq data analysis using nonparametric Gaussian process models

This paper introduces an approach to classifying RNA-seq read count data using Gaussian process (GP) models. RNA-seq data are first transformed into microarray-like data, and the two-sample t-test is then applied for gene selection. The GP is designed as a classifier that takes the discriminant genes selected by the t-test as inputs. The proposed approach is validated on two real benchmark datasets using five-fold cross-validation. The classifiers are evaluated with several performance metrics: accuracy rate, F-measure, area under the ROC curve, and mutual information. Experimental results show that the GP classifier significantly outperforms the competing methods, namely k-nearest neighbors, multilayer perceptron, support vector machine, and the ensemble learning method AdaBoost. The proposed approach can therefore be applied effectively in practice to RNA-seq data analysis, which is useful in many applications related to disease diagnosis and monitoring at the molecular level.
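
The gene selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the transformation or the t-test variant, so the sketch assumes a log2(count + 1) transform and Welch's unequal-variance form of the two-sample t-test. The function names (`log_transform`, `welch_t`, `select_genes`) and the toy counts are hypothetical, chosen only for illustration.

```python
import math
from statistics import mean, variance


def log_transform(counts):
    """Map raw read counts to log2(count + 1).

    Assumption: the abstract does not name the transform it uses.
    """
    return [[math.log2(c + 1) for c in sample] for sample in counts]


def welch_t(a, b):
    """Welch's two-sample t statistic for two lists of values."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def select_genes(samples, labels, k):
    """Return indices of the k genes with the largest |t| between classes 0 and 1."""
    n_genes = len(samples[0])
    scores = []
    for g in range(n_genes):
        a = [s[g] for s, y in zip(samples, labels) if y == 0]
        b = [s[g] for s, y in zip(samples, labels) if y == 1]
        scores.append(abs(welch_t(a, b)))
    ranked = sorted(range(n_genes), key=lambda g: scores[g], reverse=True)
    return ranked[:k]


# Toy data (made up): 6 samples x 4 genes; only gene 2 differs between classes.
counts = [
    [10, 200, 5, 50],
    [12, 180, 7, 55],
    [9, 210, 6, 48],
    [11, 190, 400, 52],
    [10, 205, 380, 49],
    [13, 195, 420, 51],
]
labels = [0, 0, 0, 1, 1, 1]

selected = select_genes(log_transform(counts), labels, k=1)
print(selected)  # prints [2]
```

In a full pipeline, the selected gene indices would define the input features passed to the classifier, and the selection would be repeated inside each cross-validation fold so that the held-out samples do not influence which genes are chosen.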
