RNA-seq data analysis using nonparametric Gaussian process models

This paper introduces an approach to classifying RNA-seq read count data using Gaussian process (GP) models. The RNA-seq data are first transformed into microarray-like data, and a two-sample t-test is then applied for gene selection. The GP is designed as a classifier that takes the discriminant genes selected by the t-test as inputs. The approach is verified on two real benchmark datasets using five-fold cross-validation. The classifiers are evaluated with several performance metrics: accuracy rate, F-measure, area under the ROC curve, and mutual information. Experimental results show that the GP classifier significantly outperforms the competing methods, namely k-nearest neighbors, multilayer perceptron, support vector machine, and the ensemble learning method AdaBoost. The proposed approach can therefore be applied effectively in practice for RNA-seq data analysis, which is useful in many applications related to disease diagnosis and monitoring at the molecular level.
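The pipeline described in the abstract (transform, t-test gene selection, GP classification, five-fold cross-validation) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the transformation, so a log2 counts-per-million transform is assumed here; the data are simulated negative binomial counts rather than the benchmark datasets; the RBF kernel, the number of selected genes, and all function names (`log_cpm`, `select_genes`, `cross_validate_gp`, `simulate_counts`) are hypothetical choices. The classifier is scikit-learn's `GaussianProcessClassifier`, and the mutual information metric is omitted for brevity.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold


def log_cpm(counts):
    """Assumed transform: log2 counts per million (rows = samples, columns = genes)."""
    lib_size = counts.sum(axis=1, keepdims=True)
    return np.log2((counts + 0.5) / (lib_size + 1.0) * 1e6)


def select_genes(X, y, k):
    """Return indices of the k genes with the largest absolute two-sample t-statistic."""
    t, _ = ttest_ind(X[y == 0], X[y == 1], axis=0, equal_var=False)
    return np.argsort(-np.abs(np.nan_to_num(t)))[:k]


def cross_validate_gp(counts, y, k_genes=20, n_splits=5, seed=0):
    """Five-fold CV; genes are re-selected on each training fold to avoid leakage."""
    X = log_cpm(counts)
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = {"accuracy": [], "f1": [], "auc": []}
    for train, test in folds.split(X, y):
        genes = select_genes(X[train], y[train], k_genes)
        # Standardize using training-fold statistics only.
        mu = X[train][:, genes].mean(axis=0)
        sd = X[train][:, genes].std(axis=0) + 1e-8
        X_train = (X[train][:, genes] - mu) / sd
        X_test = (X[test][:, genes] - mu) / sd
        # The RBF kernel is an assumption; the abstract does not specify one.
        gp = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0),
                                       random_state=seed)
        gp.fit(X_train, y[train])
        prob = gp.predict_proba(X_test)[:, 1]
        pred = (prob >= 0.5).astype(int)
        scores["accuracy"].append(accuracy_score(y[test], pred))
        scores["f1"].append(f1_score(y[test], pred))
        scores["auc"].append(roc_auc_score(y[test], prob))
    return {name: float(np.mean(vals)) for name, vals in scores.items()}


def simulate_counts(n_per_class=30, n_genes=200, n_informative=10, seed=0):
    """Toy negative binomial counts; the first n_informative genes differ by class."""
    rng = np.random.default_rng(seed)
    y = np.repeat([0, 1], n_per_class)
    mean = np.full((2 * n_per_class, n_genes), 50.0)
    mean[y == 1, :n_informative] *= 4.0
    # With n = 5 and p = 5 / (5 + mean), the expected count equals `mean`.
    counts = rng.negative_binomial(n=5, p=5.0 / (5.0 + mean))
    return counts, y


if __name__ == "__main__":
    counts, y = simulate_counts()
    print(cross_validate_gp(counts, y))
```

Selecting genes inside each training fold, rather than once on the full dataset, keeps the held-out samples from influencing feature selection. Whether the paper does this is not stated in the abstract.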

Saeid Nahavandi | Abbas Khosravi | Douglas C. Creighton | Thanh Nguyen
