RNA-Seq Count Data Modelling by Grey Relational Analysis and Nonparametric Gaussian Process

This paper introduces an approach to classifying RNA-seq read counts using grey relational analysis (GRA) and Bayesian Gaussian process (GP) models. Read counts are first transformed to microarray-like data so that normal-based statistical methods can be applied. GRA then selects differentially expressed genes by integrating the outcomes of five individual feature selection methods: the two-sample t-test, the entropy test, the Bhattacharyya distance, the Wilcoxon test and the receiver operating characteristic (ROC) curve. Acting as an aggregate filter, GRA combines the advantages of the individual methods to produce significant feature subsets, which are then fed into a nonparametric GP model for classification. The proposed approach is verified on two real benchmark datasets using five-fold cross-validation. Experimental results show that both the GRA-based feature selection method and the GP classifier outperform their competing methods. Moreover, GRA-GP considerably outperforms the sparse Poisson linear discriminant analysis classifiers, which were introduced specifically for read counts, across different numbers of features. The proposed approach can therefore be applied effectively in practice to read count data analysis, which is useful in many applications, including understanding disease pathogenesis, diagnosis and treatment monitoring at the molecular level.
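The aggregation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the count transformation, the scoring formulas or the GRA settings, so the `log2(count + 1)` transform, the min-max normalization, the all-ones reference sequence and the distinguishing coefficient `zeta = 0.5` are all assumptions. For brevity only three of the five criteria (t-test, Bhattacharyya distance, ROC) are scored, the data are simulated, and every function name is hypothetical.

```python
import numpy as np

EPS = 1e-12  # guards against zero variances (assumption, not from the paper)

def t_score(a, b):
    """Absolute Welch t-statistic per gene; a and b are (samples, genes)."""
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / (se + EPS)

def bhattacharyya_score(a, b):
    """Bhattacharyya distance per gene, assuming univariate Gaussian classes."""
    va = a.var(axis=0, ddof=1) + EPS
    vb = b.var(axis=0, ddof=1) + EPS
    mean_term = 0.25 * (a.mean(axis=0) - b.mean(axis=0)) ** 2 / (va + vb)
    var_term = 0.5 * np.log((va + vb) / (2.0 * np.sqrt(va * vb)))
    return mean_term + var_term

def auc_score(a, b):
    """|AUC - 0.5| per gene; AUC from all sample pairs, ties counted as 0.5."""
    greater = (b[:, None, :] > a[None, :, :]).mean(axis=(0, 1))
    equal = (b[:, None, :] == a[None, :, :]).mean(axis=(0, 1))
    return np.abs(greater + 0.5 * equal - 0.5)

def grey_relational_grade(scores, zeta=0.5):
    """Aggregate a (genes, criteria) score matrix into one grade per gene.

    Larger scores are treated as better. Each criterion is min-max scaled,
    compared with an ideal reference sequence of ones, and the grey
    relational coefficients are averaged with equal weights.
    """
    lo, hi = scores.min(axis=0), scores.max(axis=0)
    norm = (scores - lo) / np.where(hi > lo, hi - lo, 1.0)
    delta = np.abs(1.0 - norm)  # deviation from the reference sequence
    dmin, dmax = delta.min(), delta.max()
    coeff = (dmin + zeta * dmax) / (delta + zeta * dmax)
    return coeff.mean(axis=1)

# Simulated read counts: 20 samples per class, 50 genes,
# with genes 0..4 differentially expressed in class 1.
rng = np.random.default_rng(0)
n_samples, n_genes = 20, 50
lam0 = np.full(n_genes, 50.0)
lam1 = lam0.copy()
lam1[:5] = 200.0
counts0 = rng.poisson(lam0, size=(n_samples, n_genes))
counts1 = rng.poisson(lam1, size=(n_samples, n_genes))

# Assumed "microarray-like" transformation: log2 of counts plus one.
x0, x1 = np.log2(counts0 + 1.0), np.log2(counts1 + 1.0)

scores = np.column_stack([
    t_score(x0, x1),
    bhattacharyya_score(x0, x1),
    auc_score(x0, x1),
])
grades = grey_relational_grade(scores)
top5 = np.argsort(grades)[::-1][:5]  # indices of the five highest grades
print(sorted(top5.tolist()))
```

In this sketch each criterion contributes equally to the grade, so a gene ranks highly only when several filters agree that it is discriminative. The selected columns could then be passed to any GP classifier and evaluated with five-fold cross-validation; that stage is left out here.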
