ReliefSeq: A Gene-Wise Adaptive-K Nearest-Neighbor Feature Selection Tool for Finding Gene-Gene Interactions and Main Effects in mRNA-Seq Gene Expression Data

Relief-F is a nonparametric, nearest-neighbor machine learning method that has been successfully used to identify relevant variables that may interact in complex multivariate models to explain phenotypic variation. While several tools have been developed for assessing differential expression in sequence-based transcriptomics, the detection of statistical interactions between transcripts has received less attention in RNA-Seq analysis. We describe a new extension and assessment of Relief-F for feature selection in RNA-Seq data. The ReliefSeq implementation adapts the number of nearest neighbors (k) for each gene to optimize the Relief-F test statistics (importance scores) for finding both main effects and interactions. We compare this gene-wise adaptive-k (gwak) Relief-F method with standard RNA-Seq feature selection tools, such as DESeq and edgeR, and with the popular machine learning method Random Forests. We demonstrate performance on a panel of simulated data sets that span a range of distributional properties found in real mRNA-Seq data, including multiple transcripts with varying sizes of main effects and interaction effects. For simulated main effects, gwak-Relief-F feature selection performs comparably to the standard tools DESeq and edgeR for ranking relevant transcripts. For gene-gene interactions, gwak-Relief-F outperforms all comparison methods at ranking relevant genes in all but the highest fold change/highest signal situations, where it performs similarly. The gwak-Relief-F algorithm outperforms Random Forests for detecting relevant genes in all simulation experiments. In addition, Relief-F is comparable to the other methods in computational time. We also apply ReliefSeq to an RNA-Seq study of smallpox vaccine to identify gene expression changes between vaccinia virus-stimulated and unstimulated samples.
ReliefSeq is an attractive tool for inclusion in the suite of tools used for analysis of mRNA-Seq data; it has power to detect both main effects and interaction effects. Software Availability: http://insilico.utulsa.edu/ReliefSeq.php.
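To make the gene-wise adaptive-k idea concrete, the following is a minimal Python sketch, not the ReliefSeq implementation. It assumes a two-class phenotype, min-max scaled expression values, Manhattan distance, and that "optimizing" the importance score means taking, for each gene, the maximum Relief-F score over a candidate grid of k values. The function names `relieff_scores` and `gwak_relieff` and the parameter `k_values` are hypothetical and chosen only for illustration.

```python
import numpy as np


def relieff_scores(X, y, k):
    """Relief-F importance scores for a two-class problem (illustrative sketch).

    X: (samples x genes) expression matrix; y: class labels; k: neighbors per class.
    Assumes every class has at least two samples.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, p = X.shape
    # Min-max scale each gene so per-gene differences lie in [0, 1].
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    Z = (X - X.min(axis=0)) / span
    # Pairwise Manhattan distances; a sample is never its own neighbor.
    # Note: this builds an n x n x p array, so it suits small examples only.
    D = np.abs(Z[:, None, :] - Z[None, :, :]).sum(axis=2)
    np.fill_diagonal(D, np.inf)
    w = np.zeros(p)
    for i in range(n):
        same = np.flatnonzero(y == y[i])
        same = same[same != i]
        other = np.flatnonzero(y != y[i])
        hits = same[np.argsort(D[i, same])[:k]]       # k nearest same-class samples
        misses = other[np.argsort(D[i, other])[:k]]   # k nearest other-class samples
        # Reward genes that differ across classes, penalize those that differ within.
        w += np.abs(Z[i] - Z[misses]).mean(axis=0) - np.abs(Z[i] - Z[hits]).mean(axis=0)
    return w / n


def gwak_relieff(X, y, k_values):
    """Assumed gene-wise adaptive-k rule: keep each gene's best score over k_values."""
    all_scores = np.vstack([relieff_scores(X, y, k) for k in k_values])
    best = all_scores.argmax(axis=0)
    return all_scores.max(axis=0), np.asarray(k_values)[best]
```

Calling `gwak_relieff(X, y, [1, 3, 5, 10])` returns one score and one selected k per gene, and genes can then be ranked by score. Taking the per-gene maximum is only one plausible reading of "adapts k to optimize the importance scores"; the published method may use a different selection criterion.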
