DaMiRseq - an R/Bioconductor package for data mining of RNA-Seq data: normalization, feature selection and classification

Summary RNA-Seq is becoming the technique of choice for high-throughput transcriptome profiling, which, besides class comparison for differential expression, promises to be an effective and powerful tool for biomarker discovery. However, a systematic analysis of high-dimensional genomic data is a demanding task for such a purpose. DaMiRseq offers an organized, flexible and convenient framework to remove noise and bias, select the most informative features and perform accurate classification. Availability and implementation DaMiRseq is developed for the R environment (R ≥ 3.4) and is released under GPL (≥2) License. The package runs on Windows, Linux and Macintosh operating systems and is freely available to non-commercial users at the Bioconductor open-source, open-development software project repository (https://bioconductor.org/packages/DaMiRseq/). In compliance with Bioconductor standards, the authors ensure stable package maintenance through software and documentation updates. Contact luca.piacentini@ccfm.it. Supplementary information Supplementary data are available at Bioinformatics online.

[1]  Rafael A. Irizarry,et al.  quantro: a data-driven approach to guide the choice of an appropriate normalization method , 2015, Genome Biology.

[2]  I. E. Frank Intermediate least squares regression method , 1987 .

[3]  Verónica Bolón-Canedo,et al.  Feature Selection for High-Dimensional Data , 2015, Artificial Intelligence: Foundations, Theory, and Algorithms.

[4]  Fredrik Levander,et al.  Normalyzer: A Tool for Rapid Evaluation of Normalization Methods for Omics Data Sets , 2014, Journal of proteome research.

[5]  W. Huber,et al.  Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2 , 2014, Genome Biology.

[6]  Sandrine Dudoit,et al.  GC-Content Normalization for RNA-Seq Data , 2011, BMC Bioinformatics.

[7]  J. Leek Asymptotic Conditional Singular Value Decomposition for High‐Dimensional Genomic Data , 2011, Biometrics.

[8]  Andrew E. Jaffe,et al.  Practical impacts of genomic data “cleaning” on biological discovery using surrogate variable analysis , 2015, BMC Bioinformatics.

[9]  John D. Storey,et al.  Capturing Heterogeneity in Gene Expression Studies by Surrogate Variable Analysis , 2007, PLoS genetics.

[10]  William Stafford Noble,et al.  Machine learning applications in genetics and genomics , 2015, Nature Reviews Genetics.

[11]  Jun S. Liu,et al.  The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans , 2015, Science.

[12]  Lior Rokach,et al.  Ensemble-based classifiers , 2010, Artificial Intelligence Review.

[13]  A. Buja,et al.  Remarks on Parallel Analysis. , 1992, Multivariate behavioral research.

[14]  Cedric E. Ginestet ggplot2: Elegant Graphics for Data Analysis , 2011 .

[15]  Pedro Larrañaga,et al.  A review of feature selection techniques in bioinformatics , 2007, Bioinform..

[16]  Daniela M. Witten,et al.  Classification and clustering of sequencing data using a poisson model , 2011, 1202.6201.

[17]  R. Polikar,et al.  Ensemble based systems in decision making , 2006, IEEE Circuits and Systems Magazine.

[18]  Matthew E. Ritchie,et al.  limma powers differential expression analyses for RNA-sequencing and microarray studies , 2015, Nucleic acids research.

[19]  Dincer Goksuluk,et al.  A comprehensive simulation study on classification of RNA-Seq data , 2017, PloS one.

[20]  David M. Simcha,et al.  Tackling the widespread and critical impact of batch effects in high-throughput data , 2010, Nature Reviews Genetics.

[21]  Manfred K. Warmuth,et al.  The weighted majority algorithm , 1989, 30th Annual Symposium on Foundations of Computer Science.

[22]  S. Dudoit,et al.  Normalization of RNA-seq data using factor analysis of control genes or samples , 2014, Nature Biotechnology.

[23]  Xing Qiu,et al.  Correlation Between Gene Expression Levels and Limitations of the Empirical Bayes Methodology for Finding Differentially Expressed Genes , 2005, Statistical applications in genetics and molecular biology.

[24]  Tahir Mehmood,et al.  A review of variable selection methods in Partial Least Squares Regression , 2012 .

[25]  Andrew E. Jaffe,et al.  Bioinformatics Applications Note Gene Expression the Sva Package for Removing Batch Effects and Other Unwanted Variation in High-throughput Experiments , 2022 .

[26]  Mark D. Robinson,et al.  edgeR: a Bioconductor package for differential expression analysis of digital gene expression data , 2009, Bioinform..