Deep Neural Architectures for Highly Imbalanced Data in Bioinformatics

In the post-genome era, many bioinformatics problems involve large volumes of highly imbalanced data. A prominent case is the computational classification of precursor microRNAs (pre-miRNAs), in which a classifier is trained to identify the RNA sequences most likely to be miRNA precursors. The core difficulty is that known pre-miRNAs are very few compared with the hundreds of thousands of candidate sequences in a genome, so the positive class is vastly outnumbered. This imbalance strongly affects most standard classifiers and, if not properly addressed, leaves them unable to perform adequately in a real-life scenario. This work provides a comparative assessment of recent deep neural architectures for handling large, imbalanced data in the classification of pre-miRNAs. We present and analyze these architectures in a benchmark framework with animal and plant genomes, at increasing imbalance ratios of up to 1:2000. We also propose a new graphical way of comparing classifier performance under high class imbalance. The comparative results show that, at very high imbalance, deep belief neural networks can provide the best performance.
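To make the effect of the imbalance ratio concrete, the short sketch below computes the precision a classifier would be expected to reach as the number of negatives per positive grows toward 1:2000. The sensitivity and specificity values (0.90 and 0.99) and the function names are hypothetical, chosen only for illustration; they are not results or code from this work.

```python
def precision_at_imbalance(sensitivity, specificity, negatives_per_positive):
    """Expected precision when every positive comes with N negatives."""
    true_pos = sensitivity  # expected true positives per real positive
    false_pos = (1.0 - specificity) * negatives_per_positive
    return true_pos / (true_pos + false_pos)


def trivial_accuracy(negatives_per_positive):
    """Accuracy of a classifier that labels every sequence as negative."""
    return negatives_per_positive / (negatives_per_positive + 1.0)


if __name__ == "__main__":
    # Hypothetical operating point, assumed for illustration only.
    sensitivity, specificity = 0.90, 0.99
    for ratio in (1, 100, 1000, 2000):
        prec = precision_at_imbalance(sensitivity, specificity, ratio)
        acc = trivial_accuracy(ratio)
        print(f"1:{ratio:<5} precision={prec:.3f}  all-negative accuracy={acc:.4f}")
```

Under these assumed values, precision falls sharply as the ratio grows, while a classifier that rejects every candidate still scores near-perfect accuracy. This is one reason accuracy alone is a poor basis for comparing classifiers on highly imbalanced data.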
