A Novel Decomposing Model With Evolutionary Algorithms for Feature Selection in Long Non-Coding RNAs

Machine learning algorithms have been applied to numerous transcript datasets to identify long non-coding RNAs (lncRNAs). Before these algorithms can be applied to RNA data, however, features must be extracted from the original sequences. As many of these features can be redundant or irrelevant, the predictive performance of the algorithms can be improved by performing feature selection. However, most current approaches select features independently, ignoring possible relations among them. In this paper, we propose a new model that identifies the best feature subsets by removing unnecessary, irrelevant, and redundant predictive features while taking the importance of their co-occurrence into account. The proposed model is based on decomposing solutions and is called $k$-rounds of decomposition features. In this model, the least relevant features are suppressed according to their contribution to a classification task. To evaluate our proposal, we extract from datasets of 5 plant species a set of features based on sequence structure: GC content, k-mer (k = 1-6), sequence length, and Open Reading Frame. Next, we apply 5 metaheuristic approaches (Genetic Algorithm, ($\mu + \lambda$) Evolutionary Algorithm, Artificial Bee Colony, Ant Colony Optimization, and Particle Swarm Optimization) to select the best feature subsets. The main contribution of this work is to include in each metaheuristic a decomposition model that uses a round and voting scheme. To investigate its relevance, we use the REPTree classifier to assess the predictive capacity of each subset of features selected in 8 plant species. We found that including the proposed decomposition model significantly reduces the dimensionality of the datasets and improves predictive performance, regardless of the metaheuristic. Furthermore, the resulting pipeline was compared with five lncRNA approaches from the literature, where it also showed superior predictive performance. Finally, this study yields a new pipeline for finding a minimal number of features in lncRNAs and other biological sequences.
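As an illustration of the feature-extraction step described above, the following minimal Python sketch computes GC content, normalized k-mer frequencies for k = 1 to 6, sequence length, and the length of the longest Open Reading Frame for a single sequence. It is not the implementation used in this work: the function names, the normalization of k-mer counts by the number of valid windows, and the ORF definition (ATG to the first in-frame stop codon, forward strand only) are assumptions made for the example.

```python
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def kmer_frequencies(seq: str, k: int) -> dict:
    """Relative frequency of every possible k-mer over the alphabet ACGT.

    Windows containing other symbols (e.g. N) are skipped (assumption).
    """
    seq = seq.upper()
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
            total += 1
    if total == 0:
        return {kmer: 0.0 for kmer in counts}
    return {kmer: c / total for kmer, c in counts.items()}


def longest_orf_length(seq: str) -> int:
    """Length in nucleotides of the longest ORF, stop codon included.

    Assumption: an ORF runs from ATG to the first in-frame stop codon,
    and only the three forward reading frames are scanned.
    """
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def extract_features(seq: str) -> dict:
    """Collect all features of one sequence into a single name -> value mapping."""
    features = {
        "gc_content": gc_content(seq),
        "length": float(len(seq)),
        "orf_length": float(longest_orf_length(seq)),
    }
    for k in range(1, 7):  # k = 1..6
        features.update(kmer_frequencies(seq, k))
    return features


if __name__ == "__main__":
    example = extract_features("ATGAAACCCGGGTTTTAGACGT")
    print(len(example), example["gc_content"], example["orf_length"])
```

Under these assumptions each sequence is mapped to a fixed-length vector, which is the form a feature-selection method needs in order to compare feature subsets across sequences.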
