Feature extraction approaches for biological sequences: a comparative study of mathematical features

As a consequence of the various genomic sequencing projects, an increasing volume of biological sequence data is being produced. Although machine learning algorithms have been successfully applied to a large number of genomic sequence-related problems, the results are largely affected by the type and number of features extracted. This effect has motivated new algorithms and pipeline proposals, mainly involving feature extraction problems, in which extracting significant discriminatory information from a biological set is challenging. Considering this, our work proposes a new study of feature extraction approaches based on mathematical features (numerical mapping with Fourier, entropy and complex networks). As a case study, we analyze long non-coding RNA (lncRNA) sequences. We separated this work into three studies. First, we assessed our proposal on the problem most frequently addressed in our review, i.e., lncRNA versus mRNA; second, we validated the mathematical features on a different classification problem, predicting the class of lncRNA, e.g., circular RNA sequences; third, we analyzed the robustness of the approach in scenarios with imbalanced data. The experimental results demonstrated three main contributions: first, an in-depth study of several mathematical features; second, a new feature extraction pipeline; and third, its high performance and robustness for distinct RNA sequence classification problems. Availability: https://github.com/Bonidia/FeatureExtraction_BiologicalSequences.
