New Intraclass Helitrons Classification Using DNA-Image Sequences and Machine Learning Approaches

Abstract Helitrons are eukaryotic transposable elements (TEs) that transpose by a rolling-circle mechanism. They have been found in various species with highly variable copy numbers, and they sometimes make up a large portion of the host genome. Helitrons frequently capture host genes during their transposition. Since their discovery 18 years ago through computational analysis of the whole-genome sequences of the plant Arabidopsis thaliana and the nematode Caenorhabditis elegans (C. elegans), identifying and classifying these mobile genetic elements has remained a challenge because the vast majority of their families are non-autonomous. In the C. elegans genome, helitron DNA sequences vary greatly in length, from 11 to 8965 base pairs (bp). In this work, we develop a new method to predict helitron DNA sequences that is based on Frequency Chaos Game Representation (FCGR) DNA images. We introduce an automatic system that classifies helitron families in the C. elegans genome by combining machine learning approaches with features extracted from the DNA sequences. Two sets of helitron features are extracted: FCGR images and K-mers. These features consist of the occurrence frequencies of words of K nucleotides (tandem repeats) in the DNA sequences. Three different classifiers are used to classify all existing helitron families. A pre-trained neural network using the FCGR images as features reaches a global score of 72.7%. Using the K-mer features of the genomic sequences, the other two classifiers reach 68.7% for the Support Vector Machine (SVM) and 91.45% for the Random Forest (RF) algorithm.
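The two feature types named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `kmer_counts` and `fcgr`, the corner layout (A bottom-left, C top-left, G top-right, T bottom-right), the handling of non-ACGT symbols, and the normalization to relative frequencies are all assumptions made for illustration.

```python
from itertools import product

import numpy as np


def kmer_counts(seq, k):
    """Count overlapping k-mers over the alphabet A, C, G, T.

    Windows containing any other symbol (e.g. N) are skipped (assumption).
    """
    seq = seq.upper()
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:
            counts[word] += 1
    return counts


# Assumed corner layout as (x, y) bits; other layouts are equally valid.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(seq, k):
    """Return a 2^k x 2^k matrix of relative k-mer frequencies.

    Each k-mer maps to one cell. As in the chaos game, the last nucleotide
    of the word selects the coarsest quadrant, so it supplies the most
    significant bit of the cell coordinates.
    """
    n = 2 ** k
    img = np.zeros((n, n), dtype=float)
    for word, c in kmer_counts(seq, k).items():
        x = y = 0
        for ch in reversed(word):
            cx, cy = CORNERS[ch]
            x = 2 * x + cx
            y = 2 * y + cy
        img[y, x] += c
    total = img.sum()
    return img / total if total else img


if __name__ == "__main__":
    toy = "ACGTACGGTA"  # toy sequence, not real helitron data
    print(kmer_counts(toy, 2)["AC"])
    print(fcgr(toy, 3).shape)
```

The flattened k-mer count vector could serve as input to classifiers such as SVM or RF, while the FCGR matrix can be rendered as a grayscale image for a neural network; how the features were scaled or resized in the original work is not stated here.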
