Classification of Transposable Elements by Convolutional Neural Networks

The correct classification of transposable elements (TEs) in genomes is crucial to understanding the role of these elements and their consequences for the organisms that carry them. Here we present a method that classifies TEs by training a convolutional neural network (CNN) to label them by class, order and superfamily. Unlike previous work in the literature, the proposed method neither searches for sequence similarities nor relies on traditional machine learning classifiers. Instead, the CNN itself automatically extracts features and classifies the sequences. We performed an extensive experimental evaluation, analyzing the proposed method under different scenarios. It was able to classify TE sequences from various datasets into 9 different superfamilies and obtained an accuracy of \(94\%\). We also compare the proposed method with other state-of-the-art classification tools (PASTEC, REPCLASS and TECLASS). Our method presents very promising results, outperforming PASTEC and REPCLASS.
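
To make the idea of automatic feature extraction concrete, the following minimal sketch shows how a single convolutional filter could scan a DNA sequence. It is an illustration under stated assumptions, not the architecture evaluated here: the one-hot encoding over A, C, G, T, the helper names (`one_hot`, `conv1d`, `relu_max_pool`) and the hand-set "TATA" filter are all hypothetical. In a trained CNN the filter weights would be learned from labeled TE sequences rather than fixed by hand.

```python
# Illustrative sketch only: the encoding and filter below are assumptions,
# not the network described in the abstract.

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as rows of 4 values (A, C, G, T).

    Symbols outside ACGT (for example N) become all-zero rows.
    """
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def conv1d(encoded, kernel):
    """Slide one filter over the sequence (valid cross-correlation).

    `kernel` is a list of k rows with 4 weights each; the result has
    len(encoded) - k + 1 activations.
    """
    k = len(kernel)
    out = []
    for start in range(len(encoded) - k + 1):
        total = 0.0
        for offset in range(k):
            for channel in range(4):
                total += encoded[start + offset][channel] * kernel[offset][channel]
        out.append(total)
    return out


def relu_max_pool(activations):
    """Apply ReLU, then global max pooling; returns 0.0 for empty input."""
    return max((max(a, 0.0) for a in activations), default=0.0)


# Hypothetical filter whose weights match the motif "TATA" exactly.
tata_filter = one_hot("TATA")
scores = conv1d(one_hot("GGTATACC"), tata_filter)
feature = relu_max_pool(scores)
```

Here `scores` holds one activation per window of the input, and `feature` is the strongest response of the filter anywhere in the sequence. A classifier in this style would stack many such learned filters and feed the pooled features to output units, one per superfamily.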
