DeepTE: a computational method for de novo classification of transposons with convolutional neural network.

MOTIVATION Classification of transposable elements (TEs) is an essential step in decoding their roles in genome evolution. With a large number of genomes from non-model species becoming available, accurate and efficient TE classification has emerged as a new challenge in genomic sequence analysis.

RESULTS We developed a novel tool, DeepTE, which classifies unknown TEs using convolutional neural networks. DeepTE converts sequences into input vectors based on k-mer counts. It uses a tree-structured classification process in which eight models were trained to classify TEs into superfamilies and orders. DeepTE also detects domains inside TEs to correct false classifications. An additional model was trained to distinguish between non-TEs and TEs in plants. Given unclassified TEs from different species, DeepTE can classify them into seven orders, which include 15, 24, and 16 superfamilies in plants, metazoans, and fungi, respectively. In several benchmarking tests, DeepTE outperformed other existing tools for TE classification. In conclusion, DeepTE successfully leverages convolutional neural networks for TE classification and can be used to precisely classify TEs in newly sequenced eukaryotic genomes.

AVAILABILITY DeepTE is accessible at https://github.com/LiLabAtVT/DeepTE.

SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
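The k-mer counting step mentioned above can be illustrated with a minimal sketch. This is not DeepTE's code: the function name `kmer_count_vector`, the default `k=3`, the lexicographic ordering of k-mers, and the choice to skip k-mers containing ambiguous bases are all assumptions made for illustration, since the abstract does not state these details.

```python
from itertools import product


def kmer_count_vector(sequence, k=3):
    """Count overlapping k-mers over the alphabet ACGT.

    Hypothetical helper for illustration only; the value of k and the
    handling of ambiguous bases are assumptions, not DeepTE's settings.
    Returns a list of length 4**k, ordered lexicographically by k-mer.
    """
    alphabet = "ACGT"
    # Map every possible k-mer to a fixed position in the output vector.
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    counts = [0] * (4 ** k)
    seq = sequence.upper()
    for start in range(len(seq) - k + 1):
        pos = index.get(seq[start:start + k])
        if pos is not None:  # skip k-mers containing N or other symbols
            counts[pos] += 1
    return counts


if __name__ == "__main__":
    vec = kmer_count_vector("ACGTAC", k=2)
    print(len(vec), sum(vec))
```

Every sequence maps to a vector of the same length regardless of how long the TE is, which is what allows sequences of varying length to be fed to a network with a fixed input size.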
