A machine learning based framework to identify and classify long terminal repeat retrotransposons

Transposable elements (TEs) are repetitive nucleotide sequences that make up a large portion of eukaryotic genomes. They can move and duplicate within a genome, increasing genome size and contributing to genetic diversity within and across species. Accurate identification and classification of the TEs present in a genome is an important step towards understanding their effects on genes and their role in genome evolution. We introduce TE-Learner, a machine learning framework that automatically identifies TEs in a given genome and assigns a classification to them. We present an implementation of our framework for LTR retrotransposons, a type of TE characterized by long terminal repeats (LTRs) at their boundaries. We evaluate the predictive performance of our framework on the well-annotated genomes of Drosophila melanogaster and Arabidopsis thaliana, and we compare our results for three LTR retrotransposon superfamilies with those of three widely used methods for TE identification or classification: RepeatMasker, Censor and LtrDigest. In contrast to these methods, TE-Learner is the first to incorporate machine learning techniques. It outperforms them in predictive performance while learning models and making predictions efficiently. Moreover, we show that our method is able to identify TEs that none of the above methods could find, and we investigate those TE-Learner predictions that do not correspond to an official annotation. It turns out that many of these predictions are in fact strongly homologous to a known TE.
