Annotating transposable elements in the genome using relational decision tree ensembles

Transposable elements (TEs) are DNA sequences that can change their location within the genome. They contribute to genetic diversity within and across species, and their transposition mechanisms may also affect gene function. Accurate annotation of TEs is an important step towards understanding their effects on genes and their role in genome evolution. We introduce a framework for annotating TEs that is based on relational decision tree learning. It allows the structured data and biological processes involving TEs to be represented naturally. It also supports the integration of background knowledge and benefits from the interpretability of decision trees. Preliminary experiments show that our method outperforms two state-of-the-art systems for TE annotation.
