Benchmarking computational tools for polymorphic transposable element detection

Transposable elements (TEs) are an important source of human genetic variation with demonstrable effects on phenotype. A number of computational methods have recently been developed for detecting polymorphic TE (polyTE) insertion sites from next-generation sequence data, and such tools will become increasingly important as the pace of human genome sequencing accelerates. For this report, we performed a comparative benchmarking and validation analysis of polyTE detection tools in an effort to inform their selection and use by the TE research community. We analyzed a core set of seven tools with respect to ease of use and accessibility, polyTE detection performance, and runtime parameters. An experimentally validated set of 893 human polyTE insertions was used for this purpose, along with a series of simulated data sets that allowed us to assess the impact of sequence coverage on tool performance. The recently developed tool MELT showed the best overall performance, followed by Mobster and then RetroSeq. PolyTE detection tools perform best on Alu insertions in the human genome, are less reliable for L1 insertions, and perform substantially worse for SVA insertions. We also show evidence that different polyTE detection tools are complementary with respect to their ability to detect a complete set of insertion events. Accordingly, a combined approach, coupled with manual inspection of individual results, may yield the best overall performance. In addition to the benchmarking results, we provide notes on tool installation and usage, as well as suggestions for future polyTE detection algorithm development.
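
To make the notion of detection performance against a validated insertion set concrete, the following is a minimal sketch of how a tool's calls could be scored. It is an illustration only and not the procedure used in this report: the function name `score_calls`, the `(chrom, pos, family)` tuple layout, the 100 bp matching window, and the greedy nearest-site matching are all assumptions introduced here.

```python
def score_calls(predicted, validated, window=100):
    """Score predicted polyTE insertions against a validated set.

    Each site is a (chrom, pos, family) tuple. A prediction counts as a
    true positive if an unmatched validated site of the same family lies
    on the same chromosome within `window` bp. The window size and the
    greedy matching are illustrative assumptions, not values from the report.
    Returns (precision, recall, f1).
    """
    # Index validated positions by (chromosome, TE family).
    remaining = {}
    for chrom, pos, family in validated:
        remaining.setdefault((chrom, family), []).append(pos)

    true_pos = 0
    for chrom, pos, family in predicted:
        candidates = remaining.get((chrom, family), [])
        hits = [v for v in candidates if abs(v - pos) <= window]
        if hits:
            # Consume the nearest validated site so it is matched only once.
            nearest = min(hits, key=lambda v: abs(v - pos))
            candidates.remove(nearest)
            true_pos += 1

    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(validated) if validated else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data, invented purely for illustration.
validated_sites = [("chr1", 1000, "Alu"), ("chr1", 5000, "L1"), ("chr2", 300, "SVA")]
tool_calls = [("chr1", 1040, "Alu"), ("chr1", 9000, "L1")]
precision, recall, f1 = score_calls(tool_calls, validated_sites)
```

Under the same assumptions, a combined approach could be approximated by pooling the call sets from several tools before scoring, although duplicate calls for the same site would need to be merged first so that they do not depress precision.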
