Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases

Abstract: The widespread occurrence of repetitive stretches of DNA in the genomes of organisms across the tree of life poses fundamental challenges for sequencing, genome assembly, and automated annotation of genes and proteins. This multi-level problem can lead to errors in genome and protein databases that are often not recognized or acknowledged. As a consequence, end users working with sequences that contain repetitive regions are faced with ‘ready-to-use’ deposited data whose trustworthiness is difficult to determine, let alone quantify. Here, we review the problems associated with tandem repeat sequences that arise at different stages of the sequencing-assembly-annotation-deposition workflow and that may propagate through public databases, affecting all downstream analyses. As a case study, we provide examples from the Atlantic cod genome, whose sequencing and assembly were hindered by a particularly high prevalence of tandem repeats. We complement this case study with examples from other species in which mis-annotations and sequencing errors have propagated into protein databases. With this review, we aim to raise awareness among database users and to alert scientists working on the underlying workflow of database creation that the data they omit or improperly assemble may well contain important biological information valuable to others.
