Repeat or not repeat?—Statistical validation of tandem repeat prediction in genomic sequences

Tandem repeats (TRs) are among the most prevalent features of genomic sequences. Because of their abundance and functional significance, many detection tools have been devised over the last two decades. Despite this longstanding interest, TR detection remains an unresolved problem. Our large-scale tests reveal that current detectors produce different, often nonoverlapping inferences, reflecting characteristics of the underlying algorithms rather than the true distribution of TRs in genomic data. Our simulations show that the power to detect TRs depends on the degree of their divergence and on repeat characteristics such as the length of the minimal repeat unit and the number of units in tandem. To reconcile the diverse predictions of current algorithms, we propose and evaluate several statistical criteria for measuring the quality of predicted repeat units. In particular, we propose a model-based phylogenetic classifier, entailing a maximum-likelihood estimation of the repeat divergence. Applied in conjunction with state-of-the-art detectors, our statistical classification scheme for inferred repeats makes it possible to filter out false-positive predictions. Since different algorithms appear to specialize in predicting TRs with certain properties, we advise applying multiple detectors with subsequent filtering to obtain the most complete set of genuine repeats.
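The idea of pooling predictions from several detectors and then filtering them by estimated divergence can be illustrated with a minimal sketch. It is not the phylogenetic classifier described above: it assumes gap-free, pre-aligned DNA repeat units and replaces the phylogenetic model with the pairwise Jukes-Cantor (JC69) maximum-likelihood distance. The function names and the `max_divergence` threshold are hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations


def jc69_distance(a: str, b: str) -> float:
    """ML estimate of substitutions per site between two aligned units (JC69)."""
    if len(a) != len(b) or not a:
        raise ValueError("units must be aligned to the same non-zero length")
    # Proportion of mismatching sites between the two units.
    p = sum(x != y for x, y in zip(a, b)) / len(a)
    if p >= 0.75:
        # Saturation: the units look no more similar than random sequence.
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def mean_divergence(units) -> float:
    """Average pairwise JC69 distance over all repeat units of one prediction."""
    pairs = list(combinations(units, 2))
    return sum(jc69_distance(a, b) for a, b in pairs) / len(pairs)


def filter_predictions(predictions, max_divergence: float = 0.5):
    """Keep predictions with >= 2 units whose mean divergence is small enough.

    `predictions` is the pooled output of several detectors, each prediction
    being a list of aligned repeat units. The threshold is an arbitrary
    illustrative value, not one taken from the paper.
    """
    return [
        units
        for units in predictions
        if len(units) >= 2 and mean_divergence(units) <= max_divergence
    ]


if __name__ == "__main__":
    pooled = [
        ["ACGT", "ACGT", "ACGA"],  # similar units: plausible repeat
        ["ACGT", "TGCA"],          # unrelated units: likely false positive
    ]
    print(filter_predictions(pooled))
```

In this toy input the first prediction is kept and the second is discarded, because its units differ at every site and the distance estimate is saturated. A model-based classifier would instead estimate divergence on a phylogeny of the units and compare it against a null model of unrelated sequence.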
