Detecting short tandem repeats from genome data: opening the software black box

Short tandem repeats, specifically microsatellites, are widely used as genetic markers, are associated with human genetic diseases, and play an important role in various regulatory mechanisms and in evolution. Despite their importance, much remains unknown about their mutational dynamics. The increasing availability of genome data has led to several in silico studies of microsatellite evolution, which have produced a wide range of algorithms and software for tandem repeat detection. Documentation of these tools is often sparse or provided in a format that is impenetrable to most biologists without an informatics background. This article introduces the major concepts behind repeat-detecting software that are essential for informed tool selection. We reflect on issues such as parameter settings and program bias, as well as redundancy filtering and efficiency, using examples from the currently available range of programs, to provide an integrated comparison and practical guide to microsatellite-detecting programs.
