Kmer-SSR: a fast and exhaustive SSR search algorithm

Motivation: One of the main challenges with bioinformatics software is that the size and complexity of datasets necessitate a trade-off between speed and accuracy or completeness. To combat this problem of computational complexity, a plethora of heuristic algorithms have arisen that report a 'good enough' solution to biological questions. However, for Simple Sequence Repeats (SSRs), a 'good enough' solution may not accurately portray results in population genetics, phylogenetics and forensics, which require accurate SSRs to calculate intra- and inter-species interactions.

Results: We present Kmer-SSR, which finds all SSRs faster than most heuristic SSR identification algorithms in a parallelized, easy-to-use manner. The exhaustive Kmer-SSR option has 100% precision and 100% recall and accurately identifies every SSR of any specified length. To identify more biologically pertinent SSRs, we also developed several filters that allow users to easily view a subset of SSRs based on user input. Kmer-SSR, coupled with the filter options, accurately and intuitively identifies SSRs quickly and in a more user-friendly manner than any other SSR identification algorithm.

Availability and implementation: The source code is freely available on GitHub at https://github.com/ridgelab/Kmer-SSR.

Contact: perry.ridge@byu.edu
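
To make concrete what an exhaustive SSR search has to report, the following is a minimal brute-force sketch in Python. It is not the Kmer-SSR algorithm, which this abstract does not describe in detail; the function names `is_primitive` and `find_ssrs` and the parameters `min_k`, `max_k` and `min_repeats` are hypothetical and chosen only for illustration. The sketch checks every start position and every motif length, so it misses nothing but makes no attempt at speed.

```python
def is_primitive(motif):
    """Return True if the motif is not itself a repeat of a shorter unit."""
    k = len(motif)
    return not any(
        k % d == 0 and motif == motif[:d] * (k // d) for d in range(1, k)
    )


def find_ssrs(seq, min_k=1, max_k=6, min_repeats=2):
    """Naive exhaustive scan (illustrative only, not the Kmer-SSR method).

    Returns a list of (start, motif, repeats) for every tandem run whose
    motif length lies in [min_k, max_k] and that repeats at least
    min_repeats times. Rotations of a motif (e.g. AC and CA) are reported
    as separate runs.
    """
    seq = seq.upper()
    hits = []
    for k in range(min_k, max_k + 1):
        for i in range(len(seq) - k + 1):
            motif = seq[i:i + k]
            # Skip motifs such as "AA" or "ACAC"; they are found at smaller k.
            if not is_primitive(motif):
                continue
            # Skip positions that merely continue a run that started earlier.
            if i >= k and seq[i - k:i] == motif:
                continue
            repeats = 1
            j = i + k
            while seq[j:j + k] == motif:
                repeats += 1
                j += k
            if repeats >= min_repeats:
                hits.append((i, motif, repeats))
    return hits


if __name__ == "__main__":
    print(find_ssrs("GGACACACTT", min_k=2, max_k=2, min_repeats=3))
```

The `min_k`, `max_k` and `min_repeats` arguments play the role of simple user-supplied filters in this sketch; a practical tool would also need to merge overlapping or rotated hits and to scale to genome-sized input.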
