Clustering Tandem Repeats via Trinucleotides

Tandem repeats in DNA sequences play an important role in biological phenomena and in diagnostic tools. Programs that detect tandem repeats generate large volumes of output, which is often difficult to interpret without further organization. In this paper, we describe a new method for post-processing tandem repeats through clustering. We present several ways of representing tandem repeats with the n-gram model, combined with different distance measures for clustering. Analysis of the resulting clusters for chromosome 1 of the human genome shows that clustering tandem repeats according to their 3-grams yields well-defined clusters. Our new, alignment-free method facilitates the analysis of the many tandem repeats that occur in the human genome, and we believe that this work will lead to new discoveries about the roles, origins, and significance of tandem repeats.
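To make the 3-gram idea concrete, the following is a minimal sketch of an alignment-free comparison of tandem repeats. It is not the implementation used in the paper: the function names `ngram_profile` and `cosine_distance`, the normalization to relative frequencies, and the choice of cosine distance are all assumptions made for illustration, since the abstract does not specify which representations or distance measures were used.

```python
from collections import Counter
from math import sqrt


def ngram_profile(seq, n=3):
    """Return the relative frequency of each overlapping n-gram in seq.

    Hypothetical helper: one possible way to represent a tandem repeat
    as a 3-gram vector, stored as a sparse dict.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: c / total for gram, c in counts.items()}


def cosine_distance(p, q):
    """Cosine distance between two sparse n-gram profiles.

    Assumed distance measure; other measures could be swapped in.
    """
    dot = sum(v * q.get(gram, 0.0) for gram, v in p.items())
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 1.0  # treat an empty profile as maximally distant
    return 1.0 - dot / (norm_p * norm_q)


# Illustrative sequences, not data from the paper.
cag = ngram_profile("CAGCAGCAGCAG")
cag_rotated = ngram_profile("AGCAGCAGCAGC")  # same repeat, shifted by one base
at = ngram_profile("ATATATATATAT")

d_rotated = cosine_distance(cag, cag_rotated)
d_unrelated = cosine_distance(cag, at)
print(d_rotated, d_unrelated)
```

In this sketch, two copies of the same repeat that differ only by a one-base shift have nearly identical 3-gram profiles and therefore a small distance, while repeats that share no 3-grams are at distance 1. No alignment is computed at any point. A pairwise distance of this kind could then be passed to a standard clustering algorithm.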
