Large Scale Analysis of Small Repeats via Mining of the Human Genome

Small repetitive sequences, called tandem repeats, are abundant throughout the human genome, in both coding and non-coding regions. Their role is still mostly unknown, but at least 20 of these repetitive sequences have been related to neurodegenerative disorders. The mutational process underlying these disorders is not yet fully understood. Comprehending the origin, function and possible usefulness of tandem repeats will require analysis of large volumes of data from various sources. In this paper we attempt such a large-scale analysis of short repeats. We describe and discuss the steps needed to perform large-scale genomic analysis. We define tandem repeats and compare the results of repeat localization with genome annotations. We show that the degree of repetitiveness differs between the human chromosomes: chromosomes 19 and 17 have more repeats per megabase pair than any of the other chromosomes, and the Y chromosome has the fewest. We also demonstrate that some repeat motifs are much more common than others. Mono- and dinucleotide repeats are the most abundant, with A and AAC the most common motifs, while CG is hardly present within the genome. Repeats with unit length three are underrepresented in the genome and repeats with unit length 9 are extremely rare.
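To make the quantities in the abstract concrete, the following minimal Python sketch shows one way to locate perfect tandem repeats, count their motifs, and express a repeat count per megabase pair. It is an illustration only, not the method used in the paper: the function names, the regular-expression approach, and the thresholds (`max_unit`, `min_copies`, `min_length`) are all assumptions, and the sketch handles neither imperfect repeats nor equivalent motif rotations such as AC and CA.

```python
import re
from collections import Counter


def find_tandem_repeats(seq, max_unit=9, min_copies=3, min_length=6):
    """Return (start, end, motif, copies) for each perfect tandem repeat.

    All thresholds are illustrative assumptions, not values from the paper.
    The lazy quantifier makes the regex prefer the shortest repeat unit,
    so a run such as AAAAAA is reported with motif A rather than AA.
    """
    seq = seq.upper()
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    hits = []
    for match in pattern.finditer(seq):
        motif = match.group(1)
        length = match.end() - match.start()
        if length >= min_length:
            hits.append((match.start(), match.end(), motif, length // len(motif)))
    return hits


def motif_counts(hits):
    """Count how often each motif occurs among the detected repeats."""
    return Counter(motif for _, _, motif, _ in hits)


def repeats_per_megabase(n_repeats, sequence_length_bp):
    """Normalize a repeat count by sequence length in megabase pairs."""
    return n_repeats / (sequence_length_bp / 1_000_000)


if __name__ == "__main__":
    toy = "TTGACACACACGTAAAAAAACG"  # toy sequence, not genomic data
    found = find_tandem_repeats(toy)
    print(found)
    print(motif_counts(found))
```

Normalizing by sequence length, as in `repeats_per_megabase`, is what allows chromosomes of very different sizes to be compared on repeat density rather than on raw counts.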
