Direct mapping of symbolic DNA sequence into frequency domain in global repeat map algorithm

The main feature of the global repeat map (GRM) algorithm (www.hazu.hr/grm/software/win/grm2012.exe) is its ability to identify a broad variety of repeats of unbounded length that can be arbitrarily distant from one another, in sequences as large as human chromosomes. This efficacy is due to the use of a complete K-string ensemble, which enables a new method of direct mapping of a symbolic DNA sequence into the frequency domain, with straightforward identification of repeats as peaks in the GRM diagram. In this way, we obtain a very fast, efficient and highly automated repeat-finding tool. The method is robust to substitutions and insertions/deletions, as well as to various complexities of the sequence pattern. To illustrate its capabilities, we present several case studies of GRM use: identification of α-satellite tandem repeats and higher order repeats (HORs), identification of Alu dispersed repeats and Alu tandems, identification of the period-3 pattern in exons, implementation of a ‘magnifying glass’ effect, identification of a complex HOR pattern, identification of inter-tandem transitional dispersed repeat sequences, and identification of long segmental duplications. The GRM algorithm is particularly convenient for large repeat units, for highly mutated and/or complex repeats, and for global repeat maps of large genomic sequences (chromosomes and genomes).
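The idea of reading repeats off as peaks can be illustrated with a small sketch. This is not the GRM implementation, whose details are not given here; it only assumes that a repeat of period d makes many K-strings recur at distance d, so that d stands out in a histogram of distances between consecutive occurrences of the same K-string. The function names `kmer_distance_spectrum` and `top_periods`, the parameter `k`, and the example sequence are all hypothetical.

```python
from collections import Counter


def kmer_distance_spectrum(seq, k=6):
    """Count distances between consecutive occurrences of every k-mer.

    Illustrative only: an assumed stand-in for a repeat 'spectrum',
    not the actual GRM mapping.
    """
    last_seen = {}          # k-mer -> start position of its latest occurrence
    spectrum = Counter()    # distance -> number of k-mer pairs at that distance
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        prev = last_seen.get(kmer)
        if prev is not None:
            spectrum[i - prev] += 1
        last_seen[kmer] = i
    return spectrum


def top_periods(spectrum, n=3):
    """Return the n distances with the highest counts (the 'peaks')."""
    return [distance for distance, _ in spectrum.most_common(n)]


# Hypothetical example: a 17-bp unit repeated in tandem, with one substitution.
unit = "ACGTTGCAAGCTTAGGC"
seq = list(unit * 20)
seq[100] = "T"              # point mutation inside one copy
seq = "".join(seq)

print(top_periods(kmer_distance_spectrum(seq, k=6), n=1))
```

In this toy case a single substitution only disturbs the few K-strings that overlap it, so the dominant peak stays at the repeat period; that is the intuition behind robustness to point mutations, not a claim about how GRM achieves it.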
