Statistical methods for detecting periodic fragments in DNA sequence data

Background
Period-10 dinucleotides are structurally and functionally validated factors that influence the ability of DNA to form nucleosomes (DNA wrapped around histone core octamers). Robust identification of periodic signals in DNA sequences is therefore required to understand nucleosome organisation in genomes. While various techniques for identifying periodic components in genomic sequences have been proposed or adopted, the requirements for such techniques have not been considered in detail, and confirmatory testing for a priori specified periods has not been developed.

Results
We compared the estimation accuracy and suitability for confirmatory testing of autocorrelation, the discrete Fourier transform (DFT), the integer period discrete Fourier transform (IPDFT) and a previously proposed Hybrid measure. Several statistical significance procedures were evaluated, and a blockwise bootstrap proved superior. When applied to synthetic data whose period-10 signal had been eroded, or for which the signal was only approximately period-10, the Hybrid technique exhibited superior properties during exploratory period estimation. In contrast, confirmatory testing using the blockwise bootstrap procedure identified IPDFT as having the greatest statistical power. These properties were validated on yeast sequences defined from a ChIP-chip study, where the Hybrid metric confirmed the expected dominance of period-10 in nucleosome-associated DNA but IPDFT identified more significant occurrences of period-10. Application to the whole genomes of yeast and mouse identified ~21% and ~19%, respectively, of these genomes as spanned by period-10 nucleosome positioning sequences (NPS).

Conclusions
For estimating the dominant period, we find the Hybrid period estimation method empirically to be the most effective for both eroded and approximate periodicity. The blockwise bootstrap was found to be effective as a significance measure, performing particularly well in the problem of period detection in the presence of eroded periodicity. The autocorrelation method was identified as poorly suited for use with the blockwise bootstrap. Application of our methods to the genomes of two model organisms revealed that a striking proportion of the yeast and mouse genomes is spanned by NPS. Despite their markedly different sizes, roughly equivalent proportions (19-21%) of the genomes lie within period-10 spans of the NPS dinucleotides {AA, TT, TA}. The biological significance of these regions remains to be demonstrated. To facilitate this, the genomic coordinates are available as Additional files 1, 2, and 3 in a format suitable for visualisation as tracks on popular genome browsers.

Reviewers
This article was reviewed by Prof Tomas Radivoyevitch, Dr Vsevolod Makeev (nominated by Dr Mikhail Gelfand), and Dr Rob D Knight.
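As an illustration of the confirmatory test described in the Results, the following is a minimal sketch of an IPDFT statistic evaluated at an a priori period, combined with a blockwise bootstrap null. It is not the authors' implementation. The function names, the binary indicator encoding of {AA, TT, TA}, the block length of 3, and the number of resamples are all assumptions made for this example; the abstract does not specify them.

```python
import numpy as np

def indicator(seq, motifs=("AA", "TT", "TA")):
    """Binary signal: 1.0 where one of `motifs` starts, else 0.0 (assumed encoding)."""
    seq = seq.upper()
    return np.array([1.0 if seq[i:i + 2] in motifs else 0.0
                     for i in range(len(seq) - 1)])

def ipdft_power(x, period):
    """Squared magnitude of the Fourier sum evaluated at frequency 1/period."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                      # remove the zero-frequency offset
    n = np.arange(len(x))
    return np.abs(np.sum(x * np.exp(-2j * np.pi * n / period))) ** 2

def block_bootstrap_pvalue(x, period=10, block=3, n_resamples=500, seed=0):
    """P-value for power at `period` against a block-resampled null.

    Blocks are drawn with replacement. The block length is an assumption;
    it should not be a multiple of `period`, or the resampling would
    preserve the phase of the periodic signal.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n_blocks = len(x) // block
    x = x[:n_blocks * block]              # trim so the signal splits evenly
    blocks = x.reshape(n_blocks, block)
    observed = ipdft_power(x, period)
    hits = 0
    for _ in range(n_resamples):
        picks = rng.integers(0, n_blocks, n_blocks)
        if ipdft_power(blocks[picks].ravel(), period) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)

# Synthetic sequence with AA placed exactly every 10 bases.
seq = ("AA" + "GCGCGCGC") * 30
signal = indicator(seq)
p_value = block_bootstrap_pvalue(signal, period=10)
```

Resampling short blocks keeps local composition roughly intact while breaking any longer-range phase relationship, which is why the block length is chosen to be short relative to the period under test. A real analysis would need to choose the block length and number of resamples according to the procedure in the full paper.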
