Segmentation of time series with long-range fractal correlations

Segmentation is a standard method of data analysis for identifying the change-points that divide a nonstationary time series into homogeneous segments. However, for series with long-range fractal correlations, most segmentation techniques detect spurious change-points that arise from the heterogeneities induced by the correlations rather than from real nonstationarities. To avoid this oversegmentation, we present a segmentation algorithm that takes as its reference for homogeneity not a random i.i.d. series but a correlated series, modeled by a fractional noise with the same degree of correlations as the series to be segmented. We apply our algorithm to artificial series with long-range correlations and show that it systematically detects only the change-points produced by real nonstationarities and not those created by the correlations of the signal. Further, we apply the method to the sequence of the long arm of human chromosome 21, which is known to have long-range fractal correlations. We obtain only three segments, which clearly correspond to the three regions of different G+C composition revealed by a multi-scale wavelet plot. Similar results have been obtained when segmenting all human chromosome sequences, showing the existence of previously unknown huge compositional superstructures in the human genome.
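The central idea, judging a candidate change-point against correlated rather than i.i.d. reference series, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a pooled-variance Student's t statistic to compare the means on either side of a cut (the abstract does not name the statistic), it approximates fractional noise by a truncated fractional-differencing filter with parameter d (d = 0 gives white noise), and it estimates significance by Monte Carlo surrogates. All function names and parameters (`fractional_weights`, `fractional_noise`, `max_t_statistic`, `change_point_significance`, `memory`, `min_len`) are hypothetical.

```python
import math
import random


def fractional_weights(d, memory):
    """Weights of the filter (1 - B)^(-d): psi_0 = 1, psi_k = psi_{k-1} * (k - 1 + d) / k."""
    psi = [1.0]
    for k in range(1, memory):
        psi.append(psi[-1] * (k - 1 + d) / k)
    return psi


def fractional_noise(n, d, rng, memory=200):
    """Approximate long-memory noise: white Gaussian noise filtered with truncated weights."""
    psi = fractional_weights(d, memory)
    eps = [rng.gauss(0.0, 1.0) for _ in range(n + memory)]
    return [sum(psi[k] * eps[t + memory - k] for k in range(memory)) for t in range(n)]


def max_t_statistic(x, min_len=10):
    """Return (t_max, i_max): the largest pooled-variance t between x[:i] and x[i:]."""
    n = len(x)
    total = sum(x)
    total_sq = sum(v * v for v in x)
    best_t, best_i = 0.0, None
    s1 = q1 = 0.0  # running sum and sum of squares of the left segment
    for i in range(1, n):
        s1 += x[i - 1]
        q1 += x[i - 1] ** 2
        n1, n2 = i, n - i
        if n1 < min_len or n2 < min_len:
            continue
        m1, m2 = s1 / n1, (total - s1) / n2
        ss1 = q1 - n1 * m1 * m1
        ss2 = (total_sq - q1) - n2 * m2 * m2
        pooled_var = max(ss1 + ss2, 0.0) / (n - 2)
        se = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
        if se > 0.0 and abs(m1 - m2) / se > best_t:
            best_t, best_i = abs(m1 - m2) / se, i
    return best_t, best_i


def change_point_significance(x, d, rng, n_surrogates=50, min_len=10):
    """Fraction of correlated surrogates whose t_max is below the observed t_max."""
    t_obs, i_obs = max_t_statistic(x, min_len)
    below = 0
    for _ in range(n_surrogates):
        surrogate = fractional_noise(len(x), d, rng)
        if max_t_statistic(surrogate, min_len)[0] < t_obs:
            below += 1
    return i_obs, below / n_surrogates


if __name__ == "__main__":
    rng = random.Random(0)
    # Correlated noise with a genuine jump in the mean at position 100.
    series = fractional_noise(200, 0.3, rng)
    series = [v + (10.0 if t >= 100 else 0.0) for t, v in enumerate(series)]
    cut, significance = change_point_significance(series, 0.3, rng, n_surrogates=30)
    print("candidate cut:", cut, "significance:", significance)
```

In this sketch a cut would be accepted only when the significance exceeds a chosen threshold, and the procedure would then be repeated on each resulting segment. Calling `change_point_significance` with d = 0 reproduces the i.i.d. reference, which is the setting in which correlated but stationary series tend to be oversegmented.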
