Estimating the number of segments in time series data using permutation tests

Segmentation is a popular technique for discovering structure in time series data. We address the largely open problem of estimating the number of segments that can be reliably discovered. We introduce a novel method for this problem, called Pete, which is based on permutation testing. The problem is an instance of model (dimension) selection. Rather than penalizing model complexity with an explicit term, the proposed method analyzes the possible overfit of a model to the available data. In this respect the approach is more similar to cross-validation than to regularization-based techniques (e.g., AIC, BIC, MDL, MML). Furthermore, the method produces a p-value for each increase in the number of segments, which gives the user an overview of the statistical significance of the segmentations. We evaluate the performance of the proposed method using both synthetic and real time series data. The experiments show that permutation testing gives realistic results for the number of reliably identifiable segments and compares favorably with Monte Carlo cross-validation (MCCV) and the commonly used BIC criterion.
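
To make the idea of a p-value for each increase in the number of segments concrete, the following is a minimal Python sketch of a generic permutation test on segmentation error. It is not the Pete algorithm itself, whose details are not given in this abstract. The piecewise-constant model, the squared-error cost, the relative error drop used as test statistic, the full shuffling of the series as null model, and the function names `segmentation_costs` and `permutation_p_values` are all assumptions made for illustration.

```python
import numpy as np


def segmentation_costs(x, k_max):
    """Minimum total squared error of a piecewise-constant segmentation of x
    into k segments, for k = 1..k_max, found by dynamic programming."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    s1 = np.concatenate(([0.0], np.cumsum(x)))      # prefix sums
    s2 = np.concatenate(([0.0], np.cumsum(x * x)))  # prefix sums of squares

    def sse(i, j):
        # Squared error of x[i:j] around its own mean (i may be an array).
        m = j - i
        return np.maximum(s2[j] - s2[i] - (s1[j] - s1[i]) ** 2 / m, 0.0)

    # best[j] = optimal cost of covering x[:j] with the current segment count
    best = np.array([np.inf] + [sse(0, j) for j in range(1, n + 1)])
    costs = [best[n]]
    for k in range(2, k_max + 1):
        new = np.full(n + 1, np.inf)
        for j in range(k, n + 1):
            i = np.arange(k - 1, j)  # candidate start of the last segment
            new[j] = np.min(best[i] + sse(i, j))
        best = new
        costs.append(best[n])
    return np.array(costs)


def relative_gains(x, k_max):
    """Relative error drop when going from k to k + 1 segments (assumed statistic)."""
    c = segmentation_costs(x, k_max)
    return (c[:-1] - c[1:]) / np.maximum(c[:-1], 1e-12)


def permutation_p_values(x, k_max, n_perm=200, seed=0):
    """One p-value per step k -> k + 1, comparing the observed error drop
    with drops obtained on randomly shuffled copies of the series
    (assumed null model: no temporal structure at all)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    observed = relative_gains(x, k_max)
    permuted = np.array(
        [relative_gains(rng.permutation(x), k_max) for _ in range(n_perm)]
    )
    exceed = (permuted >= observed).sum(axis=0)
    return (1 + exceed) / (1 + n_perm)


# Synthetic example: three segments with means 0, 5, 0 and unit-variance noise.
rng = np.random.default_rng(1)
signal = np.concatenate([np.zeros(40), np.full(40, 5.0), np.zeros(40)])
series = signal + rng.normal(0.0, 1.0, size=signal.size)
p = permutation_p_values(series, k_max=5, n_perm=100)
for k, pk in enumerate(p, start=1):
    print(f"{k} -> {k + 1} segments: p = {pk:.3f}")
```

On data like this one would expect small p-values for the steps up to three segments, since each of those steps removes far more error than it would on shuffled data. Shuffling the whole series is the simplest possible null model. A more careful variant might instead permute only within the segments already found.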
