Segmentation and dimensionality reduction

Sequence segmentation and dimensionality reduction are both established methods for studying high-dimensional sequences: each reduces the complexity of the representation of the original data. In this paper we study the interplay of these two techniques. We formulate the problem of segmenting a sequence while modeling it with a basis of small size, thus essentially reducing the dimension of the input sequence. We give three algorithms for this problem, all of which combine existing methods for sequence segmentation and dimensionality reduction. For two of the proposed algorithms we prove guarantees on the quality of the solutions obtained. We describe experimental results on synthetic and real datasets, including exchange-rate data and genomic sequences. Our experiments show that the algorithms do discover underlying structure in the data, including both segmental structure and interdependencies between the dimensions.
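To make the problem statement concrete, the following is a minimal sketch of one way segmentation and dimensionality reduction could be combined: segment first with a standard dynamic program, then fit a small basis to the segment means. It is not taken from the paper and should not be read as any of its three algorithms. The function names (segment_dp, segment_then_reduce), the squared-error cost, and the use of an uncentered, length-weighted SVD are all assumptions made for illustration.

```python
import numpy as np

def segment_dp(X, k):
    """Optimal k-segmentation of the rows of X (n x d) under squared error
    to per-segment means, via O(k * n^2) dynamic programming.
    Returns boundaries [0, b1, ..., n]. (Hypothetical helper.)"""
    n, d = X.shape
    # Prefix sums let us evaluate the cost of any segment in O(d).
    S = np.vstack([np.zeros(d), np.cumsum(X, axis=0)])
    Q = np.concatenate([[0.0], np.cumsum((X ** 2).sum(axis=1))])

    def cost(i, j):
        # Sum of squared distances of X[i:j] to its mean.
        s = S[j] - S[i]
        return Q[j] - Q[i] - s @ s / (j - i)

    E = np.full((k + 1, n + 1), np.inf)   # E[p, j]: best cost of X[:j] with p segments
    B = np.zeros((k + 1, n + 1), dtype=int)
    E[0, 0] = 0.0
    for p in range(1, k + 1):
        for j in range(p, n + 1):
            for i in range(p - 1, j):
                c = E[p - 1, i] + cost(i, j)
                if c < E[p, j]:
                    E[p, j] = c
                    B[p, j] = i
    # Backtrack from the end to recover segment boundaries.
    bounds, j = [n], n
    for p in range(k, 0, -1):
        j = int(B[p, j])
        bounds.append(j)
    return bounds[::-1]

def segment_then_reduce(X, k, m):
    """Segment X into k pieces, then describe the k segment means with a
    basis of m vectors (assumes m <= min(k, d)). Illustrative only."""
    bounds = segment_dp(X, k)
    means = np.array([X[a:b].mean(axis=0)
                      for a, b in zip(bounds[:-1], bounds[1:])])
    lengths = np.diff(bounds)
    # Scale each mean by sqrt(segment length) so that longer segments
    # weigh more in the basis fit.
    W = means * np.sqrt(lengths)[:, None]
    _, _, Vt = np.linalg.svd(W, full_matrices=False)
    basis = Vt[:m]              # m x d, orthonormal rows
    coeffs = means @ basis.T    # k x m coordinates of each segment mean
    return bounds, basis, coeffs
```

Each segment is then represented by m coefficients rather than d values, and coeffs @ basis gives the reduced-dimension approximation of the segment means. Segmenting first and reducing afterwards is only one possible ordering; the choice here is for brevity.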
