Aggregating time partitions

Partitions of sequential data arise either naturally or as the output of sequence segmentation algorithms. Each partition is a set of contiguous, non-overlapping segments of the timeline. The same timeline is often partitioned in many different ways; for example, different segmentation algorithms produce different partitions of the same underlying data points. In such cases, we are interested in producing an aggregate partition, i.e., a segmentation that agrees as much as possible with the input segmentations. We show that this problem can be solved optimally in polynomial time using dynamic programming. We also propose faster greedy heuristics that work well in practice. We experiment with our algorithms and demonstrate their utility in two applications: clustering the behavior of mobile-phone users, and combining the results of different segmentation algorithms on genomic sequences.
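
The abstract does not say how agreement between segmentations is measured, so the following is only a minimal sketch under stated assumptions, not the authors' algorithm. It assumes a timeline of n discrete points (0 to n-1), a segmentation given as a list of cut positions (a cut at b separates point b-1 from point b), and a pair-counting disagreement distance: the number of point pairs that one segmentation keeps together and the other splits. Under that assumed distance the total cost decomposes over the segments of the aggregate, so a dynamic program over segment end points finds an optimum. The names `disagreement`, `same_segment_counts`, and `aggregate` are hypothetical.

```python
from itertools import combinations


def disagreement(n, cuts_a, cuts_b):
    """Assumed distance: pairs of points grouped differently by the two segmentations."""
    def labels(cuts):
        cuts, seg, out = set(cuts), 0, []
        for p in range(n):
            if p in cuts:          # a cut at p starts a new segment at point p
                seg += 1
            out.append(seg)
        return out

    la, lb = labels(cuts_a), labels(cuts_b)
    return sum((la[x] == la[y]) != (lb[x] == lb[y])
               for x, y in combinations(range(n), 2))


def same_segment_counts(n, segmentations):
    """c[x][y] = number of inputs that place points x < y in the same segment."""
    c = [[0] * n for _ in range(n)]
    for cuts in segmentations:
        start = 0
        for end in sorted(cuts) + [n]:
            for x, y in combinations(range(start, end), 2):
                c[x][y] += 1
            start = end
    return c


def aggregate(n, segmentations):
    """Return cut positions minimising the summed disagreement with all inputs."""
    m = len(segmentations)
    c = same_segment_counts(n, segmentations)

    # w[i][j]: cost (up to an additive constant) of making points i..j-1 one
    # segment. Each pair kept together contributes m - 2*c[x][y].
    # This naive fill is cubic in n; it is kept simple for illustration.
    w = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(i + 2, n + 1):
            y = j - 1
            w[i][j] = w[i][j - 1] + sum(m - 2 * c[x][y] for x in range(i, y))

    # best[j]: minimum cost of segmenting the prefix of points 0..j-1.
    best = [0] + [float("inf")] * n
    prev = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            cand = best[i] + w[i][j]
            if cand < best[j]:
                best[j], prev[j] = cand, i

    # Walk back through the stored predecessors to recover the cuts.
    cuts, j = [], n
    while j > 0:
        j = prev[j]
        if j > 0:
            cuts.append(j)
    return sorted(cuts)
```

For instance, `aggregate(5, [[2], [2], [2]])` returns `[2]`, since three identical inputs are their own best aggregate. The greedy heuristics mentioned in the abstract are not sketched here because the abstract gives no detail about them.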
