Compression-based data mining of sequential data

The vast majority of data mining algorithms require the setting of many input parameters. The dangers of working with parameter-laden algorithms are twofold. First, incorrect settings may cause an algorithm to miss the true patterns. Second, and perhaps more insidiously, the algorithm may report spurious patterns that do not really exist, or greatly overestimate the significance of the patterns it reports. This is especially likely when the user does not understand the role the parameters play in the data mining process. Data mining algorithms should therefore have as few parameters as possible. A parameter-light algorithm limits our ability to impose our prejudices, expectations, and presumptions on the problem at hand, and lets the data itself speak to us. In this work, we show that recent results in bioinformatics, learning, and computational theory hold great promise for a parameter-light data mining paradigm. These results are strongly connected to Kolmogorov complexity theory. As a practical matter, however, they can be implemented using any off-the-shelf compression algorithm with the addition of just a dozen lines of code. Using empirical tests on time series, DNA, text, XML, and video datasets, we show that this approach is competitive with or superior to many state-of-the-art approaches in anomaly/interestingness detection, classification, and clustering. As further evidence of the advantages of our method, we demonstrate its effectiveness in solving a real-world classification problem: recommending printing services and products.
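
To make the "dozen lines of code" claim concrete, here is a minimal sketch of a compression-based dissimilarity built on an off-the-shelf compressor. The abstract does not spell out the measure, so the ratio used here (compressed size of the concatenation divided by the sum of the individual compressed sizes) is an assumption chosen for illustration, as are the function names `compressed_size` and `cdm` and the choice of `zlib`.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of `data` after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def cdm(x: bytes, y: bytes) -> float:
    """Illustrative compression-based dissimilarity (assumed form, see lead-in).

    If x and y share structure, compressing them together costs little more
    than compressing one of them, so the ratio drops towards 0.5. If they are
    unrelated, the ratio stays close to 1.
    """
    return compressed_size(x + y) / (compressed_size(x) + compressed_size(y))
```

The resulting pairwise values can be fed to any standard distance-based method, such as hierarchical clustering or nearest-neighbor classification, without further tuning. Note that `zlib` only looks back over a 32 KB window, so for long sequences a compressor with a larger context may be a better fit.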
