Towards parameter-free data mining

Most data mining algorithms require the setting of many input parameters. Working with parameter-laden algorithms carries two main dangers. First, incorrect settings may cause an algorithm to fail to find the true patterns. Second, and perhaps more insidiously, the algorithm may report spurious patterns that do not really exist, or greatly overestimate the significance of the reported patterns. This is especially likely when the user fails to understand the role of parameters in the data mining process.

Data mining algorithms should have as few parameters as possible, ideally none. A parameter-free algorithm would limit our ability to impose our prejudices, expectations, and presumptions on the problem at hand, and would let the data itself speak to us. In this work, we show that recent results in bioinformatics and computational theory hold great promise for a parameter-free data mining paradigm. The results are motivated by observations in Kolmogorov complexity theory; as a practical matter, however, they can be implemented using any off-the-shelf compression algorithm with the addition of just a dozen or so lines of code. We show that this approach is competitive with or superior to state-of-the-art approaches in anomaly/interestingness detection, classification, and clustering, with empirical tests on time series, DNA, text, and video datasets.
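The "dozen or so lines of code" idea can be sketched as a compression-based dissimilarity measure: a standard compressor stands in for the uncomputable Kolmogorov complexity, and the ratio of the compressed size of two concatenated sequences to the sum of their individual compressed sizes serves as a distance. This is an illustrative sketch, not necessarily the exact formula used in the paper; the function names `cdm` and `compressed_size`, and the choice of `zlib`, are assumptions for the example.

```python
import zlib

def compressed_size(s: bytes) -> int:
    """Size in bytes of s after zlib compression at the highest level."""
    return len(zlib.compress(s, 9))

def cdm(x: bytes, y: bytes) -> float:
    """Compression-based dissimilarity between two byte sequences.

    Values near 0.5 suggest the sequences share most of their structure
    (compressing one given the other adds little); values near 1.0
    suggest they are unrelated, since concatenating unrelated data
    yields almost no extra compression.
    """
    return compressed_size(x + y) / (compressed_size(x) + compressed_size(y))

# Illustrative comparison: a sequence against a copy of itself
# versus against unrelated data.
english = (b"the quick brown fox jumps over the lazy dog. ") * 40
unrelated = bytes(range(256)) * 8
print(cdm(english, english))    # low: the copy adds almost no new information
print(cdm(english, unrelated))  # higher: the concatenation compresses poorly
```

Such a measure has no tunable parameters beyond the choice of compressor, which is the point: the same function can be handed to an off-the-shelf clustering, classification, or anomaly detection routine in place of a hand-tuned distance.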
