Unsupervised interaction-preserving discretization of multivariate data

Discretization is the transformation of continuous data into discrete bins. It is an important and general pre-processing technique, and a critical element of many data mining and data management tasks. The general goal is to obtain discrete data that retains as much of the information in the continuous original as possible. In general, and in particular for exploratory tasks, a key open question is how to discretize multivariate data such that significant associations and patterns are preserved. That is exactly the problem we study in this paper. We propose IPD, an information-theoretic method for unsupervised discretization that focuses on preserving multivariate interactions. To this end, when discretizing a dimension, we consider the distribution of the data over all other dimensions. In particular, our method examines consecutive multivariate regions and combines them if (a) their multivariate data distributions are statistically similar, and (b) this merge reduces the MDL encoding cost. To assess the similarities, we propose ID, a novel interaction distance that does not require assuming a distribution and permits computation in closed form. We give an efficient algorithm for finding the optimal bin merge, as well as a fast, well-performing heuristic. Empirical evaluation through pattern-based compression, outlier mining, and classification shows that by preserving interactions we consistently outperform the state of the art in both quality and speed.
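To make the merging idea concrete, the following is a minimal sketch of bottom-up merging of consecutive regions along one dimension. It is not the IPD algorithm itself: the distance used here is a stand-in (the largest per-dimension two-sample Kolmogorov-Smirnov statistic over the other dimensions) rather than ID, and a fixed threshold replaces the MDL encoding-cost criterion. The names `ks_statistic`, `region_distance`, `merge_similar_bins`, `n_micro_bins`, and `threshold` are hypothetical and chosen only for illustration.

```python
import bisect


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic between 1-D samples a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    for v in a_sorted + b_sorted:
        # Empirical CDF of each sample evaluated at v.
        fa = bisect.bisect_right(a_sorted, v) / len(a_sorted)
        fb = bisect.bisect_right(b_sorted, v) / len(b_sorted)
        d = max(d, abs(fa - fb))
    return d


def region_distance(rows_a, rows_b, other_dims):
    """Stand-in distance (NOT the paper's ID): the largest per-dimension
    KS statistic over the dimensions that are not being discretized."""
    return max(
        ks_statistic([r[j] for r in rows_a], [r[j] for r in rows_b])
        for j in other_dims
    )


def merge_similar_bins(rows, dim, n_micro_bins=8, threshold=0.3):
    """Discretize `rows` along `dim` by greedily merging adjacent micro-bins
    whose distributions over the other dimensions are similar.

    Assumption: a fixed threshold stands in for the MDL-based stopping rule.
    Returns the list of cut points along `dim`.
    """
    other_dims = [j for j in range(len(rows[0])) if j != dim]
    ordered = sorted(rows, key=lambda r: r[dim])

    # Start from equal-frequency micro-bins; the last bin takes the remainder.
    size = len(ordered) // n_micro_bins
    bins = [ordered[i * size:(i + 1) * size] for i in range(n_micro_bins - 1)]
    bins.append(ordered[(n_micro_bins - 1) * size:])

    while len(bins) > 1:
        dists = [
            region_distance(bins[i], bins[i + 1], other_dims)
            for i in range(len(bins) - 1)
        ]
        i = min(range(len(dists)), key=dists.__getitem__)
        if dists[i] > threshold:
            break  # the most similar neighbours are still too different
        bins[i:i + 2] = [bins[i] + bins[i + 1]]

    # Place each cut point midway between neighbouring bins.
    return [
        (bins[i][-1][dim] + bins[i + 1][0][dim]) / 2
        for i in range(len(bins) - 1)
    ]


# Toy data: the second dimension jumps from about 0 to about 5 at x = 0.5.
rows = [
    (i / 400, (0.0 if i < 200 else 5.0) + ((i * 7) % 10) / 10)
    for i in range(400)
]
print(merge_similar_bins(rows, dim=0))
```

On this toy data the sketch keeps a single cut close to 0.5, which is where the relationship between the two dimensions changes, whereas an equal-width or equal-frequency discretization of the first dimension alone would have no reason to place a boundary there.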
