Mixtures of Rectangles: Interpretable Soft Clustering

To be effective, data-mining has to conclude with a succinct description of the data. To this end, we explore a clustering technique that finds dense regions in data. By constraining our model in a specific way, we are able to represent the interesting regions as an intersection of intervals. This has the advantage of being easily read and understood by humans. Specifically, we fit the data to a mixture model in which each component is a hyper-rectangle in M-dimensional space. Hyper-rectangles may overlap, meaning some points can have soft membership of several components. Each component is simply described by, for each attribute, lower and upper bounds of points in the cluster. The computational problem of finding a locally maximum-likelihood collection of k rectangles is made practical by allowing the rectangles to have soft "tails" in the early stages of an EM-like optimization scheme. Our method requires no user-supplied parameters except for the desired number of clusters. These advantages make it highly attractive for "turn-key" data-mining applications. We demonstrate the usefulness of the method in subspace clustering for synthetic data, and in real-life datasets. We also show its effectiveness in a classification setting.
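The core idea above can be sketched in code. The following is a minimal, illustrative implementation, not the paper's exact algorithm: each component is a box `[low, high]` per dimension whose density decays exponentially outside the box (a simple stand-in for the paper's soft "tails", governed by a hypothetical temperature `tau` that is annealed toward zero so the components harden into rectangles), and the M-step sets the bounds to responsibility-weighted quantiles. All function names, the tail shape, the quantile cutoffs, and the annealing schedule are assumptions made for this sketch.

```python
import numpy as np

def soft_rect_density(X, low, high, tau):
    """Density of one soft hyper-rectangle: uniform inside [low, high],
    exponentially decaying with distance outside, per dimension.
    (Illustrative tail shape; the paper's exact form may differ.)"""
    d = np.maximum(low - X, 0.0) + np.maximum(X - high, 0.0)  # 0 inside the box
    widths = high - low + 2.0 * tau                           # crude normaliser
    return np.prod(np.exp(-d / tau) / widths, axis=1)

def weighted_quantile(x, w, q):
    """q-th quantile of x under (non-negative) weights w."""
    order = np.argsort(x)
    cw = np.cumsum(w[order])
    cw = cw / cw[-1]
    return x[order][np.searchsorted(cw, q)]

def em_rectangles(X, k, iters=30, tau0=1.0, seed=0):
    """EM-like fit of k soft rectangles; anneals tau so tails shrink."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    # Farthest-point initialisation keeps the k seed boxes spread out.
    centers = [X[rng.integers(n)]]
    for _ in range(k - 1):
        dist = np.min(
            np.stack([np.linalg.norm(X - c, axis=1) for c in centers]), axis=0)
        centers.append(X[np.argmax(dist)])
    centers = np.array(centers)
    low, high = centers - 0.5, centers + 0.5
    weights = np.full(k, 1.0 / k)
    tau = tau0
    for _ in range(iters):
        # E-step: soft responsibilities of each point for each rectangle.
        dens = np.stack(
            [soft_rect_density(X, low[j], high[j], tau) for j in range(k)], axis=1)
        resp = weights * dens
        resp = resp / (resp.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: bounds from responsibility-weighted quantiles (assumed cutoffs).
        for j in range(k):
            r = resp[:, j]
            for d in range(m):
                low[j, d] = weighted_quantile(X[:, d], r, 0.02)
                high[j, d] = weighted_quantile(X[:, d], r, 0.98)
            weights[j] = r.mean()
        tau *= 0.9  # anneal soft tails toward hard rectangles
    return low, high, weights
```

On two well-separated uniform blobs, the fitted boxes land on the blobs and their bounds read directly as per-attribute intervals, which is the interpretability argument made above.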