TSum: fast, principled table summarization

Given a table where rows correspond to records and columns correspond to attributes, we want to find a small number of patterns that succinctly summarize the dataset. For example, given a set of patient records with several attributes each, how can we find (a) that the "most representative" pattern is, say, (male, adult, *), followed by (*, child, low-cholesterol), etc? We propose TSum, a method that provides a sequence of patterns ordered by their "representativeness." It can decide both which these patterns are, as well as how many are necessary to properly summarize the data. Our main contribution is formulating a general framework, TSum, using compression principles. TSum can easily accommodate different optimization strategies for selecting and refining patterns. The discovered patterns can be used to both represent the data efficiently, as well as interpret it quickly. Extensive experiments demonstrate the effectiveness and intuitiveness of our discovered patterns.

[1]  Geoffrey I. Webb Discovering significant rules , 2006, KDD '06.

[2]  Philip S. Yu,et al.  Finding generalized projected clusters in high dimensional spaces , 2000, SIGMOD '00.

[3]  Tian Zhang,et al.  BIRCH: an efficient data clustering method for very large databases , 1996, SIGMOD '96.

[4]  Aristides Gionis,et al.  Approximating a collection of frequent sets , 2004, KDD.

[5]  Jacob Ziv,et al.  Compression of Two-Dimensional Images , 1985 .

[6]  Sudipto Guha,et al.  CURE: an efficient clustering algorithm for large databases , 1998, SIGMOD '98.

[7]  Alfred V. Aho,et al.  Optimal partial-match retrieval when fields are independently specified , 1979, ACM Trans. Database Syst..

[8]  Christos Faloutsos,et al.  Efficiently supporting ad hoc queries in large datasets of time sequences , 1997, SIGMOD '97.

[9]  Dimitrios Gunopulos,et al.  Automatic Subspace Clustering of High Dimensional Data , 2005, Data Mining and Knowledge Discovery.

[10]  Hans-Peter Kriegel,et al.  Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications , 1998, Data Mining and Knowledge Discovery.

[11]  Sudipto Guha,et al.  CURE: an efficient clustering algorithm for large databases , 1998, SIGMOD '98.

[12]  Peter Grünwald,et al.  A tutorial introduction to the minimum description length principle , 2004, ArXiv.

[13]  Hans-Peter Kriegel,et al.  OPTICS: ordering points to identify the clustering structure , 1999, SIGMOD '99.

[14]  T. Landauer,et al.  Indexing by Latent Semantic Analysis , 1990 .

[15]  Jorma Rissanen,et al.  Minimum Description Length Principle , 2010, Encyclopedia of Machine Learning.

[16]  Philip S. Yu,et al.  Mining Colossal Frequent Patterns by Core Pattern Fusion , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[17]  Jiawei Han,et al.  Summarizing itemset patterns: a profile-based approach , 2005, KDD '05.