Independence diagrams: A technique for data visualization

An important issue in data visualization is the recognition of complex dependencies between attributes. Past techniques for identifying attribute dependence include correlation coefficients, scatterplots, and equi-width histograms. These techniques are sensitive to outliers, and often are not sufficiently informative to identify the kind of attribute dependence present. We propose a new approach, which we call independence diagrams. We divide each attribute into ranges; for each pair of attributes, the combination of these ranges defines a two-dimensional grid. For each cell of this grid, we store the number of data items in it. We display the grid, scaling each attribute axis so that the displayed width of a range is proportional to the total number of data items within that range. The brightness of a cell is proportional to the density of data items in it. As a result, both attributes are independently normalized by frequency, ensuring insensitivity to outliers and skew, and allowing specific focus on attribute dependencies. Furthermore, independence diagrams provide quantitative measures of the interaction between two attributes, and allow formal reasoning about issues such as statistical significance. We have addressed several technical challenges in making independence diagrams work, ranging from the treatment of categorical attributes to visual artifacts of cell-to-pixel mapping. Our experimental evaluation, using both AT&T and synthetic data, shows that independence diagrams allow the easy identification of various kinds of attribute dependence that would be difficult to identify using conventional techniques. © 2000 SPIE and IS&T. (S1017-9909(00)01704-9)
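
The construction described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `independence_diagram`, the `buckets` parameter, and the choice of quantile boundaries for the ranges are assumptions made for illustration (the abstract says only that each attribute is divided into ranges). The sketch counts the items in each grid cell, derives each range's displayed width from its share of the data, and computes a cell density as the count divided by the displayed cell area. Under these assumptions a density of 1 is what independent attributes would produce, and values above or below 1 indicate more or fewer items than independence predicts.

```python
import numpy as np

def independence_diagram(x, y, buckets=4):
    """Hypothetical helper: grid counts, axis widths, and cell densities
    for two numeric attributes. Range boundaries are placed at quantiles,
    which is an assumption of this sketch."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)

    # Range boundaries for each attribute (assumed: equally spaced quantiles).
    qs = np.linspace(0.0, 1.0, buckets + 1)
    x_edges = np.quantile(x, qs)
    y_edges = np.quantile(y, qs)

    # Index of the range each item falls into, clipped so the maximum
    # value lands in the last range.
    xi = np.clip(np.searchsorted(x_edges, x, side="right") - 1, 0, buckets - 1)
    yi = np.clip(np.searchsorted(y_edges, y, side="right") - 1, 0, buckets - 1)

    # Number of data items in each cell of the two-dimensional grid.
    counts = np.zeros((buckets, buckets), dtype=int)
    np.add.at(counts, (xi, yi), 1)

    # Displayed width of a range is proportional to the items it holds.
    x_widths = counts.sum(axis=1) / n
    y_widths = counts.sum(axis=0) / n

    # Density = items per unit of displayed area, scaled so that
    # independent attributes give 1.0 in every cell.
    area = np.outer(x_widths, y_widths)
    density = np.divide(counts, n * area,
                        out=np.zeros_like(area), where=area > 0)
    return x_widths, y_widths, density

if __name__ == "__main__":
    # Perfectly dependent attributes; the outlier 1000 changes neither
    # the widths nor the densities, because both axes are normalized
    # by frequency rather than by value.
    wx, wy, d = independence_diagram([0, 1, 2, 1000], [0, 1, 2, 1000], buckets=2)
    print(wx, wy)
    print(d)
```

Mapping `density` to brightness and drawing each cell with the computed widths would then give the picture; that rendering step is omitted here.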
