Algebraic Techniques for Analysis of Large Discrete-Valued Datasets

With the availability of large-scale computing platforms and instrumentation for data gathering, increased emphasis is being placed on efficient techniques for analyzing large and extremely high-dimensional datasets. In this paper, we present a novel algebraic technique, based on a variant of the semi-discrete matrix decomposition (SDD), that compresses large discrete-valued datasets in an error-bounded fashion. We show that this compression process can be viewed as identifying dominant patterns in the underlying data. We derive efficient algorithms for computing dominant patterns, quantify their performance both analytically and experimentally, and identify applications of these algorithms in problems ranging from clustering to vector quantization. We demonstrate the strengths of our algorithm in terms of (i) scalability to extremely high dimensions; (ii) bounded error; and (iii) its hierarchical nature, which enables multiresolution analysis. Detailed experimental results are provided to support these claims.
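To make the idea of a "dominant pattern" concrete, the following is a minimal sketch, not the paper's algorithm. It assumes a binary (0/1) matrix and uses a simple alternating heuristic to approximate it by an outer product x y^T of two binary vectors, where y is the pattern and x marks the rows that contain it. The function names rank_one_binary and hamming_error, the initialization rule, and the example matrix are all hypothetical choices made for illustration.

```python
def rank_one_binary(A, max_iter=100):
    """Approximate a 0/1 matrix A (list of rows) by x y^T with binary x, y.

    Illustrative alternating heuristic; an assumption, not the paper's method.
    """
    m, n = len(A), len(A[0])
    # Assumed initialization: start the pattern from the densest row.
    y = list(max(A, key=sum))
    x = [0] * m
    for _ in range(max_iter):
        ones_y = sum(y)
        # With y fixed, row i reduces the error only if it shares
        # more than half of the ones in y.
        new_x = [
            1 if 2 * sum(a * b for a, b in zip(row, y)) > ones_y else 0
            for row in A
        ]
        ones_x = sum(new_x)
        # With x fixed, column j belongs to the pattern only if more than
        # half of the selected rows have a one in that column.
        new_y = [
            1 if 2 * sum(A[i][j] for i in range(m) if new_x[i]) > ones_x else 0
            for j in range(n)
        ]
        if new_x == x and new_y == y:
            break  # fixed point reached
        x, y = new_x, new_y
    return x, y


def hamming_error(A, x, y):
    """Count the entries where A differs from the outer product x y^T."""
    return sum(
        A[i][j] != x[i] * y[j]
        for i in range(len(A))
        for j in range(len(A[0]))
    )


# Hypothetical example: three rows share the pattern [1, 1, 0, 0].
A = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
]
x, y = rank_one_binary(A)
print(x, y, hamming_error(A, x, y))
```

In this sketch the rows with x = 1 are summarized by the single pattern y, and the count returned by hamming_error is the approximation error. One plausible way to obtain a hierarchy, again as an assumption, is to split the rows by x and repeat the procedure on each part until the error of every part falls below a chosen bound.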
