Scalable Hierarchical Clustering Method for Sequences of Categorical Values

Data clustering methods have many applications in data mining. Traditional clustering algorithms operate on individual data points described by quantitative or categorical attributes. However, many important databases store sequences of categorical values, where significant knowledge is hidden in the sequential dependencies between the values. In this paper we introduce the problem of clustering categorical data sequences and present an efficient, scalable algorithm to solve it. Our algorithm implements the general idea of agglomerative hierarchical clustering and uses frequently occurring subsequences as the features that describe data sequences. The algorithm not only discovers a set of high-quality clusters containing similar data sequences but also provides descriptions of the discovered clusters.
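
The general idea can be illustrated with a minimal sketch. It is not the paper's algorithm: the brute-force pattern enumeration, the Jaccard similarity between pattern sets, the rule that a merged cluster is described by the union of its members' patterns, and every name and parameter (`frequent_patterns`, `cluster`, `min_support`, `max_len`, `n_clusters`) are assumptions made for illustration only, and nothing here is tuned for scalability.

```python
from itertools import combinations


def subsequences(seq, max_len):
    """Distinct (possibly non-contiguous) subsequences of seq, length 1..max_len."""
    found = set()
    for k in range(1, max_len + 1):
        for idx in combinations(range(len(seq)), k):
            found.add(tuple(seq[i] for i in idx))
    return found


def frequent_patterns(sequences, min_support, max_len=2):
    """Patterns contained in at least min_support sequences (brute force)."""
    counts = {}
    for seq in sequences:
        for pat in subsequences(seq, max_len):
            counts[pat] = counts.get(pat, 0) + 1
    return {p for p, c in counts.items() if c >= min_support}


def jaccard(a, b):
    """Assumed similarity measure: overlap of two pattern sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(sequences, min_support=2, max_len=2, n_clusters=2):
    """Agglomerative clustering over frequent-pattern feature sets.

    Returns a list of (member_indices, describing_patterns) pairs.
    """
    frequent = frequent_patterns(sequences, min_support, max_len)
    # Start with one cluster per sequence, described by its frequent patterns.
    clusters = [([i], subsequences(s, max_len) & frequent)
                for i, s in enumerate(sequences)]
    while len(clusters) > max(n_clusters, 1):
        # Find the most similar pair of clusters.
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            sim = jaccard(clusters[i][1], clusters[j][1])
            if best is None or sim > best[0]:
                best = (sim, i, j)
        _, i, j = best
        # Merge the pair; the merged description is the union of patterns.
        merged = (clusters[i][0] + clusters[j][0],
                  clusters[i][1] | clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


if __name__ == "__main__":
    data = [list("abc"), list("abd"), list("xyz"), list("xyw")]
    for members, description in cluster(data):
        print(sorted(members), sorted(description))
```

Each returned pair holds the indices of the sequences in a cluster together with the pattern set that describes it, which mirrors the abstract's point that the clusters come with descriptions. A real implementation would replace the exhaustive enumeration with a dedicated sequential pattern mining step.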
