A fast algorithm for subspace clustering by pattern similarity

Unlike traditional clustering methods that focus on grouping objects with similar values on a set of dimensions, clustering by pattern similarity finds objects that exhibit a coherent pattern of rise and fall in subspaces. Pattern-based clustering extends the concept of traditional clustering and benefits a wide range of applications, including large scale scientific data analysis, target marketing, Web usage analysis, etc. However, state-of-the-art pattern-based clustering methods (e.g., the pCluster algorithm) can only handle data sets of thousands of records, which makes them inappropriate for many real-life applications. Furthermore, besides the huge data volume, many data sets are also characterized by their sequentiality, for instance, customer purchase records and network event logs are usually modeled as data sequences. Hence, it becomes important to enable pattern-based clustering methods i) to handle large datasets, and ii) to discover pattern similarity embedded in data sequences. In this paper, we present a novel algorithm that offers this capability. Experimental results from both real life and synthetic datasets prove its effectiveness and efficiency.

[1]  Philip S. Yu,et al.  Fast algorithms for projected clustering , 1999, SIGMOD '99.

[2]  Wen-Chi Hou,et al.  Proceedings of the 11th International Conference on Scientific and Statistical Database Management , 1999 .

[3]  Dimitrios Gunopulos,et al.  Automatic subspace clustering of high dimensional data for data mining applications , 1998, SIGMOD '98.

[4]  Clu-istos Foutsos,et al.  Fast subsequence matching in time-series databases , 1994, SIGMOD '94.

[5]  H. V. Jagadish,et al.  Semantic Compression and Pattern Extraction with Fascicles , 1999, VLDB.

[6]  Philip S. Yu,et al.  ViST: a dynamic index method for querying XML data by tree structures , 2003, SIGMOD '03.

[7]  Gaston H. Gonnet,et al.  New Indices for Text: Pat Trees and Pat Arrays , 1992, Information Retrieval: Data Structures & Algorithms.

[8]  Alison Abbott Scientists urged to raise lobbying efforts , 1999, Nature.

[9]  Edward M. McCreight,et al.  A Space-Economical Suffix Tree Construction Algorithm , 1976, JACM.

[10]  Philip S. Yu,et al.  /spl delta/-clusters: capturing subspace correlation in a large data set , 2002, Proceedings 18th International Conference on Data Engineering.

[11]  Philip S. Yu,et al.  MaPle: a fast algorithm for maximal pattern-based clustering , 2003, Third IEEE International Conference on Data Mining.

[12]  Philip S. Yu,et al.  Indexing weighted-sequences in large databases , 2003, Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405).

[13]  Philip S. Yu,et al.  Finding generalized projected clusters in high dimensional spaces , 2000, SIGMOD '00.

[14]  George M. Church,et al.  Biclustering of Expression Data , 2000, ISMB.

[15]  Yi Zhang,et al.  Entropy-based subspace clustering for mining numerical data , 1999, KDD '99.

[16]  Joseph L. Hellerstein,et al.  FARM: a framework for exploring mining spaces with multiple attributes , 2001, Proceedings 2001 IEEE International Conference on Data Mining.

[17]  Graham Cameron,et al.  One-stop shop for microarray data , 2000, Nature.

[18]  Eugene W. Myers,et al.  Suffix arrays: a new method for on-line string searches , 1993, SODA '90.

[19]  Alison Abbott Bioinformatics institute plans public database for gene expression data , 1999, Nature.

[20]  Philip S. Yu,et al.  Clustering by pattern similarity in large data sets , 2002, SIGMOD '02.

[21]  Haixun Wang,et al.  Landmarks: a new model for similarity-based pattern querying in time series databases , 2000, Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073).