Efficient layered density-based clustering of categorical data

A challenge involved in applying density-based clustering to categorical biomedical data is that the "cube" of attribute values has no ordering defined, making the search for dense subspaces slow. We propose the HIERDENC algorithm for hierarchical density-based clustering of categorical data, and a complementary index for searching for dense subspaces efficiently. The HIERDENC index is updated when new objects are introduced, such that clustering does not need to be repeated on all objects. The updating and cluster retrieval are efficient. Comparisons with several other clustering algorithms showed that on large datasets HIERDENC achieved better runtime scalability on the number of objects, as well as cluster quality. By fast collapsing the bicliques in large networks we achieved an edge reduction of as much as 86.5%. HIERDENC is suitable for large and quickly growing datasets, since it is independent of object ordering, does not require re-clustering when new data emerges, and requires no user-specified input parameters.

[1]  Joshua Zhexue Huang,et al.  Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values , 1998, Data Mining and Knowledge Discovery.

[2]  Kevin D. Reilly,et al.  SEQOPTICS: a protein sequence clustering system , 2006, BMC Bioinformatics.

[3]  Sudipto Guha,et al.  ROCK: a robust clustering algorithm for categorical attributes , 1999, Proceedings 15th International Conference on Data Engineering (Cat. No.99CB36337).

[4]  H. Akaike A new look at the statistical model identification , 1974 .

[5]  Dimitrios Gunopulos,et al.  Automatic subspace clustering of high dimensional data for data mining applications , 1998, SIGMOD '98.

[6]  Johannes Gehrke,et al.  CACTUS—clustering categorical data using summaries , 1999, KDD '99.

[7]  Michael Schroeder,et al.  Unraveling Protein Networks with Power Graph Analysis , 2008, PLoS Comput. Biol..

[8]  Hans-Peter Kriegel,et al.  Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications , 1998, Data Mining and Knowledge Discovery.

[9]  Hans-Peter Kriegel,et al.  OPTICS: ordering points to identify the clustering structure , 1999, SIGMOD '99.

[10]  Robert Krauthgamer,et al.  The Black-Box Complexity of Nearest Neighbor Search , 2004, ICALP.

[11]  Catherine Blake,et al.  UCI Repository of machine learning databases , 1998 .

[12]  Peter C. Cheeseman,et al.  Bayesian Classification (AutoClass): Theory and Results , 1996, Advances in Knowledge Discovery and Data Mining.

[13]  R. Suganya,et al.  Data Mining Concepts and Techniques , 2010 .

[14]  R. Mojena,et al.  Hierarchical Grouping Methods and Stopping Rules: An Evaluation , 1977, Comput. J..

[15]  Daniel A. Keim,et al.  An Efficient Approach to Clustering in Large Multimedia Databases with Noise , 1998, KDD.

[16]  Philip S. Yu,et al.  Finding generalized projected clusters in high dimensional spaces , 2000, SIGMOD '00.

[17]  Philip S. Yu,et al.  A Framework for Projected Clustering of High Dimensional Data Streams , 2004, VLDB.

[18]  Andreas Rudolph,et al.  Techniques of Cluster Algorithms in Data Mining , 2002, Data Mining and Knowledge Discovery.

[19]  Jon M. Kleinberg,et al.  Segmentation problems , 2004, JACM.

[20]  Bin Zhang,et al.  Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R , 2008, Bioinform..

[21]  Xiaogang Wang,et al.  Clustering by common friends finds locally significant proteins mediating modules , 2007, Bioinform..

[22]  Chris H. Q. Ding,et al.  Adaptive dimension reduction for clustering high dimensional data , 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings..

[23]  Hans-Peter Kriegel,et al.  A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise , 1996, KDD.

[24]  Xiaogang Wang,et al.  Finding molecular complexes through multiple layer clustering of protein interaction networks , 2007, Int. J. Bioinform. Res. Appl..

[25]  George Karypis,et al.  C HAMELEON : A Hierarchical Clustering Algorithm Using Dynamic Modeling , 1999 .

[26]  Pavel Berkhin,et al.  A Survey of Clustering Data Mining Techniques , 2006, Grouping Multidimensional Data.

[27]  Mohammed J. Zaki,et al.  CLICKS: Mining Subspace Clusters in Categorical Data via K-Partite Maximal Cliques , 2005, 21st International Conference on Data Engineering (ICDE'05).

[28]  Tao Li,et al.  Entropy-based criterion in categorical clustering , 2004, ICML.

[29]  Zhang Yi,et al.  Clustering Categorical Data , 2000, Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073).

[30]  Jinyuan You,et al.  CLOPE: a fast and effective clustering algorithm for transactional data , 2002, KDD.

[31]  Jian Pei,et al.  DHC: a density-based hierarchical clustering method for time series gene expression data , 2003, Third IEEE Symposium on Bioinformatics and Bioengineering, 2003. Proceedings..

[32]  Jon M. Kleinberg,et al.  Clustering categorical data: an approach based on dynamical systems , 2000, The VLDB Journal.

[33]  Alejandro A. Schäffer,et al.  Database indexing for production MegaBLAST searches , 2008, Bioinform..

[34]  Jiawei Han,et al.  Data Mining: Concepts and Techniques , 2000 .

[35]  William Andreopoulos,et al.  Clustering algorithms for categorical data , 2006 .

[36]  Aristides Gionis,et al.  Dimension induced clustering , 2005, KDD '05.

[37]  Rui Xu,et al.  Survey of clustering algorithms , 2005, IEEE Transactions on Neural Networks.

[38]  Christopher J. Merz,et al.  UCI Repository of Machine Learning Databases , 1996 .

[39]  Xiaogang Wang,et al.  Hierarchical Density-Based Clustering of Categorical Data and a Simplification , 2007, PAKDD.

[40]  Vipin Kumar,et al.  Chameleon: Hierarchical Clustering Using Dynamic Modeling , 1999, Computer.