Hybrid Algorithm for Noise-free High Density Clusters with Self-Detection of Best Number of Clusters

Clustering is the process of discovering groups of objects such that objects in the same group are similar and objects in different groups are dissimilar. Many clustering algorithms exist, but most of them are very sensitive to their input parameters. Minimum Spanning Tree (MST) based clustering is capable of detecting clusters with irregular boundaries, and the density-based notion of clusters is designed to discover clusters of arbitrary shape. In this paper we propose a combined approach, based on MST-based clustering and density-based clustering, that produces noise-free, high-density clusters and detects the best number of clusters by itself. The algorithm uses a new cluster validation criterion, based on the geometric properties of the partition of the data set, to find the proper number of clusters at each level. The algorithm works in two phases. The first phase produces subtrees (noise-free clusters). The second phase finds high-density clusters from the subtrees.
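As a rough illustration of the first phase only, the sketch below builds a Euclidean MST with Prim's algorithm, cuts unusually long edges to obtain subtrees, and treats very small subtrees as noise. The cut rule (mean edge weight plus a multiple of the standard deviation), the parameters `factor` and `min_size`, and all function names are assumptions made for this example; the abstract specifies neither the paper's edge-removal rule nor its validation criterion, and the density-based second phase is not sketched here.

```python
import math
from statistics import mean, pstdev


def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph.

    Returns a list of (weight, u, v) edges forming a minimum spanning tree.
    """
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known link from each vertex to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the vertex outside the tree that is cheapest to attach.
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        if parent[u] != -1:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges


def mst_subtrees(points, factor=0.5, min_size=2):
    """Cut long MST edges and split the points into subtrees.

    Assumed rule (not taken from the paper): an edge is cut when its weight
    exceeds mean + factor * std of all MST edge weights. Subtrees with fewer
    than min_size points are reported as noise.
    Returns (clusters, noise) as lists of point indices.
    """
    n = len(points)
    edges = mst_edges(points)
    root = list(range(n))  # union-find forest over point indices

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]  # path halving
            x = root[x]
        return x

    if edges:
        weights = [w for w, _, _ in edges]
        threshold = mean(weights) + factor * pstdev(weights)
        for w, u, v in edges:
            if w <= threshold:  # keep short edges, drop long ones
                root[find(u)] = find(v)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)

    clusters = sorted(g for g in groups.values() if len(g) >= min_size)
    noise = sorted(i for g in groups.values() if len(g) < min_size for i in g)
    return clusters, noise
```

For two compact groups of points and one far-away point, `mst_subtrees` would be expected to return the two groups as clusters and the isolated point as noise. The result depends heavily on `factor`, which is the kind of parameter sensitivity the abstract says the proposed algorithm aims to avoid.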
