Clustering by fast search and merge of local density peaks for gene expression microarray data

Clustering is an unsupervised approach that groups elements by their similarity and is used to uncover the intrinsic patterns in data. It has numerous applications in bioinformatics, pattern recognition, and astronomy. This paper presents a clustering approach based on the idea that a cluster consists of one or more density-connected regions, in which the density maximum represents the center of the corresponding region. More precisely, our approach first finds the local density regions and then merges the density-connected regions to form meaningful clusters. As a result, outliers are detected automatically, regions of higher density are determined intuitively and merged to form clusters of arbitrary shape, and clusters are identified regardless of the dimensionality of the space in which they are embedded. Extensive experiments on several complex data sets analyze and compare our approach with state-of-the-art clustering methods. In addition, we benchmark the algorithm on gene expression microarray data sets for three tasks: cancer subtyping, distinguishing normal tissue from tumor tissue, and classifying multiple-tissue data sets.
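The two-stage idea described above (find local density regions, then merge density-connected ones) can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not give the density estimate or the merging criterion, so everything concrete here is an assumption made for illustration. The function name `density_peak_merge` and the parameters `dc` (cutoff distance), `merge_ratio`, and `min_density` are hypothetical. Density is assumed to be the number of neighbors within `dc`, and two regions are assumed to be density connected when a pair of nearby points across their border is at least `merge_ratio` times as dense as the sparser of the two peaks.

```python
import numpy as np


def density_peak_merge(X, dc, merge_ratio=0.5, min_density=1):
    """Illustrative two-stage clustering: local density regions, then merging.

    All names and criteria are assumptions for illustration, not the
    paper's definitions. Returns one label per point; -1 marks outliers.
    """
    X = np.asarray(X, dtype=float)
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    near = dist < dc

    # Assumed local density: number of other points within the cutoff dc.
    rho = near.sum(axis=1) - 1
    noise = rho < min_density  # points in very sparse areas are outliers

    # Rank points from densest to sparsest; ties are broken by index.
    order = np.argsort(-rho, kind="stable")
    rank = np.empty(n, dtype=int)
    rank[order] = np.arange(n)

    # Stage 1: each point joins the region of its nearest higher-ranked
    # neighbor within dc; a point with no such neighbor is a local peak.
    region = np.arange(n)
    for i in order:
        if noise[i]:
            continue
        higher = np.flatnonzero(near[i] & (rank < rank[i]))
        if higher.size:
            region[i] = region[higher[np.argmin(dist[i, higher])]]

    # Stage 2: merge regions joined by a sufficiently dense border
    # (assumed criterion), using a small union-find over the peaks.
    root = {int(p): int(p) for p in np.unique(region[~noise])}

    def find(p):
        while root[p] != p:
            root[p] = root[root[p]]
            p = root[p]
        return p

    for i, j in zip(*np.nonzero(np.triu(near, k=1))):
        if noise[i] or noise[j]:
            continue
        a, b = int(region[i]), int(region[j])
        if a != b and min(rho[i], rho[j]) >= merge_ratio * min(rho[a], rho[b]):
            root[find(a)] = find(b)

    # Relabel merged regions as consecutive cluster ids.
    labels = np.full(n, -1)
    cluster_ids = {}
    for i in np.flatnonzero(~noise):
        labels[i] = cluster_ids.setdefault(find(int(region[i])), len(cluster_ids))
    return labels
```

Calling `density_peak_merge(X, dc=0.25)` on an `(n, d)` array returns an integer label per row. The sketch builds the full pairwise distance matrix, so it needs quadratic memory and suits only small data sets. A lower `merge_ratio` merges neighboring regions more readily, which is what lets elongated or irregular shapes end up as a single cluster.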
