A novel DBSCAN with entropy and probability for mixed data

In big data settings, detecting clusters of different sizes and shapes is a challenging and important task. Over the last several years, density-based clustering approaches have been widely used in many areas of science because of their simplicity and their ability to detect clusters of different sizes and shapes. This paper proposes a modified version of the DBSCAN algorithm for clustering mixed (numeric and categorical) data, which applies several conversions to the categorical attributes. It is called the density-based clustering algorithm for mixed data with integration of entropy and probability distribution (EPDCA). Several optional conversions are provided so that the clustering process can be adapted to the data. Benchmark data sets from the UCI repository were selected to test the capability and validity of EPDCA. The results show that EPDCA considerably improves clustering, especially in automatically determining the number of clusters, in noise discovery, and in the time needed to form clusters.
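The abstract does not specify the conversions EPDCA applies, so the following Python sketch is only an illustration of the general idea, not the authors' algorithm. It assumes one plausible conversion: each categorical value is replaced by its relative frequency (an empirical probability), weighted by the normalized Shannon entropy of its column, and a plain DBSCAN then runs on the resulting numeric points. The function names, the weighting scheme, the toy data, and the `eps` and `min_pts` values are all hypothetical.

```python
import math
from collections import Counter

def category_probabilities(column):
    """Empirical probability (relative frequency) of each category."""
    counts = Counter(column)
    return {value: count / len(column) for value, count in counts.items()}

def shannon_entropy(column):
    """Shannon entropy (in bits) of a categorical column."""
    return -sum(p * math.log2(p) for p in category_probabilities(column).values())

def encode_mixed(rows, categorical_idx):
    """Assumed conversion: categorical value -> entropy-weighted frequency;
    numeric value -> min-max scaled to [0, 1]."""
    encoded_cols = []
    for j, col in enumerate(zip(*rows)):
        if j in categorical_idx:
            probs = category_probabilities(col)
            k = len(probs)
            # Normalized entropy in [0, 1]; a constant column gets weight 0.
            weight = shannon_entropy(col) / math.log2(k) if k > 1 else 0.0
            encoded_cols.append([weight * probs[v] for v in col])
        else:
            lo, hi = min(col), max(col)
            span = (hi - lo) or 1.0
            encoded_cols.append([(x - lo) / span for x in col])
    return [list(point) for point in zip(*encoded_cols)]

def dbscan(points, eps, min_pts):
    """Plain DBSCAN with Euclidean distance; returns one label per point, -1 = noise."""
    NOISE, UNSEEN = -1, None
    labels = [UNSEEN] * len(points)

    def neighbors(i):
        # Brute-force range query; includes the point itself.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = NOISE
            continue
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_nbrs)
        cluster += 1
    return labels

# Toy mixed data: (numeric attribute, categorical attribute).
rows = [
    (1.0, "red"), (1.1, "red"), (0.9, "red"),
    (5.0, "blue"), (5.2, "blue"), (4.9, "blue"),
    (9.0, "green"),
]
points = encode_mixed(rows, categorical_idx={1})
labels = dbscan(points, eps=0.1, min_pts=2)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

In this toy run the number of clusters is not supplied in advance and the isolated row is labeled as noise, which are the two DBSCAN properties the abstract highlights. Note that a pure frequency encoding maps equally frequent categories (here "red" and "blue") to the same value, so the numeric attribute is what separates them. This limitation is one reason a method might offer several alternative conversions.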
