An Enhanced k-Means Clustering Algorithm for Pattern Discovery in Healthcare Data

The huge amounts of data generated by media sensors in health monitoring systems, by medical diagnoses that produce media (audio, video, image, and text) content, and by health service providers are too complex and voluminous to be processed and analyzed by traditional methods. Data mining approaches offer the methodology and technology to transform these heterogeneous data into meaningful information for decision making. This paper studies data mining applications in healthcare. In particular, we study k-means clustering algorithms on large datasets and present an enhancement to k-means clustering that requires k or fewer passes over a dataset. The proposed algorithm, which we call G-means, uses a greedy approach to produce the preliminary centroids and then takes k or fewer passes over the dataset to adjust these center points. Our experiments, which were run in an increasing manner on the same dataset, show that G-means outperforms k-means in terms of entropy and F-scores. The experiments also yield better results for G-means in terms of the coefficient of variance and the execution time.
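
The two-phase structure described above (greedy seeding, then a bounded number of refinement passes) can be illustrated with a minimal sketch. The abstract does not specify the greedy rule used by G-means, so the sketch below substitutes farthest-first seeding as an assumption; the function names `farthest_point_seeds`, `assign`, and `bounded_kmeans` are hypothetical and are not taken from the paper.

```python
import math


def farthest_point_seeds(points, k):
    """Greedy seeding (an ASSUMED stand-in for the paper's greedy step):
    start from the first point, then repeatedly add the point that is
    farthest from its nearest already-chosen seed."""
    seeds = [points[0]]
    while len(seeds) < k:
        seeds.append(
            max(points, key=lambda p: min(math.dist(p, s) for s in seeds))
        )
    return seeds


def assign(points, centroids):
    """Return, for every point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        for p in points
    ]


def bounded_kmeans(points, k):
    """Seed greedily, then make at most k assignment passes over the data.

    Returns (centroids, labels, passes), where labels are the assignments
    computed in the last pass that was executed.
    """
    centroids = farthest_point_seeds(points, k)
    labels = None
    passes = 0
    for _ in range(k):  # the pass budget is k, as stated in the abstract
        new_labels = assign(points, centroids)
        passes += 1
        if new_labels == labels:
            break  # assignments are stable, so stop early
        labels = new_labels
        # Move each centroid to the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels, passes


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centroids, labels, passes = bounded_kmeans(data, k=2)
    print(centroids, labels, passes)
```

The design point is that good initial centroids leave little work for the refinement loop, which is why a small fixed pass budget can suffice; whether the paper's greedy step behaves like farthest-first seeding is not stated in the abstract.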
