An Improved MapReduce Design of Kmeans with Iteration Reducing for Clustering Stock Exchange Very Large Datasets

This paper addresses the clustering of very large datasets, one of the most challenging tasks in data mining and processing. We propose an improved MapReduce design of the Kmeans algorithm with an iteration-reducing method. Experiments show that this method reduces both the number of iterations and the execution time of the Kmeans algorithm while retaining 80% of the clustering accuracy. Combining the MapReduce programming paradigm with iteration-reducing techniques makes it possible to process the huge volume of data generated by the daily transactions of stock exchanges, which supports better decision making by analysts.
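As a rough illustration of the general idea, one Kmeans iteration can be split into a map step (assign each point to its nearest centroid) and a reduce step (average the points of each cluster). The sketch below runs in a single process and is not the authors' implementation. The abstract does not describe the iteration-reducing method itself, so the sketch substitutes a simple assumption: stop as soon as the largest centroid shift falls below a tolerance `tol`. All function and parameter names (`map_phase`, `reduce_phase`, `kmeans_mapreduce`, `tol`, `max_iter`) are hypothetical.

```python
from collections import defaultdict
from math import dist


def map_phase(points, centroids):
    """Map: emit (index of nearest centroid, point) for every point."""
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        yield nearest, p


def reduce_phase(pairs, centroids):
    """Reduce: replace each centroid by the mean of the points assigned to it."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    new_centroids = list(centroids)  # an empty cluster keeps its old centroid
    for idx, members in groups.items():
        new_centroids[idx] = tuple(sum(c) / len(members) for c in zip(*members))
    return new_centroids


def kmeans_mapreduce(points, centroids, tol=1e-3, max_iter=100):
    """Repeat map + reduce until the largest centroid shift is <= tol.

    ASSUMPTION: the tolerance-based stop is a stand-in for the paper's
    iteration-reducing method, which the abstract does not specify.
    """
    for iteration in range(1, max_iter + 1):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        shift = max(dist(a, b) for a, b in zip(centroids, updated))
        centroids = updated
        if shift <= tol:
            break
    return centroids, iteration
```

A looser tolerance stops earlier at the cost of less refined centroids, which mirrors the trade-off between iteration count and accuracy reported in the abstract. In an actual MapReduce deployment the map and reduce functions would run as distributed tasks over partitions of the data rather than as local Python calls.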
