A comprehensive study on clustering approaches for big data mining

Technological advances have made it possible to store and process huge amounts of data in far less time than before. The term "Big Data," now used frequently in industrial and research circles, refers simply to such very large volumes of data. The focus is not merely on collecting data but on analyzing it carefully so that meaningful results can be obtained. There are various ways of handling huge incoming streams of data; one is to cluster the data into compact units. Clustering not only reduces the size of the data but also helps in using it more effectively. This paper gives an overview and comparison of basic clustering algorithms and suggests which clustering approaches suit data sets of various sizes. A brief introduction to the evolution of clustering algorithms is also given.
