Studying the performance of clustering techniques for biomedical data using Spark

The volume and growth of data from many sources have brought us into the big data era, in which data can no longer be stored and processed in conventional computing environments. Several big data platforms have been developed to handle such data, among them Hadoop, which combines the Hadoop Distributed File System (HDFS) with the MapReduce framework. Apache Spark is a more recent framework that can run on top of Hadoop. In this paper, we study the performance of different data mining clustering techniques using Spark. The clustering algorithms used are K-means, Bisecting K-means, and Gaussian Mixture Models. Our design uses Spark Resilient Distributed Datasets (RDDs) to handle the colorectal cancer patient data stored in HDFS. The results show that the algorithms yield different sets of clusters; a finding common to all of them is that the average survival months of patients decrease as the cluster center increases.
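The kind of analysis summarized above, clustering patient records and then comparing average survival months across cluster centers, can be illustrated with a minimal sketch. It is plain Python rather than Spark so that it runs without a cluster, and it is not the authors' implementation. The records are synthetic toy values, and the single numeric `feature`, the initial centers, and the helper names `kmeans_1d` and `survival_by_cluster` are all assumptions made for illustration. In Spark, the corresponding estimators would be `KMeans`, `BisectingKMeans`, and `GaussianMixture` from `pyspark.ml.clustering`.

```python
from statistics import mean


def kmeans_1d(values, initial_centers, max_iterations=20):
    """Plain Lloyd's k-means on 1-D values, starting from the given centers."""
    centers = list(initial_centers)
    labels = [0] * len(values)
    for _ in range(max_iterations):
        # Assignment step: each value joins its nearest center.
        labels = [
            min(range(len(centers)), key=lambda j: abs(v - centers[j]))
            for v in values
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for j in range(len(centers)):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centers.append(mean(members) if members else centers[j])
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers, labels


def survival_by_cluster(survival_months, labels, k):
    """Average survival months of the records assigned to each cluster."""
    return [
        mean(s for s, lab in zip(survival_months, labels) if lab == j)
        for j in range(k)
    ]


# Synthetic toy records (NOT data from the paper): (feature, survival_months).
records = [
    (1.0, 60), (1.5, 58), (0.5, 62),
    (5.0, 30), (5.5, 28), (4.5, 32),
    (9.0, 10), (9.5, 8), (10.0, 12),
]
features = [f for f, _ in records]
survival = [s for _, s in records]

# Hypothetical initial centers chosen by hand for reproducibility.
centers, labels = kmeans_1d(features, initial_centers=[0.0, 5.0, 10.0])
avg_survival = survival_by_cluster(survival, labels, k=len(centers))

for center, months in zip(centers, avg_survival):
    print(f"center={center:.1f}  avg_survival_months={months}")
```

On this toy input the clusters are ordered by center, so reading the printed rows top to bottom shows how average survival changes as the center grows. The toy data was constructed to mimic the reported trend and carries no evidential weight.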
