Performance Enhancement of Distributed Clustering for Big Data Analytics

Big Data analytics has recently emerged as a prominent research area in data science. Apache Spark is an open-source distributed data processing platform that uses a distributed memory abstraction to process large volumes of streaming data efficiently. Improving the performance of analytic computational models for streaming big data is important for meeting the requirements of many real-time data analysis tasks. Researchers focus on improving analytic algorithms to reduce analysis time. This paper presents a performance enhancement of the in-memory computational model that selects the most important attributes after the data has been cached in Apache Spark. A performance analysis of the distributed K-Means clustering algorithm based on the in-memory computational model has been conducted. The results show an improvement in the performance of the model.
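The idea of reducing the attribute set before clustering can be illustrated with a small single-machine sketch. The abstract does not say how the most important attributes are chosen, so the variance-based ranking below is an assumption made only for illustration, and the function names (`select_top_variance`, `project`, `kmeans`) are hypothetical. The sketch uses plain Python rather than Spark; in the setting described above, the same two steps would run on a cached distributed dataset.

```python
import random
from statistics import pvariance


def select_top_variance(rows, k):
    """Return the indices of the k columns with the largest variance.

    Variance is an assumed stand-in for "attribute importance";
    the source does not name the selection criterion.
    """
    n_cols = len(rows[0])
    variances = [pvariance([r[j] for r in rows]) for j in range(n_cols)]
    ranked = sorted(range(n_cols), key=lambda j: variances[j], reverse=True)
    return sorted(ranked[:k])


def project(rows, cols):
    """Keep only the selected columns of every row."""
    return [[r[j] for j in cols] for r in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
        for p in points
    ]


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's iteration: assign points, then recompute centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        labels = assign(points, centers)
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assign(points, centers)


# Toy data: columns 0 and 2 separate two groups, column 1 is nearly constant.
data = [
    [1.0, 0.50, 10.0],
    [1.2, 0.51, 10.5],
    [0.8, 0.49, 9.5],
    [8.0, 0.50, 1.0],
    [8.3, 0.52, 1.2],
    [7.9, 0.48, 0.8],
]

cols = select_top_variance(data, 2)      # drop the least informative column
reduced = project(data, cols)            # smaller points, cheaper distances
centers, labels = kmeans(reduced, 2)
print(cols, labels)
```

Each distance computation in K-Means costs time proportional to the number of attributes, so dropping low-value attributes before the iterations begin reduces the work done in every pass over the data.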
