Scalable machine‐learning algorithms for big data analytics: a comprehensive review

Big data analytics is an emerging technology that promises better insights from huge, heterogeneous data. It involves selecting a suitable big data storage and computational framework and augmenting it with scalable machine‐learning algorithms. Despite the considerable attention big data analytics has received, no extensive literature survey focused on parallel, data‐intensive machine‐learning algorithms for big data has been conducted so far. This paper provides a comprehensive overview of the machine‐learning algorithms used in big data analytics and attempts to identify gaps in existing work, paving the way for further research on parallel, scalable algorithms for big data. WIREs Data Mining Knowl Discov 2016, 6:194–214. doi: 10.1002/widm.1194


[28]  Athanasios V. Vasilakos,et al.  Big data analytics: a survey , 2015, Journal of Big Data.

[29]  Suhail Sami,et al.  Extract Five Categories CPIVW from the 9V's Characteristics of the Big Data , 2016 .

[30]  Roberto J. Bayardo,et al.  PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce , 2009, Proc. VLDB Endow..

[31]  A. B. M. Moniruzzaman NewSQL: Towards Next-Generation Scalable RDBMS for Online Transaction Processing (OLTP) for Big Data Management , 2014, ArXiv.

[32]  Francisco Herrera,et al.  A MapReduce Approach to Address Big Data Classification Problems Based on the Fusion of Linguistic Fuzzy Rules , 2015, Int. J. Comput. Intell. Syst..

[33]  Nathalie Japkowicz,et al.  Big Data Analysis: New Algorithms for a New Society , 2015 .

[34]  Shouyang Wang,et al.  A New Back-Propagation Neural Network Algorithm for a Big Data Environment Based on Punishing Characterized Active Learning Strategy , 2013, Int. J. Knowl. Syst. Sci..

[35]  K. Bakshi,et al.  Considerations for big data: Architecture and approach , 2012, 2012 IEEE Aerospace Conference.

[36]  Vasile PURDIL,et al.  MR-Tree-A Scalable MapReduce Algorithm for Building Decision Trees , 2014 .

[37]  Sergei Vassilvitskii,et al.  k-means++: the advantages of careful seeding , 2007, SODA '07.

[38]  Enda Barrett,et al.  A parallel framework for Bayesian reinforcement learning , 2014, Connect. Sci..

[39]  Saeed Shahrivari,et al.  Beyond Batch Processing: Towards Real-Time and Streaming Big Data , 2014, Comput..

[40]  Xue-wen Chen,et al.  Large-Scale Deep Belief Nets With MapReduce , 2014, IEEE Access.

[41]  Christophe Salperwyck,et al.  CourboSpark: Decision Tree for Time-series on Spark , 2015, AALTD@PKDD/ECML.

[42]  Syed Akhter Hossain,et al.  NoSQL Database: New Era of Databases for Big data Analytics - Classification, Characteristics and Comparison , 2013, ArXiv.

[43]  John H. Williams,et al.  Design and Implementation of Programming Languages , 1977 .

[44]  Maozhen Li,et al.  The Parallelization of Back Propagation Neural Network in MapReduce and Spark , 2016, International Journal of Parallel Programming.

[45]  R. Campbell Two Case Studies , 1998 .

[46]  Sergei Vassilvitskii,et al.  Scalable K-Means++ , 2012, Proc. VLDB Endow..

[47]  ZhangHai-Jun,et al.  Parallel implementation of multilayered neural networks based on Map-Reduce on cloud computing clusters , 2016 .

[48]  S. R,et al.  Data Mining with Big Data , 2017, 2017 11th International Conference on Intelligent Systems and Control (ISCO).

[49]  Weizhong Yan,et al.  p-PIC: Parallel power iteration clustering for big data , 2013, J. Parallel Distributed Comput..

[50]  Nathan Marz,et al.  Big Data: Principles and best practices of scalable realtime data systems , 2015 .

[51]  Christos Faloutsos,et al.  Clustering very large multi-dimensional datasets with MapReduce , 2011, KDD.

[52]  Jimeng Sun,et al.  DisCo: Distributed Co-clustering with Map-Reduce: A Case Study towards Petabyte-Scale End-to-End Mining , 2008, 2008 Eighth IEEE International Conference on Data Mining.

[53]  Veda C. Storey,et al.  Business Intelligence and Analytics: From Big Data to Big Impact , 2012, MIS Q..

[54]  Xue-wen Chen,et al.  Big Data Deep Learning: Challenges and Perspectives , 2014, IEEE Access.

[55]  Nan-Feng Xiao,et al.  Parallel implementation of multilayered neural networks based on Map-Reduce on cloud computing clusters , 2015, Soft Computing.

[56]  Ke Xu,et al.  A MapReduce based Parallel SVM for Email Classification , 2014, J. Networks.

[57]  Genshe Chen,et al.  Scalable sentiment classification for Big Data analysis using Naïve Bayes Classifier , 2013, 2013 IEEE International Conference on Big Data.