A survey on platforms for big data analytics

This paper provides an in-depth analysis of the platforms available for big data analytics. It surveys the hardware platforms and assesses the advantages and drawbacks of each against six metrics: scalability, data I/O rate, fault tolerance, real-time processing, supported data size, and iterative task support. In addition to the hardware, the software frameworks used on each platform are described in detail, along with their strengths and drawbacks. Some of the critical characteristics described here can help readers make an informed choice of platform for their computational needs. A star-ratings table gives a rigorous qualitative comparison of the platforms on each of the six characteristics, which are critical for big data analytics algorithms. To provide more insight into the effectiveness of each platform in the context of big data analytics, implementation-level details of the widely used k-means clustering algorithm on the various platforms are also described in the form of pseudocode.
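For orientation, the k-means algorithm referred to above can be sketched on a single machine as follows. This is a minimal illustration and not the paper's pseudocode: the function names (`assign`, `update`, `kmeans`), the random initialization, and the stopping rule are assumptions made for this example. The assignment step is independent per point and the update step is a per-cluster average, which is the structure that parallel and distributed versions typically exploit.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points: Sequence[Point], centroids: Sequence[Point]) -> List[int]:
    """Assignment step: label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
        for p in points
    ]


def update(points: Sequence[Point], labels: Sequence[int],
           centroids: Sequence[Point]) -> List[Point]:
    """Update step: move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            # Assumption for this sketch: an empty cluster keeps its old centroid.
            new_centroids.append(old)
            continue
        new_centroids.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
        )
    return new_centroids


def kmeans(points: Sequence[Point], k: int, max_iter: int = 100,
           seed: int = 0) -> Tuple[List[Point], List[int]]:
    """Alternate assignment and update until the labels stop changing."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct input points chosen at random.
    centroids = rng.sample(list(points), k)
    labels: List[int] = []
    for _ in range(max_iter):
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break  # converged: no point changed cluster
        labels = new_labels
        centroids = update(points, labels, centroids)
    return centroids, labels


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    centers, cluster_ids = kmeans(data, k=2)
    print(centers, cluster_ids)
```

Each iteration reads the whole data set, which is why iterative task support and data I/O rate appear among the comparison metrics.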
