Apache Hama: An Emerging Bulk Synchronous Parallel Computing Framework for Big Data Applications

In today’s highly interconnected, networked society, demand for big data processing frameworks continues to grow, and parallel and distributed computing is the widely adopted model for processing big data. This paper documents the significant progress achieved in distributed computing frameworks, focusing on Apache Hama, a top-level project of the Apache Software Foundation based on the bulk synchronous parallel (BSP) model. The comparative studies and empirical evaluations performed in this paper reveal Hama’s potential and efficacy in big data applications. In particular, we present a benchmark evaluation of Hama’s graph package and Apache Giraph using the PageRank algorithm. The results show that Hama outperforms Giraph in terms of scalability and computational speed. However, despite this progress, a number of challenging issues continue to prevent Hama from reaching its full potential at large scale. This paper also describes these challenges, analyzes the solutions proposed to overcome them, and highlights research opportunities.
