Melia: A MapReduce Framework on OpenCL-Based FPGAs

MapReduce, originally developed by Google for search applications, has recently become a popular programming framework for parallel and distributed environments. This paper presents an energy-efficient architecture design for MapReduce on Field Programmable Gate Arrays (FPGAs). The major goal is to enable users to program FPGAs with simple MapReduce interfaces, and meanwhile to embrace automatic performance optimizations within the MapReduce framework. Compared to other processors like CPUs and GPUs, FPGAs are (re-)programmable hardware and have very low energy consumption. However, the design and implementation of MapReduce on FPGAs can be challenging: firstly, FPGAs are usually programmed with hardware description languages, which hurts the programmability of the MapReduce design to its users; secondly, since MapReduce has irregular access patterns (especially in the reduce phase) and needs to support user-defined functions, careful designs and optimizations are required for efficiency. In this paper, we design, implement and evaluate Melia, a MapReduce framework on FPGAs. Melia takes advantage of the recent OpenCL programming framework developed for Altera FPGAs, and abstracts FPGAs behind the simple and familiar MapReduce interfaces in C. We further develop a series of FPGA-centric optimization techniques to improve the efficiency of Melia, and a costand resource-based approach to automate the parameter settings for those optimizations. We evaluate Melia on a recent Altera Stratix V GX FPGA with a number of commonly used MapReduce benchmarks. Our results demonstrate that 1) the efficiency and effectiveness of our optimizations and automated parameter setting approach, 2) Melia can achieve promising energy efficiency in comparison with its counterparts on CPUs/GPUs on both single-FPGA and cluster settings.

[1]  Bingsheng He,et al.  MrPhi: An Optimized MapReduce Framework on Intel Xeon Phi Coprocessors , 2015, IEEE Transactions on Parallel and Distributed Systems.

[2]  W. Bohm,et al.  The Case for Dynamic Execution on Dynamic Hardware , 2007 .

[3]  Peter Lee,et al.  An FPGA implementation of a flexible, parallel image processing architecture suitable for embedded vision systems , 2003, Proceedings International Parallel and Distributed Processing Symposium.

[4]  Gustavo Alonso,et al.  Streams on Wires - A Query Compiler for FPGAs , 2009, Proc. VLDB Endow..

[5]  Dmitri Loguinov,et al.  On the performance of MapReduce: A stochastic approach , 2014, 2014 IEEE International Conference on Big Data (Big Data).

[6]  Yan Solihin,et al.  Modeling and Analyzing Key Performance Factors of Shared Memory MapReduce , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.

[7]  Gustavo Alonso,et al.  FPGA acceleration for the frequent item problem , 2010, 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010).

[8]  丸山 勉,et al.  Field Programmable Gate Array による複雑適応系の計算の高速化 , 1999 .

[9]  Gagan Agrawal,et al.  Accelerating MapReduce on a coupled CPU-GPU architecture , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[10]  Ron Sass,et al.  Reconfigurable Computing Cluster (RCC) Project: Investigating the Feasibility of FPGA-Based Petascale Computing , 2007 .

[11]  Gerhard Fettweis,et al.  An application-specific instruction set for accelerating set-oriented database primitives , 2014, SIGMOD Conference.

[12]  John D. Davis,et al.  BLAS Comparison on FPGA, CPU and GPU , 2010, 2010 IEEE Computer Society Annual Symposium on VLSI.

[13]  Amin Vahdat,et al.  A scalable, commodity data center network architecture , 2008, SIGCOMM '08.

[14]  Yon Dohn Chung,et al.  Parallel data processing with MapReduce: a survey , 2012, SGMD.

[15]  Russell Tessier,et al.  FPGA Architecture: Survey and Challenges , 2008, Found. Trends Electron. Des. Autom..

[16]  Seyong Lee,et al.  PUMA: Purdue MapReduce Benchmarks Suite , 2012 .

[17]  Jim Tørresen,et al.  Migrating Static Systems to Partially Reconfigurable Systems on Spartan-6 FPGAs , 2011, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.

[18]  Gustavo Alonso,et al.  Complex event detection at wire speed with FPGAs , 2010, Proc. VLDB Endow..

[19]  Beng Chin Ooi,et al.  Distributed data management using MapReduce , 2014, CSUR.

[20]  Yao Zhang,et al.  A quantitative performance analysis model for GPU architectures , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[21]  Wei Jiang,et al.  MATE-CG: A Map Reduce-Like Framework for Accelerating Data-Intensive Computations on Heterogeneous Clusters , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.

[22]  Wenguang Chen,et al.  MapCG: Writing parallel program portable between CPU and GPU , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[23]  Wei Zhang,et al.  A performance analysis framework for optimizing OpenCL applications on FPGAs , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[24]  Randy H. Katz,et al.  Greening the Switch , 2008, HotPower.

[25]  Doris Chen,et al.  Fractal video compression in OpenCL: An evaluation of CPUs, GPUs, and FPGAs as acceleration platforms , 2013, 2013 18th Asia and South Pacific Design Automation Conference (ASP-DAC).

[26]  Gagan Agrawal,et al.  Optimizing MapReduce for GPUs with effective shared memory usage , 2012, HPDC '12.

[27]  Yu Wang,et al.  FPMR: MapReduce framework on FPGA , 2010, FPGA '10.

[28]  Philip Heng Wai Leong,et al.  Map-reduce as a Programming Model for Custom Computing Machines , 2008, 2008 16th International Symposium on Field-Programmable Custom Computing Machines.

[29]  Martin Margala,et al.  High Level Programming for Heterogeneous Architectures , 2014, ArXiv.

[30]  Naga K. Govindaraju,et al.  Mars: A MapReduce Framework on graphics processors , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[31]  John Freeman,et al.  From opencl to high-performance hardware on FPGAS , 2012, 22nd International Conference on Field Programmable Logic and Applications (FPL).

[32]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[33]  Hyesoon Kim,et al.  An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness , 2009, ISCA '09.

[34]  Implementing FPGA Design with the OpenCL Standard , 2010 .

[35]  Gustavo Alonso,et al.  A flexible hash table design for 10GBPS key-value stores on FPGAS , 2013, 2013 23rd International Conference on Field programmable Logic and Applications.

[36]  Christoforos E. Kozyrakis,et al.  Evaluating MapReduce for Multi-core and Multiprocessor Systems , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[37]  Jignesh M. Patel,et al.  Towards Energy-Efficient Database Cluster Design , 2012, Proc. VLDB Endow..

[38]  Kunle Olukotun,et al.  Hardware acceleration of database operations , 2014, FPGA.

[39]  Gustavo Alonso,et al.  Ibex - An Intelligent Storage Engine with Support for Advanced SQL Off-loading , 2014, Proc. VLDB Endow..

[40]  Wei Zhang,et al.  A study of data partitioning on OpenCL-based FPGAs , 2015, 2015 25th International Conference on Field Programmable Logic and Applications (FPL).

[41]  Jens Teubner,et al.  Data Processing on FPGAs , 2013, Proc. VLDB Endow..

[42]  Jie Huang,et al.  The HiBench benchmark suite: Characterization of the MapReduce-based data analysis , 2010, 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010).

[43]  Jim Costabile Hardware Acceleration for Map / Reduce Analysis of Streaming Data Using OpenCL , 2015 .

[44]  Jens Teubner,et al.  FPGA: what's in it for a database? , 2009, SIGMOD Conference.

[45]  Bingsheng He,et al.  Mars: Accelerating MapReduce with Graphics Processors , 2011, IEEE Transactions on Parallel and Distributed Systems.

[46]  Jens Teubner,et al.  How soccer players would do stream joins , 2011, SIGMOD '11.

[47]  W. Luk,et al.  Axel: a heterogeneous cluster with FPGAs and GPUs , 2010, FPGA '10.