A Case for Embedded FPGA-based SoCs in Energy-Efficient Acceleration of Graph Problems

Sparse graph problems are notoriously hard to accelerate on conventional platforms because their irregular memory access patterns underutilize memory bandwidth. On traditional x86-based systems, these bottlenecks mean that sparse graph problems scale very poorly in terms of both performance and power efficiency. A cluster of embedded systems-on-chip (SoCs) with closely coupled FPGA accelerators can support distributed memory access with better-matched low-power processing. We first conduct preliminary experiments across a range of commercial off-the-shelf (COTS) embedded SoCs to establish their promise for energy-efficient acceleration of sparse problems. We then select the Xilinx Zynq SoC with FPGA accelerators to construct a prototype 32-node Beowulf cluster. We develop specialized MPI routines and memory DMA offload engines to support irregular communication efficiently. In this setup, the ARM processor acts as a data marshaller for local DMA traffic as well as remote MPI traffic, while the FPGA may be used as a programmable accelerator. Across a set of benchmark graphs, we show that a 32-node embedded SoC cluster can exceed the energy efficiency of an Intel E5-2407 by as much as 1.7× at a total graph processing capacity of 91-95 MTEPS for graphs as large as 32 million nodes and edges.
