A reduced-precision streaming SpMV architecture for Personalized PageRank on FPGA

Sparse matrix-vector multiplication (SpMV) is employed in many data-analytic workloads in which low latency and high throughput are more valuable than exact numerical convergence. FPGAs provide fast execution while offering fine-grained control over the accuracy of the results through reduced-precision fixed-point arithmetic. In this work, we propose a novel streaming implementation of Coordinate Format (COO) sparse matrix-vector multiplication and study its effectiveness when applied to the Personalized PageRank algorithm, a common building block of recommender systems in e-commerce websites and social networks. On 8 different datasets, our implementation achieves speedups of up to 6x over a reference floating-point FPGA architecture and a state-of-the-art multi-threaded CPU implementation, while preserving the numerical fidelity of the results and reaching up to 42x higher energy efficiency than the CPU implementation.
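
To make the idea concrete, the following is a minimal software sketch of Personalized PageRank built on a COO SpMV that consumes its entries one at a time, with all values held as fixed-point integers. It is not the architecture proposed in this work: the 24-bit fractional width, the truncating multiply, the function names, and the choice to return the mass of dangling vertices to the personalization vertex are all assumptions made for illustration.

```python
# Illustrative sketch only; bit width, rounding mode and names are assumptions.
FRAC_BITS = 24            # assumed number of fractional bits
ONE = 1 << FRAC_BITS      # fixed-point representation of 1.0


def to_fixed(x):
    """Quantize a float in [0, 1] to a fixed-point integer."""
    return int(round(x * ONE))


def to_float(q):
    """Convert a fixed-point integer back to a float."""
    return q / ONE


def coo_spmv_fixed(rows, cols, vals, x, n):
    """Compute y = A @ x, reading the COO entries of A as a stream.

    Each (row, col, val) triple is consumed once, in order; the product is
    truncated back to FRAC_BITS fractional bits before accumulation.
    """
    y = [0] * n
    for r, c, v in zip(rows, cols, vals):
        y[r] += (v * x[c]) >> FRAC_BITS
    return y


def personalized_pagerank_fixed(edges, n, seed, alpha=0.85, iters=30):
    """Power iteration p <- alpha * A @ p + (1 - alpha) * e_seed."""
    out_deg = [0] * n
    for src, _ in edges:
        out_deg[src] += 1

    # Column-stochastic transition matrix in COO form: A[dst, src] = 1 / out_deg[src].
    rows = [dst for _, dst in edges]
    cols = [src for src, _ in edges]
    vals = [to_fixed(1.0 / out_deg[src]) for src, _ in edges]

    a = to_fixed(alpha)
    p = [0] * n
    p[seed] = ONE  # all probability mass starts on the personalization vertex

    for _ in range(iters):
        y = coo_spmv_fixed(rows, cols, vals, p, n)
        # Assumption: mass held by vertices without out-edges returns to the seed.
        dangling = sum(p[i] for i in range(n) if out_deg[i] == 0)
        nxt = [(a * y[i]) >> FRAC_BITS for i in range(n)]
        nxt[seed] += (ONE - a) + ((a * dangling) >> FRAC_BITS)
        p = nxt

    return [to_float(q) for q in p]


# Small directed graph given as (source, destination) pairs.
edges = [(0, 1), (1, 2), (2, 0), (0, 2)]
scores = personalized_pagerank_fixed(edges, n=3, seed=0)
```

Because every product is truncated, the scores sum to slightly less than 1; narrowing `FRAC_BITS` enlarges that gap, which is the accuracy-versus-cost trade-off that a reduced-precision design has to manage.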
