Substream-Centric Maximum Matchings on FPGA

Developing high-performance and energy-efficient algorithms for maximum matchings is becoming increasingly important in social network analysis, computational sciences, scheduling, and other domains. In this work, we propose the first maximum matching algorithm designed for FPGAs; it is energy-efficient and has provable guarantees on accuracy, performance, and storage utilization. To achieve this, we forego popular graph processing paradigms, such as vertex-centric programming, that often entail large communication costs. Instead, we propose a substream-centric approach, in which the input stream of data is divided into substreams that are processed independently, enabling more parallelism while lowering communication costs. We base our work on the theory of streaming graph algorithms and analyze 14 models and 28 algorithms. We use this analysis to provide a theoretical underpinning that matches the physical constraints of FPGA platforms. Our algorithm delivers high performance (more than 4× speedup over tuned parallel CPU variants), low memory usage, high accuracy, and effective usage of FPGA resources. The substream-centric approach could easily be extended to other algorithms to offer low-power and high-performance graph processing on FPGAs.
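The substream-centric idea can be illustrated with a small sequential sketch. The abstract does not say how substreams are formed or merged, so the following is an assumption for illustration only: each substream i receives the edges whose weight is at least (1 + eps)^i, a greedy maximal matching is kept per substream, and the per-substream matchings are merged greedily from the heaviest substream down. The function name `substream_matching` and its parameters are hypothetical, and the code models the data flow in software rather than the FPGA design itself.

```python
from typing import Iterable, List, Tuple

Edge = Tuple[int, int, float]  # (u, v, weight)


def substream_matching(edges: Iterable[Edge], num_vertices: int,
                       num_substreams: int, eps: float) -> List[Edge]:
    """Illustrative sketch (assumed scheme, not the paper's exact design)."""
    # Assumption: substream i accepts edges with weight >= (1 + eps)^i.
    thresholds = [(1.0 + eps) ** i for i in range(num_substreams)]
    # One "matched" bit per vertex and per substream; substreams share no state,
    # so each inner-loop iteration could run as an independent parallel unit.
    matched = [[False] * num_vertices for _ in range(num_substreams)]
    partial: List[List[Edge]] = [[] for _ in range(num_substreams)]

    # Single pass over the edge stream.
    for u, v, w in edges:
        if u == v:
            continue  # self-loops cannot be part of a matching
        for i, t in enumerate(thresholds):
            if w < t:
                break  # thresholds increase, so later substreams also reject
            if not matched[i][u] and not matched[i][v]:
                matched[i][u] = matched[i][v] = True
                partial[i].append((u, v, w))

    # Merge step: prefer edges from the heaviest substreams.
    used = [False] * num_vertices
    result: List[Edge] = []
    for i in reversed(range(num_substreams)):
        for u, v, w in partial[i]:
            if not used[u] and not used[v]:
                used[u] = used[v] = True
                result.append((u, v, w))
    return result


if __name__ == "__main__":
    stream = [(0, 1, 1.0), (1, 2, 4.0), (2, 3, 1.0)]
    print(substream_matching(stream, num_vertices=4, num_substreams=3, eps=1.0))
```

In this sketch the per-substream state is only one bit per vertex plus the locally matched edges, and no substream reads another's state during the pass; communication is confined to the final merge.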

[142]  Torsten Hoefler,et al.  Substream-Centric Maximum Matchings on FPGA , 2019, FPGA.
