SparseP: Towards Efficient Sparse Matrix Vector Multiplication on Real Processing-In-Memory Systems

Several manufacturers have already started to commercialize near-bank Processing-In-Memory (PIM) architectures, after decades of research efforts. Near-bank PIM architectures place simple cores close to DRAM banks. Recent research demonstrates that they can yield significant performance and energy improvements in parallel applications by alleviating data access costs. Real PIM systems can provide high levels of parallelism, large aggregate memory bandwidth and low memory access latency, thereby being a good fit to accelerate the Sparse Matrix Vector Multiplication (SpMV) kernel. SpMV has been characterized as one of the most significant and thoroughly studied scientific computation kernels. It is primarily a memory-bound kernel with intensive memory accesses due to its algorithmic nature, the compressed matrix format used, and the sparsity patterns of the input matrices. This paper provides the first comprehensive analysis of SpMV on a real-world PIM architecture, and presents SparseP, the first SpMV library for real PIM architectures. We make three key contributions. First, we implement a wide variety of software strategies for SpMV on a multithreaded PIM core, including (1) various compressed matrix formats, (2) load balancing schemes across parallel threads, and (3) synchronization approaches, and characterize the computational limits of a single multithreaded PIM core. Second, we design various load balancing schemes across multiple PIM cores, and two types of data partitioning techniques to execute SpMV on thousands of PIM cores: (1) 1D-partitioned kernels that perform the complete SpMV computation using only PIM cores, and (2) 2D-partitioned kernels that strike a balance between computation costs and the costs of data transfers to PIM-enabled memory.
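To make the compressed formats concrete, the sketch below shows SpMV over the CSR and COO formats named in the abstract. This is a minimal illustration under the standard definitions of these formats, not SparseP's actual implementation (the library's kernels are written for PIM cores); all function and variable names here are hypothetical.

```python
# Minimal SpMV (y = A @ x) sketches over the CSR and COO compressed
# sparse formats; illustrative only, not SparseP code.

def spmv_csr(row_ptr, col_idx, vals, x):
    """SpMV with A in Compressed Sparse Row form: row_ptr delimits
    each row's slice of the col_idx/vals arrays."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y

def spmv_coo(rows, cols, vals, x, n_rows):
    """SpMV with A in Coordinate form: explicit (row, col, val) triples."""
    y = [0.0] * n_rows
    for r, c, v in zip(rows, cols, vals):
        y[r] += v * x[c]
    return y

# 2x3 matrix [[10, 0, 20], [0, 30, 0]] in both formats
row_ptr, col_idx, vals = [0, 2, 3], [0, 2, 1], [10.0, 20.0, 30.0]
x = [1.0, 2.0, 3.0]
print(spmv_csr(row_ptr, col_idx, vals, x))          # [70.0, 60.0]
print(spmv_coo([0, 0, 1], col_idx, vals, x, 2))     # [70.0, 60.0]
```

Note how every nonzero triggers an indirect, data-dependent load of `x[col_idx[k]]`; this irregular access pattern is why SpMV is memory-bound and why the format and sparsity pattern matter so much.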
Third, we compare SpMV execution on a real-world PIM system with 2528 PIM cores to an Intel Xeon CPU and an NVIDIA Tesla V100 GPU, to study the performance and energy efficiency of both memory-centric PIM systems and conventional processor-centric CPU/GPU systems for the SpMV kernel. The SparseP software package provides 25 SpMV kernels for real PIM systems, supporting the four most widely used compressed matrix formats (CSR, COO, BCSR and BCOO) and a wide range of data types. SparseP is publicly and freely available at https://github.com/CMU-SAFARI/SparseP. Our extensive evaluation using 26 matrices with various sparsity patterns provides new insights and recommendations for software designers and hardware architects to efficiently accelerate the SpMV kernel on real PIM systems.
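The 1D-partitioned kernels described above depend on balancing work across thousands of PIM cores. One simple scheme is to split the rows of a CSR matrix into contiguous chunks with roughly equal nonzero counts, since nonzeros (not rows) determine each core's work. The sketch below illustrates that idea only; it is a hypothetical simplification, not one of SparseP's actual balancing schemes, and all names are invented for illustration.

```python
# Illustrative 1D row partitioning: split rows across PIM cores so that
# each core receives roughly the same number of nonzeros. Hypothetical
# sketch; not SparseP's implementation.

def partition_rows_by_nnz(row_ptr, n_cores):
    """Return row boundaries [b0, b1, ..., bn] so core c handles rows
    bounds[c]..bounds[c+1]-1, with ~total_nnz/n_cores nonzeros each."""
    total_nnz = row_ptr[-1]
    target = total_nnz / n_cores
    bounds = [0]
    for c in range(1, n_cores):
        goal = c * target
        i = bounds[-1]
        # advance until the cumulative nonzero count reaches this share
        while i < len(row_ptr) - 1 and row_ptr[i] < goal:
            i += 1
        bounds.append(i)
    bounds.append(len(row_ptr) - 1)
    return bounds

# 4 rows with nonzero counts [4, 1, 1, 4], split across 2 cores:
row_ptr = [0, 4, 5, 6, 10]
print(partition_rows_by_nnz(row_ptr, 2))  # [0, 2, 4] -> 5 nnz per core
```

A row-count split would instead give the cores 5 vs. 5 rows but 9 vs. 1 nonzeros on a skewed matrix, which is exactly the imbalance that nonzero-aware schemes avoid.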
