NDS: N-Dimensional Storage

Demands for efficient computing among applications that use high-dimensional datasets have led to multi-dimensional computers: computers that leverage heterogeneous processors/accelerators offering various processing models to support multi-dimensional compute kernels. Yet the front-end for these processors/accelerators is inefficient, as memory/storage systems often expose only entrenched linear-space abstractions to an application and ignore the benefits of modern memory/storage devices, such as support for multi-dimensionality through different types of parallel access. This paper presents N-Dimensional Storage (NDS), a novel multi-dimensional memory/storage system that fulfills the demands of modern hardware accelerators and applications. NDS abstracts memory arrays as native storage in which applications describe data locations using coordinates in any application-defined multi-dimensional space, thereby avoiding the software overhead associated with data-object transformations. NDS gauges application demand and the underlying memory-device architecture to intelligently determine the physical data layout that maximizes access bandwidth and minimizes the overhead of presenting objects for arbitrary applications. This paper also demonstrates an efficient architecture for supporting NDS. We evaluate a set of linear/tensor algebra workloads, along with graph and data-mining algorithms, on custom-built systems using each architecture. Our results show a 5.73× speedup with appropriate architectural support.
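The abstract's central idea is that applications name data by coordinates in an application-defined multi-dimensional space while the storage system owns the coordinate-to-physical mapping. The C sketch below illustrates what such a coordinate-based interface might look like; it is a minimal illustration under our own assumptions, and every name in it (`nds_object`, `nds_read`, `linearize`) is hypothetical rather than the paper's actual API.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* An application-defined N-dimensional object (hypothetical, not the
 * paper's API). The application names data by coordinates; the storage
 * system owns the physical layout. */
typedef struct {
    size_t  ndims;     /* number of dimensions */
    size_t *extent;    /* size of each dimension */
    char   *base;      /* backing store (here: plain host memory) */
    size_t  elem_size; /* bytes per element */
} nds_object;

/* Row-major flattening. Under a linear-space abstraction the application
 * must do this itself; the point of an NDS-style design is that the
 * storage system picks this mapping to maximize access bandwidth. */
static size_t linearize(const nds_object *o, const size_t *coord)
{
    size_t off = 0;
    for (size_t d = 0; d < o->ndims; d++)
        off = off * o->extent[d] + coord[d];
    return off;
}

/* Coordinate-based read: the application never computes a byte offset. */
static void nds_read(const nds_object *o, const size_t *coord, void *out)
{
    memcpy(out, o->base + linearize(o, coord) * o->elem_size, o->elem_size);
}

int main(void)
{
    size_t extent[2] = { 4, 8 };  /* a 4x8 matrix of doubles */
    nds_object o = { 2, extent, NULL, sizeof(double) };
    o.base = calloc(extent[0] * extent[1], o.elem_size);
    if (!o.base)
        return 1;

    size_t coord[2] = { 2, 5 };
    double v;
    nds_read(&o, coord, &v);      /* fetch element (2, 5) by coordinates */
    printf("element (2,5) = %f\n", v);

    free(o.base);
    return 0;
}
```

In the paper's setting, the mapping inside `linearize` would be chosen by the storage system per device architecture (for example, a tiled or bank-interleaved layout supporting parallel access) rather than fixed row-major; the sketch fixes it only to stay self-contained.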
