Athena: high-performance sparse tensor contraction sequence on heterogeneous memory

Sparse tensor contraction (SpTC) sequences are widely employed in many fields, such as chemistry and physics. However, implementing such sequences efficiently faces multiple challenges, including redundant computations and memory operations, massive memory consumption, and inefficient utilization of hardware. To address these challenges, we introduce Athena, a high-performance framework for SpTC sequences. Athena introduces new data structures, leverages the emerging Optane-based heterogeneous memory (HM) architecture, and adopts stage parallelism. In particular, Athena introduces a shared, hash table-based sparse accumulator to eliminate unnecessary input processing and data migration; it uses a novel data-semantic-guided dynamic migration solution to make the best use of Optane-based HM for high performance; and it co-runs execution phases with different characteristics to achieve high hardware utilization. Evaluated on 12 datasets, Athena achieves a 327-7362× speedup over the state-of-the-art SpTC algorithm. With dynamic data placement guided by data semantics, Athena outperforms a state-of-the-art software-based data management solution, a hardware-based data management solution, and a PMM-only configuration on Optane-based HM by 1.58×, 1.82×, and 2.34×, respectively. Athena also demonstrates its effectiveness in quantum chemistry and physics scenarios.
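
To make the shared sparse-accumulator idea concrete, below is a minimal sketch in Python of how a hash-table sparse accumulator merges partial products during an element-wise sparse tensor contraction. It is purely illustrative: the COO dictionary layout and the contract function are assumptions made for exposition, not Athena's actual data structures or API.

    # Minimal sketch of a hash-table sparse accumulator for an element-wise
    # sparse tensor contraction Z(i,k) = sum_j X(i,j) * Y(j,k).
    # Tensors are stored in coordinate (COO) form as {index-tuple: value} maps.
    # This illustrates the general technique, not Athena's implementation.

    from collections import defaultdict

    def contract(X, Y):
        # Group Y's nonzeros by their contraction index j, so each nonzero
        # of X only visits the matching nonzeros of Y.
        y_by_j = defaultdict(list)
        for (j, k), v in Y.items():
            y_by_j[j].append((k, v))

        # The sparse accumulator: a hash table keyed by the output index
        # (i, k). Colliding partial products are summed in place, so no
        # intermediate list of partial results has to be materialized,
        # sorted, or merged.
        acc = defaultdict(float)
        for (i, j), u in X.items():
            for k, v in y_by_j.get(j, ()):
                acc[(i, k)] += u * v
        return dict(acc)

    # Example: two tiny sparse 2-D tensors in COO form.
    X = {(0, 1): 2.0, (1, 0): 3.0}
    Y = {(1, 2): 4.0, (0, 2): 5.0}
    print(contract(X, Y))   # {(0, 2): 8.0, (1, 2): 15.0}

Because the accumulator is a single hash table keyed by output coordinates, it can also be shared across the contractions of a sequence, which is how a shared accumulator avoids re-processing inputs and re-migrating data between steps.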
