Athena: high-performance sparse tensor contraction sequence on heterogeneous memory
Jiajia Li | Dong Li | Roberto Gioiosa | Jiawen Liu
[1] George Karypis, et al. Accelerating the Tucker Decomposition with Compressed Sparse Tensors, 2017, Euro-Par.
[2] Alfio Lazzaro, et al. DBCSR: A Blocked Sparse Tensor Algebra Library, 2019, PARCO.
[3] Evgeny Epifanovsky, et al. New implementation of high-level correlated methods using a general block tensor library for high-performance electronic structure calculations, 2013, J. Comput. Chem.
[4] Frank Neese, et al. Sparse maps--A systematic infrastructure for reduced-scaling electronic structure methods. II. Linear scaling domain based pair natural orbital coupled cluster theory, 2016, The Journal of chemical physics.
[5] Minjia Zhang, et al. Sentinel: Efficient Tensor Migration and Allocation on Heterogeneous Memory Systems for Deep Learning, 2021, 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA).
[6] John F. Stanton, et al. A massively parallel tensor contraction framework for coupled-cluster computations, 2014, J. Parallel Distributed Comput.
[7] Keshav Pingali, et al. Single machine graph analytics on massive datasets using Intel optane DC persistent memory, 2019, Proc. VLDB Endow.
[8] Rachata Ausavarungnirun, et al. Row buffer locality aware caching policies for hybrid memories, 2012, 2012 IEEE 30th International Conference on Computer Design (ICCD).
[9] Jie Ren, et al. Exploring Non-Volatility of Non-Volatile Memory for High Performance Computing Under Failures, 2020, 2020 IEEE International Conference on Cluster Computing (CLUSTER).
[10] Jimeng Sun, et al. HiCOO: Hierarchical Storage of Sparse Tensors, 2018, SC18: International Conference for High Performance Computing, Networking, Storage and Analysis.
[11] Kim Batselier, et al. Faster Tensor Train Decomposition for Sparse Data, 2019, J. Comput. Appl. Math.
[12] Ricardo Bianchini, et al. Page placement in hybrid memory systems, 2011, ICS '11.
[13] Sudarsun Kannan, et al. Durable Transactional Memory Can Scale with Timestone, 2020, ASPLOS.
[14] AutoHOOT: Automatic High-Order Optimization for Tensors, 2020, PACT.
[15] Richard Veras, et al. Analytical cache modeling and tilesize optimization for tensor contractions, 2019, SC.
[16] Onur Mutlu, et al. Panthera: holistic memory management for big data processing over hybrid memories, 2019, PLDI.
[17] Daniel Kats, et al. Sparse tensor framework for implementation of general local correlation methods, 2013, The Journal of chemical physics.
[18] Sachin Katti, et al. Reducing DRAM footprint with NVM in Facebook, 2018, EuroSys.
[19] Richard W. Vuduc, et al. Load-Balanced Sparse MTTKRP on GPUs, 2019, 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[20] Anima Anandkumar, et al. Tensor decompositions for learning latent variable models, 2012, J. Mach. Learn. Res.
[21] Tamara G. Kolda, et al. Tensor Decompositions and Applications, 2009, SIAM Rev.
[22] Xu Liu, et al. ATMem: adaptive data placement in graph applications on heterogeneous memories, 2020, CGO.
[23] Ada Gavrilovska, et al. HeteroOS — OS design for heterogeneous memory management in datacenter, 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).
[24] Adam Zalcman, et al. TensorNetwork: A Library for Physics and Machine Learning, 2019, ArXiv.
[25] James Demmel, et al. Cyclops Tensor Framework: Reducing Communication and Eliminating Load Imbalance in Massively Parallel Contractions, 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.
[26] E. M. Stoudenmire, et al. The ITensor Software Library for Tensor Network Calculations, 2020, SciPost Physics Codebases.
[27] Marcos K. Aguilera, et al. AIFM: High-Performance, Application-Integrated Far Memory, 2020, OSDI.
[28] Robert J. Harrison, et al. Distributed-memory multi-GPU block-sparse tensor contraction for electronic structure, 2020, 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[29] Hans-Joachim Werner, et al. Parallel and Low-Order Scaling Implementation of Hartree-Fock Exchange Using Local Density Fitting, 2016, Journal of chemical theory and computation.
[30] Dong Li, et al. Runtime Data Management on Non-Volatile Memory-based Heterogeneous Memory for Task-Parallel Programs, 2018, SC18: International Conference for High Performance Computing, Networking, Storage and Analysis.
[31] Samuel Williams, et al. Auto-tuning performance on multicore computers, 2008.
[32] Dong Li, et al. MD-HM: memoization-based molecular dynamics simulations on big memory system, 2021, ICS.
[33] Justus A. Calvin, et al. Massively Parallel Implementation of Explicitly Correlated Coupled-Cluster Singles and Doubles Using TiledArray Framework, 2016, The journal of physical chemistry. A.
[34] Bora Uçar, et al. Parallel Candecomp/Parafac Decomposition of Sparse Tensors Using Dimension Trees, 2018, SIAM J. Sci. Comput.
[35] Evgeny Epifanovsky, et al. A General Sparse Tensor Framework for Electronic Structure Theory, 2017, Journal of chemical theory and computation.
[36] Richard W. Vuduc, et al. Optimizing Sparse Tensor Times Matrix on Multi-core and Many-Core Architectures, 2016, 2016 6th Workshop on Irregular Applications: Architecture and Algorithms (IA3).
[37] Sriram Krishnamoorthy, et al. Optimizing Tensor Contractions in CCSD(T) for Efficient Execution on GPUs, 2018, ICS.
[38] Anand D. Sarwate, et al. A Unified Optimization Approach for Sparse Tensor Operations on GPUs, 2017, 2017 IEEE International Conference on Cluster Computing (CLUSTER).
[39] Taesoo Kim, et al. Recipe: converting concurrent DRAM indexes to persistent-memory indexes, 2019, SOSP.
[40] Jimeng Sun, et al. Model-Driven Sparse CP Decomposition for Higher-Order Tensors, 2017, 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[41] S. Hirata. Tensor Contraction Engine: Abstraction and Automated Parallel Implementation of Configuration-Interaction, Coupled-Cluster, and Many-Body Perturbation Theories, 2003.
[42] Sriram Krishnamoorthy, et al. Performance optimization of tensor contraction expressions for many-body methods in quantum chemistry, 2009, The journal of physical chemistry. A.
[43] David E. Bernholdt, et al. Automatic code generation for many-body electronic structure methods: the tensor contraction engine, 2006.
[44] Karsten Schwan, et al. Data tiering in heterogeneous memory systems, 2016, EuroSys.
[45] Andrzej Cichocki, et al. Era of Big Data Processing: A New Approach via Tensor Networks and Tensor Decompositions, 2014, ArXiv.
[46] Dong Li, et al. Processing-in-Memory for Energy-Efficient Neural Network Training: A Heterogeneous Approach, 2018, 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[47] Sriram Krishnamoorthy, et al. An efficient mixed-mode representation of sparse tensors, 2019, SC.
[48] Zi Yan, et al. Nimble Page Management for Tiered Memory Systems, 2019, ASPLOS.
[49] Jeffery S. Boschen, et al. NWChem: Past, present, and future, 2020, The Journal of chemical physics.
[50] G. Karypis, et al. A Medium-Grained Algorithm for Distributed Sparse Tensor Factorization, 2016.
[51] T. Crawford, et al. An Introduction to Coupled Cluster Theory for Computational Chemists, 2007.
[52] Jie Liu, et al. Performance Analysis and Characterization of Training Deep Learning Models on Mobile Device, 2019, 2019 IEEE 25th International Conference on Parallel and Distributed Systems (ICPADS).
[53] Anima Anandkumar, et al. Tensor Contractions with Extended BLAS Kernels on CPU and GPU, 2016, 2016 IEEE 23rd International Conference on High Performance Computing (HiPC).
[54] Woongki Baek, et al. Design and implementation of bandwidth-aware memory placement and migration policies for heterogeneous memory systems, 2017, ICS '17.
[55] Sriram Krishnamoorthy, et al. A Code Generator for High-Performance Tensor Contractions on GPUs, 2019, 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[56] Thomas F. Wenisch, et al. Thermostat: Application-transparent Page Management for Two-tiered Main Memory, 2017, ASPLOS.
[57] Steven Swanson, et al. An Empirical Guide to the Behavior and Use of Scalable Persistent Memory, 2019, FAST.
[58] Minjia Zhang, et al. HM-ANN: Efficient Billion-Point Nearest Neighbor Search on Heterogeneous Memory, 2020, NeurIPS.
[59] Shoaib Kamil, et al. The tensor algebra compiler, 2017, Proc. ACM Program. Lang.
[60] Jimeng Sun, et al. Efficient and effective sparse tensor reordering, 2019, ICS.
[61] Nikos D. Sidiropoulos, et al. SPLATT: Efficient and Parallel Sparse Tensor-Matrix Multiplication, 2015, 2015 IEEE International Parallel and Distributed Processing Symposium.
[62] Yogish Sabharwal, et al. On Optimizing Distributed Tucker Decomposition for Sparse Tensors, 2018, ICS.
[63] Dong Li, et al. Fast, flexible, and comprehensive bug detection for persistent memory programs, 2021, ASPLOS.
[64] Pavan Balaji, et al. Inspector-Executor Load Balancing Algorithms for Block-Sparse Tensor Contractions, 2013, 2013 42nd International Conference on Parallel Processing.
[65] Paolo Bientinesi, et al. The landscape of software for tensor computations, 2021, ArXiv.
[66] Jin Xiong, et al. Exploiting Program Semantics to Place Data in Hybrid Memory, 2015, 2015 International Conference on Parallel Architecture and Compilation (PACT).
[67] Olatunji Ruwase, et al. ZeRO-Offload: Democratizing Billion-Scale Model Training, 2021, USENIX ATC.
[68] Dong Li, et al. Unimem: Runtime Data Management on Non-Volatile Memory-based Heterogeneous Main Memory, 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.
[69] Sriram Krishnamoorthy, et al. A framework for load balancing of Tensor Contraction expressions via dynamic task partitioning, 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[70] T. Esslinger. Fermi-Hubbard Physics with Atoms in an Optical Lattice, 2010, 1007.0012.
[71] Jiajia Li, et al. Sparta: high-performance, element-wise sparse tensor contraction on heterogeneous memory, 2021, PPoPP.
[72] Nikos D. Sidiropoulos, et al. Tensor Decomposition for Signal Processing and Machine Learning, 2016, IEEE Transactions on Signal Processing.
[73] Jimeng Sun, et al. Marble: high-throughput phenotyping from electronic health records via sparse nonnegative tensor factorization, 2014, KDD.
[74] Ryousei Takano, et al. RAMinate: Hypervisor-based Virtualization for Hybrid Main Memory Systems, 2016, SoCC.
[75] Yuan Yu, et al. TensorFlow: A system for large-scale machine learning, 2016, OSDI.
[76] Maja Pantic, et al. TensorLy: Tensor Learning in Python, 2016, J. Mach. Learn. Res.
[77] Devin Matthews, et al. High-Performance Tensor Contraction without BLAS, 2016, ArXiv.
[78] Dimitrios S. Nikolopoulos, et al. RIANN: Real-time Incremental Learning with Approximate Nearest Neighbor on Mobile Devices, 2020, OpML.