Paper information - Sparse Tensor Factorization on Many-Core Processors with High-Bandwidth Memory

HPC systems are increasingly used for data-intensive computations that exhibit irregular memory accesses, non-uniform work distributions, large memory footprints, and high memory bandwidth demands. To meet these demands, HPC systems are turning to many-core architectures that feature a large number of energy-efficient cores backed by high-bandwidth memory. These features are exemplified in Intel's recent Knights Landing many-core processor (KNL), which typically has 68 cores and 16 GB of on-package multi-channel DRAM (MCDRAM). This work investigates how the novel architectural features offered by KNL can be used to decompose sparse, unstructured tensors using the canonical polyadic decomposition (CPD). The CPD is used extensively to analyze large multi-way datasets arising in areas including precision healthcare, cybersecurity, and e-commerce. To this end, we (i) develop problem decompositions for the CPD that are amenable to hundreds of concurrent threads while maintaining load balance and low synchronization costs; and (ii) explore the utilization of architectural features such as MCDRAM. Using one KNL processor, our algorithm achieves up to 1.8x speedup over a dual-socket Intel Xeon system with 44 cores.
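For readers unfamiliar with the CPD, the following minimal sketch illustrates the model the abstract refers to: a rank-R CPD approximates each entry of a tensor as a sum of R products of factor-matrix entries, one factor matrix per mode. The function names (`cpd_entry`, `squared_error_at_nonzeros`), the coordinate-list storage, and the toy data are assumptions made for illustration and are not taken from the paper; the paper's parallel algorithms and data structures are not reproduced here.

```python
# Illustrative sketch only: hypothetical helper names and toy data,
# not the implementation described in the paper.

def cpd_entry(factors, index):
    """Value a rank-R CPD model predicts at one tensor coordinate.

    factors: one matrix per mode, each a list of rows of length R.
    index:   tuple with one coordinate per mode.
    """
    rank = len(factors[0][0])
    total = 0.0
    for r in range(rank):
        term = 1.0
        for mode, i in enumerate(index):
            term *= factors[mode][i][r]
        total += term
    return total


def squared_error_at_nonzeros(nonzeros, factors):
    """Sum of squared residuals over the stored nonzeros only.

    Simplification: the full CPD objective also counts the implicit
    zero entries, which this toy measure ignores.
    """
    return sum((val - cpd_entry(factors, idx)) ** 2 for idx, val in nonzeros)


# A 2x2x2 sparse tensor in coordinate form and an exact rank-2 model of it.
A = [[1.0, 0.0], [0.0, 2.0]]
B = [[1.0, 0.0], [0.0, 1.0]]
C = [[3.0, 0.0], [0.0, 1.0]]
nonzeros = [((0, 0, 0), 3.0), ((1, 1, 1), 2.0)]

print(cpd_entry([A, B, C], (0, 0, 0)))
print(squared_error_at_nonzeros(nonzeros, [A, B, C]))
```

In this toy case the three factor matrices reproduce both stored nonzeros exactly, so the residual at the nonzeros is zero. Real sparse tensors have many nonzeros spread unevenly across slices, and that unevenness is the source of the load-balance concern the abstract raises.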

George Karypis | Shaden Smith | Jongsoo Park
