Intrepydd: performance, productivity, and portability for data science application kernels
Jun Shirako | Thomas M. Conte | Richard Vuduc | Tong Zhou | Vivek Sarkar | Anirudh Jain | Sriseshan Srikanth
[1] Artsiom Ablavatski,et al. Two-Pass Softmax Algorithm , 2020, 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).
[2] Stanley C. Eisenstat,et al. Yale sparse matrix package I: The symmetric codes , 1982 .
[3] Michael Lange,et al. Devito: Towards a Generic Finite Difference DSL Using Symbolic Python , 2016, 2016 6th Workshop on Python for High-Performance and Scientific Computing (PyHPC).
[4] Bradley C. Kuszmaul,et al. Cilk: an efficient multithreaded runtime system , 1995, PPOPP '95.
[5] Michael I. Jordan,et al. Ray: A Distributed Framework for Emerging AI Applications , 2017, OSDI.
[6] Michel Steuwer,et al. LIFT: A functional data-parallel IR for high-performance GPU code generation , 2017, 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[7] Marat Dukhan. Indirect Deconvolution Algorithm , 2020, 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).
[8] Franz Franchetti,et al. Mathematical foundations of the GraphBLAS , 2016, 2016 IEEE High Performance Extreme Computing Conference (HPEC).
[9] Thomas M. Conte,et al. Tackling memory access latency through DRAM row management , 2018, MEMSYS.
[10] Siu Kwan Lam,et al. Numba: a LLVM-based Python JIT compiler , 2015, LLVM '15.
[11] Richard W. Vuduc,et al. A Microbenchmark Characterization of the Emu Chick , 2018, Parallel Comput..
[12] Carole-Jean Wu,et al. Machine Learning at Facebook: Understanding Inference at the Edge , 2019, 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[13] Razvan Pascanu,et al. Theano: A CPU and GPU Math Compiler in Python , 2010, SciPy.
[14] Thomas Kluyver,et al. Jupyter Notebooks - a publishing format for reproducible computational workflows , 2016, ELPUB.
[15] Hongbo Rong,et al. Sparso: Context-driven optimizations of sparse linear algebra , 2016, 2016 International Conference on Parallel Architecture and Compilation Techniques (PACT).
[16] Saman P. Amarasinghe,et al. A Common Runtime for High Performance Data Analysis , 2017, CIDR.
[17] Matthew B. Dwyer,et al. Towards Self-Verification in Finite Difference Code Generation , 2017, CORRECTNESS@SC.
[18] Vivek Sarkar,et al. Optimal weighted loop fusion for parallel programs , 1997, SPAA '97.
[19] Richard W. Vuduc,et al. An Initial Characterization of the Emu Chick , 2018, 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).
[20] Thomas M. Conte,et al. Rebooting Computing: The Road Ahead , 2017, Computer.
[21] Robert A. van de Geijn,et al. BLIS: A Framework for Rapidly Instantiating BLAS Functionality , 2015, ACM Trans. Math. Softw..
[22] Vivek Sarkar,et al. A Preliminary Study of Compiler Transformations for Graph Applications on the Emu System , 2018, MCHPC@SC.
[23] Gaël Varoquaux,et al. The NumPy Array: A Structure for Efficient Numerical Computation , 2011, Computing in Science & Engineering.
[24] Håkan Ardö,et al. Loop-aware optimizations in PyPy's tracing JIT , 2012, DLS '12.
[25] Mehmet Deveci,et al. Performance-Portable Sparse Matrix-Matrix Multiplication for Many-Core Architectures , 2017, 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).
[26] Shoaib Kamil,et al. The tensor algebra compiler , 2017, Proc. ACM Program. Lang..
[27] Jeanine Cook,et al. MetaStrider , 2019, ACM Trans. Archit. Code Optim..
[28] Markus Püschel,et al. A Basic Linear Algebra Compiler , 2014, CGO '14.
[29] Anders Logg,et al. The FEniCS Project Version 1.5 , 2015 .
[30] Jun Yang,et al. FusionStitching: Deep Fusion and Code Generation for Tensorflow Computations on GPUs , 2018, ArXiv.
[31] Bradford L. Chamberlain,et al. Parallel Programmability and the Chapel Language , 2007, Int. J. High Perform. Comput. Appl..
[32] Endong Wang,et al. Intel Math Kernel Library , 2014 .
[33] Andrew C. Rice,et al. Verifying spatial properties of array computations , 2017, Proc. ACM Program. Lang..
[34] Field G. Van Zee,et al. BLIS: A Framework for Rapidly Instantiating BLAS Functionality , 2015 .
[35] M Mernik,et al. When and how to develop domain-specific languages , 2005, CSUR.
[36] Alan Edelman,et al. Julia: A Fast Dynamic Language for Technical Computing , 2012, ArXiv.
[37] Jack J. Dongarra,et al. Automatically Tuned Linear Algebra Software , 1998, Proceedings of the IEEE/ACM SC98 Conference.
[38] Vivek Sarkar,et al. X10: an object-oriented approach to non-uniform cluster computing , 2005, OOPSLA '05.
[39] John Salvatier,et al. Theano: A Python framework for fast computation of mathematical expressions , 2016, ArXiv.
[40] Stefan Behnel,et al. Cython: The Best of Both Worlds , 2011, Computing in Science & Engineering.
[41] Vivek Sarkar,et al. A Composable Deadlock-Free Approach to Object-Based Isolation , 2015, Euro-Par.
[42] Steve Plimpton,et al. FireHose Streaming Benchmarks , 2015 .
[43] Qian Wang,et al. AUGEM: Automatically generate high performance Dense Linear Algebra kernels on x86 CPUs , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[44] Timothy A. Davis,et al. Graph algorithms via SuiteSparse: GraphBLAS: triangle counting and K-truss , 2018, 2018 IEEE High Performance Extreme Computing Conference (HPEC).
[45] William F. Tinney,et al. Techniques for Exploiting the Sparsity of the Network Admittance Matrix , 1963 .
[46] Frédo Durand,et al. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines , 2013, PLDI 2013.
[47] Ken Kennedy,et al. Telescoping Languages: A Strategy for Automatic Generation of Scientific Problem-Solving Systems from Annotated Libraries , 2001, J. Parallel Distributed Comput..
[48] C. Pipper,et al. ["R"--project for statistical computing]. , 2008, Ugeskrift for laeger.