BLASX: A High Performance Level-3 BLAS Library for Heterogeneous Multi-GPU Computing
Yi Yang | Jianxiong Xiao | Wei Wu | Linnan Wang | Yezhou Yang
[1] Robert D. Blumofe, et al. Scheduling multithreaded computations by work stealing, 1994, Proceedings 35th Annual Symposium on Foundations of Computer Science.
[2] George Bosilca, et al. Hierarchical DAG Scheduling for Hybrid Distributed Systems, 2015, 2015 IEEE International Parallel and Distributed Processing Symposium.
[3] Robert A. van de Geijn, et al. High-performance implementation of the level-3 BLAS, 2008, TOMS.
[4] Joseph Y.-T. Leung, et al. On the complexity of fixed-priority scheduling of periodic, real-time tasks, 1982, Perform. Evaluation.
[5] Trevor Darrell, et al. Caffe: Convolutional Architecture for Fast Feature Embedding, 2014, ACM Multimedia.
[6] Raimund Seidel, et al. On the All-Pairs-Shortest-Path Problem in Unweighted Undirected Graphs, 1995, J. Comput. Syst. Sci.
[7] D. Griffin, et al. Finite-Element Analysis, 1975.
[8] Geoffrey E. Hinton, et al. Learning representations by back-propagating errors, 1986, Nature.
[9] Robert A. van de Geijn, et al. SuperMatrix: a multithreaded runtime scheduling system for algorithms-by-blocks, 2008, PPoPP.
[10] Patrice Y. Simard, et al. High Performance Convolutional Neural Networks for Document Processing, 2006.
[11] Cédric Augonnet, et al. StarPU: a unified platform for task scheduling on heterogeneous multicore architectures, 2011, Concurr. Comput. Pract. Exp.
[12] Jack J. Dongarra, et al. Optimizing symmetric dense matrix-vector multiplication on GPUs, 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[13] Alex Krizhevsky, et al. Learning Multiple Layers of Features from Tiny Images, 2009.
[14] R. M. Tomasulo, et al. An efficient algorithm for exploiting multiple arithmetic units, 1995.
[15] Martin P. Bendsøe. Topology Optimization, 2009, Encyclopedia of Optimization.
[16] Francisco J. Cazorla, et al. Adapting cache partitioning algorithms to pseudo-LRU replacement policies, 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).
[17] Jack J. Dongarra, et al. A set of level 3 basic linear algebra subprograms, 1990, TOMS.
[18] Maged M. Michael, et al. Simple, fast, and practical non-blocking and blocking concurrent queue algorithms, 1996, PODC '96.
[19] Alan Jay Smith, et al. A class of compatible cache consistency protocols and their support by the IEEE futurebus, 1986, ISCA '86.
[20] Jack J. Dongarra, et al. Enabling and scaling matrix computations on heterogeneous multi-core and multi-GPU systems, 2012, ICS '12.
[21] Yi Yang, et al. Accelerating Deep Neural Network Training with Inconsistent Stochastic Gradient Descent, 2016, Neural Networks.
[22] Wei Wu, et al. Large Scale Artificial Neural Network Training Using Multi-GPUs, 2015, ArXiv.