Parallel asynchronous matrix multiplications for a distributed pipelined neural network