A high-performance, low-power linear algebra core
Ardavan Pedram | Andreas Gerstlauer | Robert A. van de Geijn
[1] Michael Parker. High-performance floating-point implementation using FPGAs, 2009, MILCOM 2009 - 2009 IEEE Military Communications Conference.
[2] A. Alvandpour, et al. A 6.2-GFlops Floating-Point Multiply-Accumulator With Conditional Normalization, 2006, IEEE Journal of Solid-State Circuits.
[3] Hyesoon Kim, et al. An integrated GPU power and performance model, 2010, ISCA.
[4] Mitsuhisa Sato, et al. Design and Power Performance Evaluation of On-Chip Memory Processor with Arithmetic Accelerators, 2008, 2008 International Workshop on Innovative Architecture for Future Generation High-Performance Processors and Systems.
[5] Michael J. Schulte, et al. Low-Power Multiple-Precision Iterative Floating-Point Multiplier with SIMD Support, 2009, IEEE Transactions on Computers.
[6] Georgi Kuzmanov, et al. Floating-Point Matrix Multiplication in a Polymorphic Processor, 2007, 2007 International Conference on Field-Programmable Technology.
[7] Andreas Moshovos, et al. Demystifying GPU microarchitecture through microbenchmarking, 2010, 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS).
[8] James Demmel, et al. Benchmarking GPUs to tune dense linear algebra, 2008, SC '08 - International Conference for High Performance Computing, Networking, Storage and Analysis.
[9] Anand Raghunathan, et al. Power analysis of system-level on-chip communication architectures, 2004, International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS 2004).
[10] Ryan W. Apperson, et al. AsAP: An Asynchronous Array of Simple Processors, 2008, IEEE Journal of Solid-State Circuits.
[11] Nicolai Petkov, et al. Hyper-systolic matrix multiplication, 1998, Parallel Computing.
[12] E. E. Swartzlander, et al. Floating-Point Fused Multiply-Add Architectures, 2007, 2007 Conference Record of the Forty-First Asilomar Conference on Signals, Systems and Computers.
[13] Pat Hanrahan, et al. Understanding the efficiency of GPU algorithms for matrix-matrix multiplication, 2004, Graphics Hardware.
[14] Jack Dongarra, et al. LAPACK Users' Guide (third ed.), 1999.
[15] Ken Smits, et al. Penryn: 45-nm next generation Intel® Core™ 2 processor, 2007, 2007 IEEE Asian Solid-State Circuits Conference.
[16] Jack J. Dongarra, et al. Automatically Tuned Linear Algebra Software, 1998, Proceedings of the IEEE/ACM SC98 Conference.
[17] Norman P. Jouppi, et al. CACTI 3.0: an integrated cache timing, power, and area model, 2001.
[18] S. Borkar, et al. An 80-Tile Sub-100-W TeraFLOPS Processor in 65-nm CMOS, 2008, IEEE Journal of Solid-State Circuits.
[19] Sanjay J. Patel, et al. Rigel: an architecture and scalable programming interface for a 1000-core accelerator, 2009, ISCA '09.
[20] Jack J. Dongarra, et al. A set of level 3 basic linear algebra subprograms, 1990, TOMS.
[21] Samuel Williams, et al. The potential of the Cell processor for scientific computing, 2005, CF '06.
[22] Sriram R. Vangal, et al. A 90mW/GFlop 3.4GHz Reconfigurable Fused/Continuous Multiply-Accumulator for Floating-Point and Integer Operands in 65nm, 2010, 2010 23rd International Conference on VLSI Design.
[23] Jung Ho Ahn, et al. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures, 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[24] Mark Horowitz, et al. Energy-Efficient Floating-Point Unit Design, 2011, IEEE Transactions on Computers.
[25] Stefania Perri, et al. A matrix product accelerator for field programmable systems on chip, 2008, Microprocessors and Microsystems.
[26] Ed Anderson, et al. LAPACK Users' Guide, 1995.
[27] Bruce Hendrickson, et al. The Torus-Wrap Mapping for Dense Matrix Calculations on Massively Parallel Computers, 1994, SIAM Journal on Scientific Computing.
[28] P. T. Wolkotte, et al. Energy Model of Networks-on-Chip and a Bus, 2005, 2005 International Symposium on System-on-Chip.
[29] Satoru Yamamoto, et al. FPGA-Array with Bandwidth-Reduction Mechanism for Scalable and Power-Efficient Numerical Simulations Based on Finite Difference Methods, 2010, TRETS.
[30] M. J. Prelle, et al. Performance of a Multicore Matrix Multiplication Library, 2007.
[31] Andreas Gerstlauer, et al. Towards a High-Performance, Low-Power Linear Algebra Processor, 2010.
[32] Jack Dongarra, et al. An Improved MAGMA GEMM for Fermi GPUs, 2010.
[33] Christoforos E. Kozyrakis, et al. Understanding sources of inefficiency in general-purpose chips, 2010, ISCA.
[34] Brett M. Bode, et al. Performance analysis of memory transfers and GEMM subroutines on NVIDIA Tesla GPU cluster, 2009, 2009 IEEE International Conference on Cluster Computing and Workshops.
[35] Robert A. van de Geijn, et al. High-performance implementation of the level-3 BLAS, 2008, TOMS.
[36] Margaret Martonosi, et al. Wattch: a framework for architectural-level power analysis and optimizations, 2000, Proceedings of the 27th International Symposium on Computer Architecture.
[37] Robert A. van de Geijn, et al. Anatomy of high-performance matrix multiplication, 2008, TOMS.
[38] Viktor K. Prasanna, et al. Scalable and Modular Algorithms for Floating-Point Matrix Multiplication on Reconfigurable Computing Systems, 2007, IEEE Transactions on Parallel and Distributed Systems.
[39] Bo Kågström, et al. GEMM-based level 3 BLAS: high-performance model implementations and performance evaluation benchmark, 1998, TOMS.
[40] Jack Dongarra, et al. ScaLAPACK: a scalable linear algebra library for distributed memory concurrent computers, 1992, Proceedings of the Fourth Symposium on the Frontiers of Massively Parallel Computation.