Implementation and Analysis of Block Dense Matrix Decomposition on Network-on-Chips
Tapio Salakoski | Hannu Tenhunen | Thomas Canhao Xu | Pasi Liljeberg | Tapio Pahikkala | Antti Airola | Juha Plosila
[1] Marco Aurélio Cavalcanti Pacheco, et al. LU Decomposition on GPUs: The Impact of Memory Access, 2010, 2010 22nd International Symposium on Computer Architecture and High Performance Computing Workshops.
[2] Pat Hanrahan, et al. Understanding the efficiency of GPU algorithms for matrix-matrix multiplication, 2004, Graphics Hardware.
[3] Siamak Mohammadi, et al. Adaptive Input-Output Selection Based On-Chip Router Architecture, 2012, J. Low Power Electron.
[4] Jack Dongarra, et al. A Class of Hybrid LAPACK Algorithms for Multicore and GPU Architectures, 2011, 2011 Symposium on Application Accelerators in High-Performance Computing.
[5] Dinesh Manocha, et al. LU-GPU: Efficient Algorithms for Solving Dense Linear Systems on Graphics Hardware, 2005, ACM/IEEE SC 2005 Conference (SC'05).
[6] Jack J. Dongarra, et al. High performance matrix inversion based on LU factorization for multicore architectures, 2011, MTAGS '11.
[7] Ninghui Sun, et al. Fast implementation of DGEMM on Fermi GPU, 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[8] Jack J. Dongarra, et al. Dense linear algebra solvers for multicore with GPU accelerators, 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW).
[9] Nitin Chandrachoodan, et al. FPGA-Based High-Performance and Scalable Block LU Decomposition Architecture, 2012, IEEE Transactions on Computers.
[10] Laxmikant V. Kalé, et al. Mapping Dense LU Factorization on Multicore Supercomputer Nodes, 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.
[11] Kanad Ghose, et al. Energy-efficient MESI cache coherence with pro-active snoop filtering for multicore microprocessors, 2008, Proceeding of the 13th international symposium on Low power electronics and design (ISLPED '08).
[12] Axel Jantsch, et al. Network on Chip: An architecture for billion transistor era, 2000.
[13] Jack J. Dongarra, et al. Performance Study of LU Factorization with Low Communication Overhead on Multiprocessors, 1995, Parallel Process. Lett.
[14] James Demmel, et al. Benchmarking GPUs to tune dense linear algebra, 2008, HiPC 2008.
[15] Fredrik Larsson, et al. Simics: A Full System Simulation Platform, 2002, Computer.
[16] Jack Dongarra, et al. LAPACK: a portable linear algebra library for high-performance computers, 1990, SC.
[17] Marcelo Yuffe, et al. A fully integrated multi-CPU, GPU and memory controller 32nm processor, 2011, 2011 IEEE International Solid-State Circuits Conference.
[18] Marc Tremblay, et al. A Third-Generation 65nm 16-Core 32-Thread Plus 32-Scout-Thread CMT SPARC® Processor, 2008, 2008 IEEE International Solid-State Circuits Conference - Digest of Technical Papers.
[19] John L. Hennessy, et al. The performance advantages of integrating block data transfer in cache-coherent multiprocessors, 1994, ASPLOS VI.
[20] Yong Dou, et al. A High Performance and Memory Efficient LU Decomposer on FPGAs, 2012, IEEE Transactions on Computers.
[21] David B. Davidson, et al. GPU-based LU decomposition for large method of moments problems, 2010.
[22] Anoop Gupta, et al. The Stanford Dash multiprocessor, 1992, Computer.
[23] Hannu Tenhunen, et al. Memory-Efficient On-Chip Network With Adaptive Interfaces, 2012, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.
[24] Wolfgang E. Nagel, et al. Comparing cache architectures and coherency protocols on x86-64 multicore SMP systems, 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[25] E. Fluhr, et al. Design and Implementation of the POWER6 Microprocessor, 2008, IEEE Journal of Solid-State Circuits.
[26] Doug Burger, et al. An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches, 2002, ASPLOS X.