论文信息 - Implementation and Analysis of Block Dense Matrix Decomposition on Network-on-Chips

Implementation and Analysis of Block Dense Matrix Decomposition on Network-on-Chips

The decomposition of a dense matrix into lower and upper triangular matrices is an important linear algebra kernel that used in scientific and engineering applications. To decompose large matrices efficiently, the matrix is divided into sub-matrices as blocks. The block matrix decomposition is introduced for parallel hardware platforms, e.g. supercomputers, multicore processors and GPUs. Recently, the Network-on-Chip (NoC) paradigm is proposed as a promising multicore architecture for future Chip Multiprocessors (CMPs) with hundreds or even thousands of cores. The communication bottleneck of traditional bus or crossbar based on-chip interconnect is alleviated in the NoC architecture. However, the implementation and analysis of parallel block matrix decomposition in a NoC platform has not been well addressed. We design an NoC platform based on state-of-the-art systems. A block matrix decomposition algorithm is implemented on the NoC platform. Evaluation results are presented using a cycle accurate full system simulator. We achieve parallel efficiency of 74.8% with a 64-node NoC, which outperforms other three multiprocessor systems (30.5%, 67% and 50% respectively). We also analyzed the impact of block size, cache behavior and network pressure of the platform.

[1] Marco Aurélio Cavalcanti Pacheco,et al. LU Decomposition on GPUs: The Impact of Memory Access , 2010, 2010 22nd International Symposium on Computer Architecture and High Performance Computing Workshops.

[2] Pat Hanrahan,et al. Understanding the efficiency of GPU algorithms for matrix-matrix multiplication , 2004, Graphics Hardware.

[3] Siamak Mohammadi,et al. Adaptive Input-Output Selection Based On-Chip Router Architecture , 2012, J. Low Power Electron..

[4] Jack Dongarra,et al. A Class of Hybrid LAPACK Algorithms for Multicore and GPU Architectures , 2011, 2011 Symposium on Application Accelerators in High-Performance Computing.

[5] Dinesh Manocha,et al. LU-GPU: Efficient Algorithms for Solving Dense Linear Systems on Graphics Hardware , 2005, ACM/IEEE SC 2005 Conference (SC'05).

[6] Jack J. Dongarra,et al. High performance matrix inversion based on LU factorization for multicore architectures , 2011, MTAGS '11.

[7] Ninghui Sun,et al. Fast implementation of DGEMM on Fermi GPU , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[8] Jack J. Dongarra,et al. Dense linear algebra solvers for multicore with GPU accelerators , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW).

[9] Nitin Chandrachoodan,et al. FPGA-Based High-Performance and Scalable Block LU Decomposition Architecture , 2012, IEEE Transactions on Computers.

[10] Laxmikant V. Kalé,et al. Mapping Dense LU Factorization on Multicore Supercomputer Nodes , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.

[11] Kanad Ghose,et al. Energy-efficient MESI cache coherence with pro-active snoop filtering for multicore microprocessors , 2008, Proceeding of the 13th international symposium on Low power electronics and design (ISLPED '08).

[12] Axel Jantsch,et al. Network on Chip : An architecture for billion transistor era , 2000 .

[13] Jack J. Dongarra,et al. Performance Study of LU Factorization with Low Communication Overhead on Multiprocessors , 1995, Parallel Process. Lett..

[14] James Demmel,et al. Benchmarking GPUs to tune dense linear algebra , 2008, HiPC 2008.

[15] Fredrik Larsson,et al. Simics: A Full System Simulation Platform , 2002, Computer.

[16] Jack Dongarra,et al. LAPACK: a portable linear algebra library for high-performance computers , 1990, SC.

[17] Marcelo Yuffe,et al. A fully integrated multi-CPU, GPU and memory controller 32nm processor , 2011, 2011 IEEE International Solid-State Circuits Conference.

[18] Marc Tremblay,et al. A Third-Generation 65nm 16-Core 32-Thread Plus 32-Scout-Thread CMT SPARC® Processor , 2008, 2008 IEEE International Solid-State Circuits Conference - Digest of Technical Papers.

[19] John L. Hennessy,et al. The performance advantages of integrating block data transfer in cache-coherent multiprocessors , 1994, ASPLOS VI.

[20] Yong Dou,et al. A High Performance and Memory Efficient LU Decomposer on FPGAs , 2012, IEEE Transactions on Computers.

[21] David B. Davidson,et al. GPU-based LU decomposition for large method of moments problems , 2010 .

[22] Anoop Gupta,et al. The Stanford Dash multiprocessor , 1992, Computer.

[23] Hannu Tenhunen,et al. Memory-Efficient On-Chip Network With Adaptive Interfaces , 2012, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[24] Wolfgang E. Nagel,et al. Comparing cache architectures and coherency protocols on x86-64 multicore SMP systems , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[25] E. Fluhr,et al. Design and Implementation of the POWER6 Microprocessor , 2008, IEEE Journal of Solid-State Circuits.

[26] Doug Burger,et al. An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches , 2002, ASPLOS X.