Paper information - High-throughput, energy-efficient network-on-chip-based hardware accelerators

High-throughput, energy-efficient network-on-chip-based hardware accelerators

Abstract: Several emerging application domains in scientific computing demand high computation throughputs to achieve terascale or higher performance. Dedicated centers hosting scientific computing tools on a few high-end servers could rely on hardware accelerator co-processors that contain multiple lightweight custom cores interconnected through an on-chip network. With increasing workloads, these many-core platforms need to deliver high overall computation throughput while also being energy-efficient. Conventional multicore architectures can achieve only limited computational throughput due to the inherent multi-hop nature of the on-chip network infrastructure. By inserting long-range links that act as shortcuts in a regular network-on-chip (NoC) architecture, both the achievable bandwidth and energy efficiency of a multicore platform can be significantly enhanced. In this paper, we first propose a NoC-driven use-case model for throughput-oriented scientific applications, and subsequently use the model to study how long-range links, used in conjunction with different resource allocation strategies, reduce overall on-chip communication and enhance computational throughput. NoCs with both wired and on-chip wireless links are explored in the study. We also evaluate our NoC-based platforms with respect to energy efficiency and power consumption. We analyze how throughput and power consumption are correlated with the statistical properties of the application traffic. In addition, we compare and analyze chip-level thermal profiles for these alternatives. Our experiments using kernels from a popular phylogenetic inference application suite show that we can deliver computation throughput over 10^11 operations per second, consuming ∼0.5 nJ per operation, while ensuring that on-chip temperature variation is within 26 °C.
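
The abstract's claim that shortcut links ease the multi-hop bottleneck of a regular NoC can be illustrated with a toy calculation. The sketch below is not the paper's use-case model: it assumes an 8 x 8 mesh, uses average hop count over all core pairs as a rough proxy for communication cost, and places two shortcut links at hypothetical positions chosen only for illustration. It ignores traffic patterns, wireless channel behavior, energy, and resource allocation, all of which the paper studies. The function names are also made up for this example.

```python
from collections import deque
from itertools import product


def mesh_links(n):
    """Adjacency sets for an n x n 2D mesh with nearest-neighbour links only."""
    adj = {(r, c): set() for r, c in product(range(n), repeat=2)}
    for r, c in adj:
        for dr, dc in ((0, 1), (1, 0)):
            nb = (r + dr, c + dc)
            if nb in adj:
                adj[(r, c)].add(nb)
                adj[nb].add((r, c))
    return adj


def add_shortcuts(adj, shortcuts):
    """Return a copy of the topology with extra bidirectional long-range links."""
    new = {node: set(nbs) for node, nbs in adj.items()}
    for a, b in shortcuts:
        new[a].add(b)
        new[b].add(a)
    return new


def average_hops(adj):
    """Mean shortest-path length over all ordered pairs of distinct nodes (BFS)."""
    total = 0
    pairs = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in dist:
                    dist[nb] = dist[node] + 1
                    queue.append(nb)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


mesh = mesh_links(8)
# Hypothetical shortcut placement: corner-to-corner links, for illustration only.
shortcuts = [((0, 0), (7, 7)), ((0, 7), (7, 0))]
print("plain mesh:    ", average_hops(mesh))
print("with shortcuts:", average_hops(add_shortcuts(mesh, shortcuts)))
```

Adding links can only shorten paths, so the second figure is never larger than the first. How much it drops depends heavily on where the shortcuts are placed and on the traffic, which is why a placement strategy matters.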

Ananth Kalyanaraman | Partha Pratim Pande | Turbo Majumder
