High-throughput, energy-efficient network-on-chip-based hardware accelerators

Abstract: Several emerging application domains in scientific computing demand high computation throughput to achieve terascale or higher performance. Dedicated centers hosting scientific computing tools on a few high-end servers could rely on hardware accelerator co-processors that contain multiple lightweight custom cores interconnected through an on-chip network. With increasing workloads, these many-core platforms need to deliver high overall computation throughput while remaining energy-efficient. Conventional multicore architectures achieve only limited computational throughput because of the inherently multi-hop nature of the on-chip network infrastructure. Inserting long-range links that act as shortcuts in a regular network-on-chip (NoC) architecture can significantly enhance both the achievable bandwidth and the energy efficiency of a multicore platform. In this paper, we first propose a NoC-driven use-case model for throughput-oriented scientific applications. We then use the model to study how long-range links, combined with different resource allocation strategies, reduce overall on-chip communication and enhance computational throughput. The study explores NoCs with both wired and on-chip wireless links. We also evaluate our NoC-based platforms with respect to energy efficiency and power consumption, and analyze how throughput and power consumption correlate with the statistical properties of the application traffic. In addition, we compare and analyze chip-level thermal profiles for these alternatives. Our experiments using kernels from a popular phylogenetic inference application suite show that we can deliver computation throughput of over 10^11 operations per second while consuming ∼0.5 nJ per operation and keeping on-chip temperature variation within 26 °C.
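The effect of long-range shortcut links on a multi-hop network can be illustrated with a small sketch. The code below is not the paper's model or its link-placement method: the 8x8 mesh size, the two corner-to-corner shortcuts, and all function names are assumptions chosen only for illustration. It compares the average hop count between distinct cores with and without the shortcuts, using average hop count as a rough proxy for communication cost.

```python
from collections import deque
from itertools import product


def mesh_adjacency(n):
    """Build the adjacency map of an n x n mesh; nodes are (row, col) tuples."""
    adj = {(r, c): set() for r, c in product(range(n), repeat=2)}
    for r, c in product(range(n), repeat=2):
        # Link each node to its right and lower neighbour (bidirectional).
        for dr, dc in ((0, 1), (1, 0)):
            nr, nc = r + dr, c + dc
            if nr < n and nc < n:
                adj[(r, c)].add((nr, nc))
                adj[(nr, nc)].add((r, c))
    return adj


def add_links(adj, links):
    """Return a copy of adj with extra bidirectional links added."""
    new_adj = {node: set(nbrs) for node, nbrs in adj.items()}
    for a, b in links:
        new_adj[a].add(b)
        new_adj[b].add(a)
    return new_adj


def average_hops(adj):
    """Average shortest-path hop count over all ordered pairs of distinct nodes."""
    total = 0
    pairs = 0
    for src in adj:
        # Breadth-first search gives shortest hop counts from src.
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


if __name__ == "__main__":
    n = 8  # assumed mesh size, for illustration only
    mesh = mesh_adjacency(n)
    # Hypothetical shortcut placement: two corner-to-corner links.
    shortcuts = [((0, 0), (n - 1, n - 1)), ((0, n - 1), (n - 1, 0))]
    with_shortcuts = add_links(mesh, shortcuts)
    print("plain mesh    :", average_hops(mesh))
    print("with shortcuts:", average_hops(with_shortcuts))
```

For the plain 8x8 mesh the average is 16/3, about 5.33 hops. Adding links can only shorten paths, so the second figure is lower. How much lower depends on where the links are placed and on the traffic pattern, which is the kind of question the study addresses.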
