Prototyping hardware support for irregular applications

FPGA platforms built from off-the-shelf soft cores have recently emerged as one of the most promising fast-prototyping approaches for designing, evaluating, and validating new architectural components for multi- and many-core processors. The approach appears to offer valuable benefits: optimizations to complex designs can be evaluated directly in hardware, at speeds hundreds of times faster than simulation, with effort apparently limited to developing the new components. However, current FPGA toolchains that allow quick deployment of system-on-chip designs still struggle with multiprocessor designs, and significant effort is often required to work around their limitations. In this paper we discuss the design of a multi-node FPGA prototype, developed with the Xilinx toolchain, for exploring components that optimize multi- and many-core processors for the execution of irregular applications. Irregular applications, such as data mining and social network analysis, employ large, pointer-based data structures (graphs, unbalanced trees, unstructured grids) that exhibit poor locality and are very difficult to partition. Commodity clusters, which integrate powerful multi-core cache-based processors, are optimized for locality and employ distributed-memory programming models. Developing irregular applications on them is complex, and the resulting performance often fails to scale. We designed a set of hardware/software components that can potentially enhance commodity processors for efficiently executing irregular applications on multi-node systems, and we integrated and validated them through FPGA rapid prototyping. We present the components and the prototype, highlighting the benefits and challenges of using such an approach for architectural studies.
We also present an initial study of the tradeoffs of the platform, showing how prototyping can be effective while underlining the aspects of the toolchain that still need to improve to allow better and deeper analysis.
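
To make the notion of poor locality concrete, the following minimal sketch contrasts a regular array sweep with a breadth-first traversal of a small synthetic graph and reports the mean distance between consecutively accessed indices. It is illustrative only and is not taken from the paper: the graph generator, the function names (build_graph, bfs_access_order, mean_stride), and the index-distance metric are all assumptions chosen for brevity.

```python
from collections import deque


def build_graph(n):
    """Hypothetical synthetic graph: each vertex points to two scattered vertices."""
    return [[(7 * v + 3) % n, (13 * v + 5) % n] for v in range(n)]


def bfs_access_order(adj, source=0):
    """Return the sequence of vertex indices whose data a BFS touches."""
    visited = [False] * len(adj)
    visited[source] = True
    order = [source]
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            order.append(w)  # every neighbor check reads visited[w]
            if not visited[w]:
                visited[w] = True
                queue.append(w)
    return order


def mean_stride(order):
    """Average absolute index distance between consecutive accesses."""
    jumps = [abs(b - a) for a, b in zip(order, order[1:])]
    return sum(jumps) / len(jumps)


n = 1024
regular = list(range(n))                      # sequential sweep over an array
irregular = bfs_access_order(build_graph(n))  # pointer-chasing traversal

print("regular sweep mean stride:", mean_stride(regular))
print("graph traversal mean stride:", mean_stride(irregular))
```

The sequential sweep has a mean stride of exactly 1, whereas the traversal jumps across the index space. Under the assumptions of this sketch, that scattered access pattern is what defeats caches and makes the data hard to partition across nodes.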
