PMSS: A programmable memory system and scheduler for complex memory patterns

The HPC industry demands more computing units on FPGAs to improve performance through task and data parallelism. FPGAs can deliver their best performance on certain kernels by customizing the hardware for the application. However, applications are becoming more complex, with multiple kernels and complex data arrangements, which generates overhead when scheduling and managing system resources. For this reason, all classes of multi-threaded machines, from minicomputers to supercomputers, require an efficient hardware scheduler and memory manager that improves the effective bandwidth and latency of the DRAM main memory. Such an architecture could be a very competitive choice for supercomputing systems, meeting the parallelism demands of HPC benchmarks. In this article, we propose a Programmable Memory System and Scheduler (PMSS), which gives multi-threaded architectures high-speed access to complex data patterns. The proposed PMSS is implemented and tested on a Xilinx ML505 evaluation FPGA board. Its performance is compared with that of a MicroBlaze microprocessor-based system integrated with the Xilkernel operating system. Results show that the modified PMSS-based multi-accelerator system consumes 50% fewer hardware resources and 32% less on-chip power, and achieves approximately a 19x speedup compared to the MicroBlaze-based system.
