A novel access pattern-based multi-core memory architecture

High-Performance Computing (HPC) applications increasingly run on heterogeneous multi-core platforms. The main reasons for the growing popularity of these architectures are their low power consumption and their throughput-oriented nature. However, sustaining this throughput requires that data be supplied to the multi-core system at a correspondingly high rate. This makes efficient management of on-chip and off-chip memory data transfers necessary, which is a significant challenge. Complex regular and irregular memory data transfer patterns are becoming dominant in a range of application domains, including scientific computing and image and signal processing. Data accesses can be arranged in independent patterns that an efficient memory management scheme can exploit. Software-based approaches using general-purpose caches and on-chip memories are beneficial to some extent. However, data management for throughput-oriented devices could be improved by providing hardware mechanisms that exploit knowledge of access patterns in memory management and in the scheduling of accesses for a heterogeneous multi-core architecture. The focus of this thesis is to present architectural explorations for a novel access pattern-based multi-core memory architecture. The thesis covers four main aspects of the memory system: i) a uni-core memory system for regular data patterns; ii) a multi-core memory system for regular data patterns; iii) a uni-core memory system for irregular data patterns; and iv) a multi-core memory system for irregular data patterns.
