Reconfigurable Multi-core Architecture -- A Plausible Solution to the Von Neumann Performance Bottleneck

The ill-famed von Neumann bottleneck has been the main performance hurdle since the invention of computers. Although several techniques such as separate data/instruction caches, branch prediction, and parallel computing have been proposed and improved efficiency, the throughput bottleneck between CPU and memory is still very much there. We propose a novel reconfigurable multi-core architecture (RMA) to address this issue via the dynamic allocation of heterogeneous computing resources and distributed memory. We show how this is feasible with the state-of-the-art technologies of dynamic partial reconfiguration of hardware resources and runtime operating system configuration. Experiments and analysis show how RMA alleviates the performance bottleneck.

[1]  Dong Li,et al.  The tradeoffs of fused memory hierarchies in heterogeneous computing architectures , 2012, CF '12.

[2]  Chris Fallin,et al.  Parallel application memory scheduling , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[3]  Pedro C. Diniz,et al.  Compiler reuse analysis for the mapping of data in FPGAs with RAM blocks , 2004, Proceedings. 2004 IEEE International Conference on Field- Programmable Technology (IEEE Cat. No.04EX921).

[4]  Nikil D. Dutt,et al.  On-chip vs. off-chip memory: the data partitioning problem in embedded processor-based systems , 2000, TODE.

[5]  Ann Gordon-Ross,et al.  An application classification guided cache tuning heuristic for multi-core architectures , 2012, 17th Asia and South Pacific Design Automation Conference.

[6]  Gang Wang,et al.  Data Partitioning for Reconfigurable Architectures with Distributed Block RAM , 2005 .

[7]  Maya Gokhale,et al.  Automatic allocation of arrays to memories in FPGA processors with multiple memory banks , 1999, Seventh Annual IEEE Symposium on Field-Programmable Custom Computing Machines (Cat. No.PR00375).

[8]  Nikil D. Dutt,et al.  Automatic tuning of two-level caches to embedded applications , 2004, Proceedings Design, Automation and Test in Europe Conference and Exhibition.

[9]  Pedro C. Diniz,et al.  Automatic synthesis of data storage and control structures for FPGA-based computing engines , 2000, Proceedings 2000 IEEE Symposium on Field-Programmable Custom Computing Machines (Cat. No.PR00871).

[10]  Kevin Kai-Wei Chang,et al.  Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).