MIAOW - An open source RTL implementation of a GPGPU

Graphic Processing Unit (GPU) based general purpose computing is developing as a viable alternative to CPU based computing in many domains. In this paper, we introduce MIAOW (Many-core Integrated Accelerator Of Wisconsin), an open source RTL implementation of the AMD Southern Islands GPGPU ISA, capable of running unmodified OpenCL-based applications. We present our design motivated by our goals to create a realistic, flexible, OpenCL compatible GPGPU, capable of emulating a full system. We demonstrate that MIAOW enables disruptive and transformative research and has the potential to bring all of the benefits of open source development to GPUs in real products in the long term.

[1]  Richard W. Vuduc,et al.  A performance analysis framework for identifying potential benefits in GPGPU applications , 2012, PPoPP '12.

[2]  Hyeran Jeon,et al.  Warped-DMR: Light-weight Error Detection for GPGPU , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[3]  Babak Falsafi,et al.  Reunion: Complexity-Effective Multicore Redundancy , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[4]  Karthikeyan Sankaralingam,et al.  MIAOW - An open source RTL implementation of a GPGPU , 2015, COOL Chips.

[5]  Burton J. Smith Architecture And Applications Of The HEP Multiprocessor Computer System , 1982, Optics & Photonics.

[6]  Kevin Skadron,et al.  Dynamic warp subdivision for integrated branch and memory divergence tolerance , 2010, ISCA.

[7]  Karthikeyan Sankaralingam,et al.  Understanding the impact of gate-level physical reliability effects on whole program execution , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).

[8]  Hyesoon Kim,et al.  An integrated GPU power and performance model , 2010, ISCA.

[9]  Josep Torrellas,et al.  ReViveI/O: efficient handling of I/O in highly-available rollback-recovery servers , 2006, The Twelfth International Symposium on High-Performance Computer Architecture, 2006..

[10]  Ben Sander,et al.  Applying AMD's Kaveri APU for heterogeneous computing , 2014, 2014 IEEE Hot Chips 26 Symposium (HCS).

[11]  David R. Kaeli,et al.  Multi2Sim: A simulation framework for CPU-GPU computing , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[12]  Paolo Bernardi,et al.  Hardware-accelerated path-delay fault grading of functional test programs for processor-based systems , 2007, GLSVLSI '07.

[13]  Hyesoon Kim,et al.  An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness , 2009, ISCA '09.

[14]  Mahmut T. Kandemir,et al.  Orchestrated scheduling and prefetching for GPGPUs , 2013, ISCA.

[15]  Sanjay J. Patel,et al.  Characterizing the effects of transient faults on a high-performance processor pipeline , 2004, International Conference on Dependable Systems and Networks, 2004.

[16]  Bin Li,et al.  Architecture comparisons between Nvidia and ATI GPUs: Computation parallelism and data communications , 2011, 2011 IEEE International Symposium on Workload Characterization (IISWC).

[17]  Karthikeyan Sankaralingam,et al.  iGPU: Exception support and speculative execution on GPUs , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[18]  Josep Torrellas,et al.  ReVive: cost-effective architectural support for rollback recovery in shared-memory multiprocessors , 2002, ISCA.

[19]  Josep Torrellas,et al.  Blueshift: Designing processors for timing speculation from the ground up. , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[20]  Onur Mutlu,et al.  Improving GPU performance via large warps and two-level warp scheduling , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[21]  Nam Sung Kim,et al.  GPUWattch: enabling energy optimizations in GPGPUs , 2013, ISCA.

[22]  Mohammad Abdel-Majeed,et al.  Warped register file: A power efficient register file for GPGPUs , 2013, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).

[23]  Karthikeyan Sankaralingam,et al.  A fast and highly accurate path delay emulation framework for logic-emulation of timing speculation , 2010, 2010 IEEE International Test Conference.

[24]  Hyesoon Kim,et al.  Performance Analysis and Tuning for General Purpose Graphics Processing Units , 2012 .

[25]  Henry Wong,et al.  Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[26]  Mattan Erez,et al.  CAPRI: Prediction of compaction-adequacy for handling control-divergence in GPGPU architectures , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[27]  Sanjay J. Patel,et al.  ReStore: symptom based soft error detection in microprocessors , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).

[28]  Albert Meixner,et al.  Argus: Low-Cost, Comprehensive Error Detection in Simple Cores , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[29]  John Sartori,et al.  Power balanced pipelines , 2012, IEEE International Symposium on High-Performance Comp Architecture.

[30]  Milo M. K. Martin,et al.  SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery , 2002, Proceedings 29th Annual International Symposium on Computer Architecture.

[31]  Mahmut T. Kandemir,et al.  OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance , 2013, ASPLOS '13.

[32]  David Blaauw,et al.  Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation , 2003, MICRO.

[33]  Todd M. Austin,et al.  CrashTest: A fast high-fidelity FPGA-based resiliency analysis framework , 2008, 2008 IEEE International Conference on Computer Design.

[34]  Sudhakar Yalamanchili,et al.  Ocelot: A dynamic optimization framework for bulk-synchronous applications in heterogeneous systems , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[35]  Steven S. Muchnick,et al.  Advanced Compiler Design and Implementation , 1997 .

[36]  John Y. Chen,et al.  GPU technology trends and future requirements , 2009, 2009 IEEE International Electron Devices Meeting (IEDM).

[37]  Richard M. Russell,et al.  The CRAY-1 computer system , 1978, CACM.

[38]  Tor M. Aamodt,et al.  Thread block compaction for efficient SIMT control flow , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[39]  Christopher Batten,et al.  Exploring the Tradeoffs between Programmability and Efficiency in Data-Parallel Accelerators , 2013, ACM Trans. Comput. Syst..

[40]  Kevin Skadron,et al.  Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[41]  Carlos González,et al.  ATTILA: a cycle-level execution-driven simulator for modern GPU architectures , 2006, 2006 IEEE International Symposium on Performance Analysis of Systems and Software.

[42]  Xin Fu,et al.  Analyzing soft-error vulnerability on GPGPU microarchitecture , 2011, 2011 IEEE International Symposium on Workload Characterization (IISWC).

[43]  Mike O'Connor,et al.  Cache-Conscious Wavefront Scheduling , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[44]  Luigi Carro,et al.  Neutron radiation test of graphic processing units , 2012, 2012 IEEE 18th International On-Line Testing Symposium (IOLTS).

[45]  Karthikeyan Sankaralingam,et al.  Virtually-aged sampling DMR: Unifying circuit failure prediction and circuit failure detection , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[46]  Daniel J. Sorin,et al.  Exploring memory consistency for massively-threaded throughput-oriented processors , 2013, ISCA.

[47]  Mike O'Connor,et al.  Cache coherence for GPU architectures , 2013, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).

[48]  Eric Rotenberg,et al.  FabScalar: Composing synthesizable RTL designs of arbitrary cores within a canonical superscalar template , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).