SIMD defragmenter: efficient ILP realization on data-parallel architectures

Single-instruction multiple-data (SIMD) accelerators provide an energy-efficient platform to scale the performance of mobile systems while still retaining post-programmability. The central challenge is translating the parallel resources of the SIMD hardware into real application performance. In scientific applications, automatic vectorization techniques have proven quite effective at extracting large amounts of data-level parallelism (DLP). However, vectorization is often much less effective for media applications because of low-trip-count loops, complex control flow, and non-uniform execution behavior. As a result, SIMD lanes remain idle due to insufficient DLP. To attack this problem, this paper proposes a new vectorization pass called SIMD Defragmenter, which uncovers hidden DLP that lurks below the surface in the form of instruction-level parallelism (ILP). The difficulty is managing the data packing/unpacking overhead, which can easily exceed the benefits gained through SIMD execution. The SIMD Defragmenter overcomes this problem by identifying groups of compatible instructions (subgraphs) that can be executed in parallel across the SIMD lanes. By SIMDizing in bulk at the subgraph level, packing/unpacking overhead is minimized. On a 16-lane SIMD processor, experimental results show that SIMD defragmentation achieves a mean 1.6x speedup over traditional loop vectorization and a 31% gain over prior research approaches for converting ILP to DLP.
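
To make the packing/unpacking argument concrete, the following is a minimal sketch, not the paper's algorithm. It assumes a hypothetical toy IR (`Op`), a hypothetical cost function (`packing_overhead`), and a simplified cost model in which every operand vector that is not already lane-aligned in a vector register costs one pack, and every vector result needed as scalars costs one unpack. Under those assumptions it compares SIMDizing three instruction pairs in isolation against SIMDizing the two identical three-instruction subgraphs as a whole.

```python
from collections import namedtuple

# Hypothetical IR: each op has a name, an opcode, and source operand names.
# Sources that no op produces are treated as scalar live-ins.
Op = namedtuple("Op", ["name", "opcode", "srcs"])


def packing_overhead(ops, groups, live_out):
    """Count pack + unpack operations implied by a lane assignment.

    groups: list of tuples of op names; position in the tuple is the lane.
    Simplified model: a misaligned operand always costs a pack (no permutes).
    """
    by_name = {op.name: op for op in ops}
    lane_of = {}  # op name -> (group index, lane)
    for g, group in enumerate(groups):
        opcodes = {by_name[n].opcode for n in group}
        assert len(opcodes) == 1, "lanes must hold compatible (same-opcode) ops"
        for lane, name in enumerate(group):
            lane_of[name] = (g, lane)

    packs = 0
    unpack_groups = set()  # vector results that must be split into scalars

    for group in groups:
        for k in range(len(by_name[group[0]].srcs)):
            producers = [by_name[n].srcs[k] for n in group]
            locs = [lane_of.get(p) for p in producers]
            # Aligned: all lanes fed by the same vector group, lane for lane.
            aligned = (
                all(loc is not None for loc in locs)
                and len({loc[0] for loc in locs}) == 1
                and all(loc[1] == lane for lane, loc in enumerate(locs))
            )
            if not aligned:
                packs += 1
                unpack_groups.update(
                    lane_of[p][0] for p in producers if p in lane_of
                )

    # Scalar consumers and live-out values force an unpack of their producer.
    for op in ops:
        if op.name not in lane_of:
            unpack_groups.update(lane_of[s][0] for s in op.srcs if s in lane_of)
    unpack_groups.update(lane_of[n][0] for n in live_out if n in lane_of)

    return packs + len(unpack_groups)


# Two identical chains: s_i = (x_i * y_i + z_i) << k_i, for i = 0, 1.
ops = []
for i in (0, 1):
    ops += [
        Op(f"m{i}", "mul", (f"x{i}", f"y{i}")),
        Op(f"a{i}", "add", (f"m{i}", f"z{i}")),
        Op(f"s{i}", "shl", (f"a{i}", f"k{i}")),
    ]
groups = [("m0", "m1"), ("a0", "a1"), ("s0", "s1")]
live_out = ["s0", "s1"]

# Each pair SIMDized on its own versus the whole subgraph SIMDized in bulk.
isolated = sum(packing_overhead(ops, [g], live_out) for g in groups)
bulk = packing_overhead(ops, groups, live_out)
print(isolated, bulk)  # prints "9 5"
```

In this toy model, the intermediate values `m` and `a` stay in vector registers when the whole subgraph is SIMDized, so only the live-in operands are packed and only the final result is unpacked. That is the intuition behind SIMDizing at the subgraph level; the actual cost model and subgraph identification in the paper may differ.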
