Multi-threading in Uni-threaded Processor

This chapter introduces the concept of executing multiple incompatible loops in parallel, thereby enabling efficient multi-threading on a VLIW processor. The proposed multi-threading relies on a distributed instruction memory organization that adds minimal hardware overhead, and it forms one of the core contributions of this book. The chapter shows that the proposed extension of the instruction memory hierarchy can both improve performance and reduce energy consumption compared to state-of-the-art simultaneous multi-threaded (SMT) architectures across various DSP benchmarks. It also shows that code can be compiled for the proposed architecture.
