A Distributed, Simultaneously Multi-Threaded (SMT) Processor with Clustered Scheduling Windows for Scalable DSP Performance

A scalable, distributed processor architecture is presented that targets high-performance computing for digital signal processing applications by combining high-frequency design techniques with a very high degree of on-chip parallelism. The architecture is based on a superscalar processor model with a modified Tomasulo scheme, extended to eliminate all central control structures for the data flow and to support simultaneous instruction issue from multiple independent threads (simultaneous multithreading, SMT). Consistent application of fine-grained clustering reduces the cycle time of wire-sensitive building blocks of the processor, such as the register file and the scheduling window. It leads to a distributed architecture model in which independent thread processing units, arithmetic logic units, register files, and memories are spread across the chip and communicate with each other over a special network. A special communication protocol replaces the broadcasting and associative comparison of destination tags in a centralized instruction scheduler with explicit operand transfer instructions, thus decentralizing the control of the data flow to the greatest possible extent. As a result, the processor cycle time depends neither on the issue bandwidth of a single thread nor on the execution bandwidth of the SMT processor. The performance of the architecture therefore scales with both the number of function units and the number of thread units without any impact on the processor's cycle time. The performance and scalability of the proposed microarchitecture are demonstrated on a cycle-true simulator with critical signal processing kernels from the MPEG-4 video coding standard.
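
The difference between tag broadcast with associative compare and explicit operand transfer can be illustrated with a small sketch. It is not taken from the paper: the names `Entry`, `broadcast_wakeup`, `explicit_transfer`, and the four-entry window are hypothetical, and the model ignores timing, the network, and multiple threads. It only counts how much work each scheme does to deliver one result to its consumers.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One waiting instruction with two source operand slots (hypothetical model)."""
    name: str
    src_tags: list  # tag each slot waits for; only the broadcast scheme reads this
    operands: list = field(default_factory=lambda: [None, None])


def broadcast_wakeup(window, tag, value):
    """Centralized scheme: every slot of every entry is compared against the tag."""
    compares = 0
    for entry in window:
        for slot, wanted in enumerate(entry.src_tags):
            compares += 1
            if wanted == tag:
                entry.operands[slot] = value
    return compares


def explicit_transfer(destinations, value):
    """Decentralized scheme: the producer writes only to the slots it was told about."""
    transfers = 0
    for entry, slot in destinations:
        entry.operands[slot] = value
        transfers += 1
    return transfers


def make_window():
    # Assumed example window: three of the eight operand slots wait for tag "t1".
    return [
        Entry("add", ["t1", "t2"]),
        Entry("mul", ["t1", "t3"]),
        Entry("sub", ["t4", "t5"]),
        Entry("shl", ["t6", "t1"]),
    ]


def demo():
    w1 = make_window()
    compares = broadcast_wakeup(w1, "t1", 42)

    w2 = make_window()
    # Destination list that explicit transfer instructions would encode.
    destinations = [(w2[0], 0), (w2[1], 0), (w2[3], 1)]
    transfers = explicit_transfer(destinations, 42)

    same_state = [e.operands for e in w1] == [e.operands for e in w2]
    return compares, transfers, same_state


if __name__ == "__main__":
    print(demo())
```

In this toy model the broadcast scheme performs one comparison per operand slot in the window, so its work grows with window size, while the explicit scheme performs one transfer per actual consumer. Both leave the window in the same state.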
