A Scalable, Multi-thread, Multi-issue Array Processor Architecture for DSP Applications Based on Extended Tomasulo Scheme

A scalable, distributed micro-architecture is presented that targets high-performance computing for digital signal processing applications by combining high-frequency design techniques with a very high degree of parallel processing on a chip. The architecture is based on a superscalar processor model with out-of-order execution that supports specialized, complex DSP function units and simultaneous instruction issue from multiple independent threads (SMT). Consistent application of fine-grained clustering reduces the cycle time of wire-sensitive building blocks of the processor, such as the register file, and leads to a distributed architecture model in which independent thread processing units, ALUs, register files, and memories are distributed across the chip and communicate with each other over special networks, forming a network-on-a-chip (NOC) [1]. The communication protocol is a modified version of Tomasulo's scheme [2], extended to eliminate all central control structures for the data flow and to support multithreading. The performance of the architecture scales with both the number of function units and the number of thread units without any impact on the processor's cycle time.
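
As a rough illustration of the idea behind a Tomasulo-style protocol extended for multithreading, the following minimal sketch models reservation stations that wake up when a matching result tag is broadcast. It is not the protocol of the presented architecture: the tag layout `(thread_id, producer_id)`, the class names, and the `broadcast` helper are all assumptions made for this example. The point it shows is that each station decides locally whether a broadcast concerns it, so no central scoreboard is needed, and that including a thread identifier in the tag keeps threads from consuming each other's results.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical tag format: (thread_id, producer_id). Not taken from the paper.
Tag = Tuple[int, int]


@dataclass
class Operand:
    value: Optional[int] = None
    tag: Optional[Tag] = None  # set while the operand waits for a producer

    @property
    def ready(self) -> bool:
        return self.tag is None


@dataclass
class ReservationStation:
    op: str
    a: Operand
    b: Operand

    def snoop(self, tag: Tag, value: int) -> None:
        """Capture a broadcast result if it matches a pending operand tag."""
        for operand in (self.a, self.b):
            if operand.tag == tag:
                operand.value = value
                operand.tag = None

    def ready(self) -> bool:
        return self.a.ready and self.b.ready

    def execute(self) -> int:
        ops = {"add": lambda x, y: x + y, "mul": lambda x, y: x * y}
        return ops[self.op](self.a.value, self.b.value)


def broadcast(stations: List[ReservationStation], tag: Tag, value: int) -> None:
    """Every station checks the tag locally; there is no central controller."""
    for station in stations:
        station.snoop(tag, value)


def demo() -> Tuple[bool, int, bool]:
    # Both stations wait on producer 3, but in different threads (0 and 1).
    stations = [
        ReservationStation("add", Operand(value=1), Operand(tag=(0, 3))),
        ReservationStation("mul", Operand(value=2), Operand(tag=(1, 3))),
    ]
    broadcast(stations, (0, 3), 7)  # result belonging to thread 0 only
    return stations[0].ready(), stations[0].execute(), stations[1].ready()
```

In `demo()`, the broadcast tagged for thread 0 wakes only the first station; the second station, waiting on the same producer number in thread 1, stays pending until its own thread's result arrives.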

[1] Ruby B. Lee. Accelerating multimedia with enhanced microprocessors. IEEE Micro, 1995.

[2] Monica S. Lam, et al. Limits of control flow on parallelism. ISCA '92, 1992.

[3] Minerva M. Yeung, et al. The impact of SMT/SMP designs on multimedia software engineering - a workload analysis study. Fourth International Symposium on Multimedia Software Engineering, 2002.

[4] Henry Hoffmann, et al. The Raw Microprocessor: A Computational Fabric for Software Circuits and General-Purpose Programs. IEEE Micro, 2002.

[5] Peter Pirsch, et al. Instruction Set Extensions for MPEG-4 Video. J. VLSI Signal Process., 1999.

[6] David W. Wall, et al. Limits of instruction-level parallelism. ASPLOS IV, 1991.

[7] Peter Pirsch, et al. HiBRID-SoC: a multi-core system-on-chip architecture for multimedia signal processing applications. Design, Automation and Test in Europe Conference and Exhibition, 2003.

[8] James William Stroming. VLSI Architectures for MPEG-4 Video Object Decoding. 1998.

[9] Luca Benini, et al. Networks on Chips: A New SoC Paradigm. 2002.

[10] Yale N. Patt, et al. Select-free instruction scheduling logic. MICRO, 2001.

[11] Frank Vahid, et al. The Softening of Hardware. Computer, 2003.

[12] Peter Pirsch, et al. An Algorithm-Hardware-System Approach to VLIW Multimedia Processors. J. VLSI Signal Process., 1998.

[13] Dean M. Tullsen, et al. Simultaneous multithreading: Maximizing on-chip parallelism. Proceedings 22nd Annual International Symposium on Computer Architecture, 1995.

[14] Peter Pirsch, et al. VLSI architectures for MPEG. International Symposium on VLSI Technology, Systems and Applications, Proceedings of Technical Papers, 2003.

[15] Henk Corporaal. Microprocessor architectures - from VLIW to TTA. 1997.

[16] E. Sackinger, et al. A single-chip, 1.6-billion, 16-b MAC/s multiprocessor DSP. IEEE Journal of Solid-State Circuits, 2000.

[17] R. M. Tomasulo, et al. An efficient algorithm for exploiting multiple arithmetic units. 1995.

[18] Richard E. Kessler, et al. The Alpha 21264 microprocessor. IEEE Micro, 1999.

[19] Mircea R. Stan, et al. 5-GHz 32-bit Integer Execution Core in 130-nm Dual-VT CMOS. 2001.

[20] J. Tschanz, et al. A 25 GHz 32 b integer-execution core in 130 nm dual-V_T CMOS. IEEE International Solid-State Circuits Conference, Digest of Technical Papers, 2002.

[21] Ken Mai, et al. The future of wires. Proc. IEEE, 2001.

[22] James E. Smith, et al. Complexity-Effective Superscalar Processors. 24th Annual International Symposium on Computer Architecture, 1997.

[23] Norman P. Jouppi, et al. The multicluster architecture: reducing cycle time through partitioning. Proceedings of 30th Annual International Symposium on Microarchitecture, 1997.

[24] Itsujiro Arita, et al. Revisiting Direct Tag Search Algorithm on Superscalar Processors. 2001.

[25] Karthikeyan Sankaralingam, et al. A design space evaluation of grid processor architectures. MICRO, 2001.

[26] H. Zhang, et al. A 1 V heterogeneous reconfigurable processor IC for baseband wireless applications. IEEE International Solid-State Circuits Conference, Digest of Technical Papers, 2000.

[27] James E. Smith, et al. Instruction Issue Logic in Pipelined Supercomputers. IEEE Transactions on Computers, 1984.

[28] Shinji Tomita. Great Books and Papers of the 20th Century: R. M. Tomasulo: An Efficient Algorithm for Exploiting Multiple Arithmetic Units. 2004.

[29] Vikas Agarwal, et al. Clock rate versus IPC: the end of the road for conventional microarchitectures. Proceedings of 27th International Symposium on Computer Architecture, 2000.

[30] Peter Pirsch, et al. Realization of a programmable parallel DSP for high performance image processing applications. Proceedings 1998 Design and Automation Conference (35th DAC), 1998.

[31] Yervant Zorian, et al. 2001 Technology Roadmap for Semiconductors. Computer, 2002.

[32] R. Nagarajan, et al. A design space evaluation of grid processor architectures. Proceedings of 34th ACM/IEEE International Symposium on Microarchitecture (MICRO-34), 2001.

[33] S. Aign, et al. Overview of the MPEG-4 Standard and Error Resilience Investigations. 1998.

[34] Alan Jay Smith, et al. Measuring the Performance of Multimedia Instruction Sets. IEEE Trans. Computers, 2002.

[35] Peter Pirsch, et al. Multicore system-on-chip architecture for MPEG-4 streaming video. IEEE Trans. Circuits Syst. Video Technol., 2002.