Automated design of application-specific superscalar processors

Automated design of superscalar processors can provide future system-on-chip (SOC) designers with a turn-key method of generating superscalar processors that are Pareto-optimal in terms of performance, energy consumption, and area for the target application program(s). Unfortunately, current optimization methods rely on time-consuming cycle-accurate simulation, which is unsuitable for analyzing the hundreds of thousands of design options that must be evaluated to arrive at Pareto-optimal designs. This dissertation bridges the gap between the large design space of superscalar processors and the inability of cycle-accurate simulation to cover it, by providing a computationally and conceptually simple analytical method for generating Pareto-optimal superscalar processor designs. The proposed and evaluated analytical method consists of three parts: (1) a method for analytically estimating performance, in terms of cycles per instruction (CPI), from the application program statistics and the superscalar processor parameters, (2) a method for analytically estimating the various energy-consuming activities from the same program statistics and processor parameters, and (3) a search method for systematically finding the Pareto-optimal designs. At the heart of these three parts are analytical equations that model the fundamental governing principles of superscalar processors. These equations are simple, yet accurate enough to find the Pareto-optimal superscalar processor designs quickly. In addition to being computationally simple, the analytical design optimization method is conceptually simple. It gives clear design guidance by providing (1) the ability to visualize performance-degrading events, such as branch mispredictions and instruction cache misses, (2) the ability to analyze energy-consuming activity at the microarchitecture level, and (3) the cause-and-effect relationships between superscalar core design parameters.
This conceptual simplicity makes the analytical method quick to grasp and provides key insights into the inner workings of superscalar processors. Overall, the proposed analytical design optimization method can provide future SOC designers with an automated approach for generating Pareto-optimal application-specific superscalar processors with minimal design time and effort.
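
The flow described above (estimate each candidate analytically, then keep only the Pareto-optimal ones) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the dissertation: the additive CPI form (an ideal CPI plus a penalty for each miss event), the names `Design`, `estimate_cpi`, `dominates`, and `pareto_front`, and all numeric values. Energy and area are given as plain numbers here rather than derived from a model.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Design(NamedTuple):
    """One candidate processor configuration (hypothetical fields)."""
    name: str
    cpi: float     # cycles per instruction, lower is better
    energy: float  # arbitrary energy units, lower is better
    area: float    # arbitrary area units, lower is better


def estimate_cpi(ideal_cpi: float,
                 miss_events: Iterable[Tuple[float, float]]) -> float:
    """Illustrative additive estimate (an assumption, not the author's equations).

    miss_events holds (events_per_instruction, penalty_cycles) pairs, e.g.
    one pair for branch mispredictions and one for instruction cache misses.
    """
    return ideal_cpi + sum(rate * penalty for rate, penalty in miss_events)


def dominates(a: Design, b: Design) -> bool:
    """True if a is no worse than b on every metric and better on at least one."""
    no_worse = a.cpi <= b.cpi and a.energy <= b.energy and a.area <= b.area
    better = a.cpi < b.cpi or a.energy < b.energy or a.area < b.area
    return no_worse and better


def pareto_front(designs: List[Design]) -> List[Design]:
    """Keep the designs that no other design dominates (O(n^2) scan)."""
    return [d for d in designs
            if not any(dominates(other, d) for other in designs)]


# Made-up miss-event statistics: (events per instruction, penalty in cycles).
events = [(0.01, 10.0),   # branch mispredictions
          (0.002, 50.0)]  # instruction cache misses

candidates = [
    Design("narrow", estimate_cpi(0.5, events), energy=5.0, area=10.0),
    Design("wide", estimate_cpi(0.25, events), energy=8.0, area=16.0),
    # Same CPI as "narrow" but more energy and area, so it is dominated.
    Design("oversized", estimate_cpi(0.5, events), energy=6.0, area=12.0),
]

front = pareto_front(candidates)
print([d.name for d in front])
```

Because each candidate is evaluated by a closed-form expression rather than a simulation run, a scan like this stays cheap even over a large number of configurations. The quadratic dominance check is the simplest possible choice and could be replaced by a smarter search.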
