Scalable superscalar processing

This dissertation demonstrates that ordinary programs contain sufficient parallelism to scale the issue width of out-of-order superscalar processors, provided that the processors employ very large instruction windows and near-perfect dynamic memory disambiguation. State-of-the-art instruction wake-up and dynamic memory disambiguation techniques are analyzed in detail and shown not to scale beyond an issue width of 8. The dissertation proposes alternative instruction wake-up and dynamic memory disambiguation techniques that scale well up to an issue width of 32. Large instruction windows can be implemented without adversely affecting the processor clock by dynamically generating a dependence graph, which is then used to directly wake up instructions that are shelved in the reorder buffer. The resulting microarchitecture is called the Direct Wake-up Microarchitecture (DWMA). DWMA implements very large instruction windows with little loss in performance compared to an ideal central window implementation of the same size. For example, the DWMA processor achieves 84%, 79%, and 67% of the performance of an ideal central window processor at issue widths of 8, 16, and 32 instructions, respectively. The solution to scalable dynamic memory disambiguation is based on a novel memory order violation detection mechanism that allows store instructions in the instruction window to issue fully out of order. As a result, memory dependence predictors that rely only on program counter values to make their predictions can be employed effectively without introducing false memory dependencies. Using this technique together with the store-set memory disambiguator, a processor can achieve 100%, 96%, and 85% of the performance of a processor that embodies a "perfect" memory disambiguator at issue widths of 8, 16, and 32 instructions, respectively.
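As a rough illustration of the direct wake-up idea, the following Python sketch builds a dependence graph as instructions enter the window and lets each completing instruction wake only its recorded consumers, rather than broadcasting its result tag to every window entry. It is a software analogy under stated assumptions, not the DWMA hardware design: the `Entry` layout, the functions `dispatch` and `issue_order`, the register names, and the single-cycle latency are all hypothetical and chosen only for illustration.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Entry:
    """One reorder-buffer slot (hypothetical layout, for illustration only)."""
    name: str
    dest: Optional[str]
    srcs: tuple
    pending: int = 0                               # source operands not yet produced
    consumers: list = field(default_factory=list)  # outgoing dependence-graph edges


def dispatch(program):
    """Build the dependence graph in program order as instructions enter the window."""
    rob = []
    last_writer = {}  # register name -> ROB index of its latest in-flight producer
    for name, dest, srcs in program:
        idx = len(rob)
        entry = Entry(name, dest, tuple(srcs))
        for reg in entry.srcs:
            producer = last_writer.get(reg)
            if producer is not None:  # true (read-after-write) dependence
                rob[producer].consumers.append(idx)
                entry.pending += 1
        if dest is not None:
            last_writer[dest] = idx
        rob.append(entry)
    return rob


def issue_order(rob, width):
    """Issue up to `width` ready instructions per cycle (assumes unit latency).

    Returns the instruction names grouped by issue cycle.
    """
    ready = deque(i for i, e in enumerate(rob) if e.pending == 0)
    cycles = []
    while ready:
        issued = [ready.popleft() for _ in range(min(width, len(ready)))]
        cycles.append([rob[i].name for i in issued])
        for i in issued:
            # Direct wake-up: only the recorded consumers are touched,
            # so the cost does not grow with the window size.
            for c in rob[i].consumers:
                rob[c].pending -= 1
                if rob[c].pending == 0:
                    ready.append(c)
    return cycles


# Hypothetical four-instruction window: (name, destination, sources).
program = [
    ("i0", "r1", []),
    ("i1", "r2", ["r1"]),
    ("i2", "r3", []),
    ("i3", "r4", ["r2", "r3"]),
]
schedule = issue_order(dispatch(program), width=2)
```

In this sketch the work done at wake-up is proportional to the number of consumers of the completing instruction, which is the property that makes a direct scheme attractive for very large windows; a real implementation would also have to handle variable latencies, limits on the number of edges per entry, and recovery from misspeculation.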
Evaluating both the existing techniques and the new ones required the development of many simulators. To meet this need, a new domain-specific language called the Architecture Description Language (ADL) was designed and implemented in a powerful simulation system called the Flexible Architecture Simulation Tool (FAST). FAST was used to generate all of the cycle-accurate simulators required for this thesis.