A first-order fine-grained multithreaded throughput model

Analytical modeling is an alternative to detailed performance simulation with the potential to shorten the development cycle and provide additional insights. This paper proposes analytical models for predicting the cache contention and throughput of heavily multithreaded architectures such as Sun Microsystems' Niagara. First, it proposes a novel probabilistic model to accurately predict the number of extra cache misses due to cache contention for significantly larger numbers of threads than possible with prior analytical cache contention models. Then it presents a Markov chain model for analytically estimating the throughput of multicore, fine-grained multithreaded architectures. The Markov model uses the number of stalled threads as the states and calculates transition probabilities based upon the rates and latencies of events stalling a thread. By modeling the overlapping of the stalls among threads and taking account of cache contention our models accurately predict system throughput obtained from a cycle-accurate performance simulator with an average error of 7.9%. We also demonstrate the application of our model to a design problem—optimizing the design of fine-grained multithreaded chip multiprocessors for application-specific workloads—yielding the same result as detailed simulations 65 times faster. Moreover, this paper shows that our models accurately predict cache contention and throughput trends across varying workloads on real hardware—a Sun Fire T1000 server.

[1]  Yan Solihin,et al.  Predicting inter-thread cache contention on a chip multi-processor architecture , 2005, 11th International Symposium on High-Performance Computer Architecture.

[2]  John Paul Shen,et al.  Theoretical modeling of superscalar processor performance , 1994, Proceedings of MICRO-27. The 27th Annual IEEE/ACM International Symposium on Microarchitecture.

[3]  C. J. Stone,et al.  Introduction to Stochastic Processes , 1972 .

[4]  John Wawrzynek,et al.  Research accelerator for multiple processors , 2006, 2006 IEEE Hot Chips 18 Symposium (HCS).

[5]  David Keppel,et al.  Shade: a fast instruction-set simulator for execution profiling , 1994, SIGMETRICS.

[6]  Mary K. Vernon,et al.  Parallel program performance prediction using deterministic task graph analysis , 2004, TOCS.

[7]  Mary K. Vernon,et al.  Toward a multicore architecture for real-time ray-tracing , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[8]  D. B. Davis,et al.  Sun Microsystems Inc. , 1993 .

[9]  Li Zhao,et al.  Exploring Large-Scale CMP Architectures Using ManySim , 2007, IEEE Micro.

[10]  James E. Smith,et al.  Automated design of application specific superscalar processors: an analytical approach , 2007, ISCA '07.

[11]  Harold S. Stone,et al.  Footprints in the cache , 1986, SIGMETRICS '86/PERFORMANCE '86.

[12]  Mauricio J. Serrano,et al.  Performance estimation of multistreamed, superscalar processors , 1994, 1994 Proceedings of the Twenty-Seventh Hawaii International Conference on System Sciences.

[13]  Babak Falsafi,et al.  Modeling cost/performance of a parallel computer simulator , 1997, TOMC.

[14]  James E. Smith,et al.  A first-order superscalar processor model , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[15]  Kunle Olukotun,et al.  Niagara: a 32-way multithreaded Sparc processor , 2005, IEEE Micro.

[16]  Dean M. Tullsen,et al.  Simultaneous multithreading: Maximizing on-chip parallelism , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.

[17]  John Paul Shen,et al.  A framework for statistical modeling of superscalar processor performance , 1997, Proceedings Third International Symposium on High-Performance Computer Architecture.

[18]  David A. Patterson,et al.  Computer Architecture: A Quantitative Approach , 1969 .

[19]  Stéphan Jourdan,et al.  An Exploration of Instruction Fetch Requirement in Out-of-Order Superscalar Processors , 2004, International Journal of Parallel Programming.

[20]  Tor M. Aamodt,et al.  Hybrid analytical modeling of pending cache hits, data prefetching, and MSHRs , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[21]  James E. Smith,et al.  The future of simulation: a field of dreams , 2006, Computer.

[22]  Christoforos E. Kozyrakis,et al.  RAMP: Research Accelerator for Multiple Processors , 2007, IEEE Micro.

[23]  Mark Horowitz,et al.  An analytical cache model , 1989, TOCS.

[24]  Brad Calder,et al.  Automatically characterizing large scale program behavior , 2002, ASPLOS X.

[25]  James R. Larus,et al.  The Wisconsin Wind Tunnel: virtual prototyping of parallel computers , 1993, SIGMETRICS '93.

[26]  John L. Hennessy,et al.  Efficient performance prediction for modern microprocessors , 2000, SIGMETRICS '00.

[27]  Dam Sunwoo,et al.  FPGA-Accelerated Simulation Technologies (FAST): Fast, Full-System, Cycle-Accurate Simulators , 2007, MICRO.