In previously published research we identified a class of branches that are very difficult to predict, which we called unbiased branches. Because misprediction recovery seriously affects the overall performance of modern processors, these difficult branches are an important source of performance penalties. Our statistics show that about 28% of branches depend on critical Load instructions, and 5.61% of branches are both unbiased and dependent on critical Loads. Similarly, about 21% of branches depend on MUL/DIV instructions, and 3.76% are both unbiased and dependent on MUL/DIV instructions. These dependences make mispredictions particularly costly: they become serious performance obstacles and cause significant degradation while the processor executes instructions from wrong paths. Therefore, the negative impact of (unbiased) branches on global performance should be substantially attenuated by anticipating the results of long-latency instructions, including critical Loads. On the other hand, hiding the long latencies of instructions in a pipelined superscalar processor is an important challenge in itself. We developed a superscalar architecture that selectively anticipates the values produced by high-latency instructions. In this work we focus on multiplications, divisions and loads that miss in the L1 data cache, implementing a dynamic instruction reuse scheme for MUL/DIV instructions and a simple last value predictor for critical Load instructions. Our improved superscalar architecture achieves an average IPC speedup of 3.5% on the integer SPEC 2000 benchmarks and 23.6% on the floating-point benchmarks, with energy-delay product (EDP) improvements of 6.2% and 34.5%, respectively. We also quantified the impact of our selective instruction reuse and value prediction techniques in a simultaneous multithreaded (SMT) architecture that involves per-thread reuse buffers and load value prediction tables.
Our simulation results show that the best improvements on the SPEC integer applications are obtained with 2 threads: an IPC speedup of 5.95% and an EDP gain of 10.44%. On the SPEC floating-point programs the highest improvements are obtained with the enhanced superscalar architecture, but the SMT architecture with 3 threads also provides an important IPC speedup of 16.51% and an EDP gain of 25.94%.
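To make the two anticipation mechanisms concrete, the following is a minimal software sketch of a reuse buffer for MUL/DIV instructions and a last value predictor for Loads. It is an illustration under stated assumptions, not the simulated hardware: the class names, the direct-mapped organization indexed by instruction address, the default table size of 1024 entries and the absence of confidence counters are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class ReuseEntry:
    pc: int      # address of the MUL/DIV instruction (used as a tag)
    op1: int     # first source operand value seen last time
    op2: int     # second source operand value seen last time
    result: int  # result produced for (op1, op2)


class ReuseBuffer:
    """Hypothetical direct-mapped reuse buffer, indexed by instruction address."""

    def __init__(self, entries=1024):  # table size is an assumption
        self.entries = entries
        self.table = [None] * entries

    def lookup(self, pc, op1, op2):
        """Return the stored result if the same instruction reappears with the
        same operand values, otherwise None (the instruction must execute)."""
        entry = self.table[pc % self.entries]
        if entry and entry.pc == pc and entry.op1 == op1 and entry.op2 == op2:
            return entry.result
        return None

    def update(self, pc, op1, op2, result):
        """Record the operands and result after the instruction has executed."""
        self.table[pc % self.entries] = ReuseEntry(pc, op1, op2, result)


class LastValuePredictor:
    """Hypothetical last value predictor for Loads, indexed by instruction address."""

    def __init__(self, entries=1024):  # table size is an assumption
        self.entries = entries
        self.table = [None] * entries  # each slot holds (pc, last_value) or None

    def predict(self, pc):
        """Predict that the Load returns the value it produced last time."""
        slot = self.table[pc % self.entries]
        if slot is not None and slot[0] == pc:
            return slot[1]
        return None

    def update(self, pc, value):
        """Store the value actually loaded, once it is known."""
        self.table[pc % self.entries] = (pc, value)
```

The difference between the two structures is worth noting. A reuse buffer hit is non-speculative, because the stored result is returned only when the operand values match. A last value prediction is speculative and has to be verified against the loaded value, with recovery on a mismatch; that verification and recovery logic is outside the scope of this sketch.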