A Cost-Effective Branch Target Buffer with a Two-Level Table Organization

Accurate branch prediction is required in microprocessors. Branch target addresses are generally predicted with a BTB (Branch Target Buffer) [2]. To achieve high prediction accuracy, a BTB requires many entries, which considerably increases the amount of hardware. This paper proposes a new scheme, called the two-level table scheme, that reduces the amount of BTB hardware by exploiting the characteristics of branch target addresses.

Figure 1 shows the organization of our two-level table scheme, which consists of two tables. Each entry of the first table contains the low-order bits of a branch target address, a tag, and a far bit (F). Each entry of the second table contains the high-order bits of a branch target address and a tag. Prediction is performed as follows. Both tables are accessed simultaneously, indexed by the branch PC. If the F bit obtained from the first table is zero, the high-order part of the branch PC is selected and concatenated with the low-order part of the target address from the first table. Otherwise, the high-order part of the target address from the second table is selected and concatenated with the low-order part of the target address from the first table.

Figure 2 shows the evaluation results. All results were collected with our trace-driven simulator, using eight programs from the SPEC95 integer suite as benchmarks. The instruction address length is 30 bits. The tag reduction scheme proposed by Fagin and Russell [1] is used. The baseline conventional BTB is a 2048-entry, 2-way set-associative table. The evaluated two-level BTB consists of a 2048-entry, 2-way set-associative first table and a 256-entry, 4-way set-associative second table. We evaluated branch prediction accuracy with various numbers of T bits. As the number of T bits decreases, the hardware cost falls but the branch prediction accuracy can degrade.
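The prediction procedure described above can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not the evaluated hardware: the tables are modeled as dictionaries keyed by the branch PC (set associativity and tag matching are abstracted away), the split point LOW_BITS is a hypothetical value that the text does not give, and returning no prediction when the F bit is set but the second table misses is also an assumption.

```python
from typing import Dict, Optional, Tuple

ADDR_BITS = 30  # instruction address length given in the text
LOW_BITS = 12   # ASSUMPTION: width of the low-order part; not stated in the text
LOW_MASK = (1 << LOW_BITS) - 1

# Simplified tables: a dictionary hit stands in for an index + tag match.
FirstTable = Dict[int, Tuple[int, bool]]  # pc -> (low-order target bits, F bit)
SecondTable = Dict[int, int]              # pc -> high-order target bits


def predict_target(pc: int, first_table: FirstTable,
                   second_table: SecondTable) -> Optional[int]:
    """Return the predicted target address, or None when no prediction is made."""
    pc &= (1 << ADDR_BITS) - 1
    # In hardware both tables are read at the same time; here we just read both.
    first = first_table.get(pc)
    high_from_second = second_table.get(pc)
    if first is None:
        return None
    low, far = first
    if not far:
        # F = 0: reuse the high-order part of the branch PC itself.
        high = pc >> LOW_BITS
    elif high_from_second is not None:
        # F = 1: take the high-order part from the second table.
        high = high_from_second
    else:
        # ASSUMPTION: a far branch that misses in the second table is a miss.
        return None
    return (high << LOW_BITS) | (low & LOW_MASK)
```

For example, with a first-table entry whose F bit is zero, the predicted target shares its high-order bits with the branch PC, so only the low-order bits need to be stored for that branch.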
As shown in Figure 2, our two-level scheme reduces the hardware cost with little loss of prediction accuracy relative to the conventional scheme. For example, compared with the baseline BTB, whose hardware cost is 66 Kbits, our scheme reduces the hardware cost by 40% with only 0.1% degradation in prediction accuracy. In addition, the prediction accuracy of our scheme degrades slowly as the given hardware cost becomes small. This indicates that our scheme remains usable for various applications, including those that contain more long-distance branches.
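To make the hardware cost comparison concrete, the following sketch counts storage bits for both organizations. It is a rough model under assumptions, not the cost model used in the evaluation: it counts only the fields named in the text (target bits, tag, and the F bit) and ignores valid and replacement bits. The tag widths and the 12/18 split of the 30-bit address are hypothetical. A 3-bit tag is chosen only because 2048 entries of 30 + 3 bits equal 66 Kbits (with 1 Kbit = 1024 bits), which matches the stated baseline cost; the result for the two-level tables is not claimed to reproduce the reported 40% figure.

```python
def conventional_btb_bits(entries: int, target_bits: int, tag_bits: int) -> int:
    """Storage bits of a conventional BTB: full target address plus tag per entry."""
    return entries * (target_bits + tag_bits)


def two_level_btb_bits(first_entries: int, low_bits: int, first_tag_bits: int,
                       second_entries: int, high_bits: int,
                       second_tag_bits: int) -> int:
    """Storage bits of the two-level organization.

    First table entry:  low-order target bits + tag + 1 far bit (F).
    Second table entry: high-order target bits + tag.
    """
    first = first_entries * (low_bits + first_tag_bits + 1)
    second = second_entries * (high_bits + second_tag_bits)
    return first + second


# Table sizes and the 30-bit address come from the text; all other widths are assumed.
baseline = conventional_btb_bits(entries=2048, target_bits=30, tag_bits=3)
two_level = two_level_btb_bits(first_entries=2048, low_bits=12, first_tag_bits=3,
                               second_entries=256, high_bits=18, second_tag_bits=3)
reduction = 1 - two_level / baseline
print(f"baseline: {baseline / 1024:.2f} Kbits, "
      f"two-level: {two_level / 1024:.2f} Kbits, "
      f"reduction: {reduction:.1%}")
```

The saving in this model comes from storing the high-order bits only in the much smaller second table, so the result depends strongly on the assumed split between low-order and high-order bits.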