Predicting whether a branch is taken or not taken still requires calculating the branch target.
This calculation takes one cycle, so every taken branch incurs a 1-cycle penalty.
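As a rough illustration of what this penalty costs, the effective CPI can be estimated as below. The symbols and the numbers are assumptions chosen only for the example, not measurements: $f_b$ is the fraction of instructions that are branches, $p_t$ is the fraction of those branches that are taken, and the base CPI is assumed to be 1.

```latex
\mathrm{CPI} = 1 + f_b \cdot p_t \cdot (\text{1-cycle penalty})
% Assumed example values: f_b = 0.2, p_t = 0.6
\mathrm{CPI} = 1 + 0.2 \times 0.6 \times 1 = 1.12
```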
In theory, delayed branches have zero delay, but they have the following disadvantages:
Branch delay can increase to multiple delay slots in deeper pipelines.
Branch delay slots must be filled with useful instructions or nops.
Another way to achieve zero delay is to use a branch target buffer, a structure that caches the destination program counter or the destination instruction of a branch.
The figure shows the structure of the branch target buffer and the branch prediction buffer, which will be explained in the next slide.
A branch target buffer is usually organized as a cache with tags, making it more costly than a simple prediction buffer, which uses a small memory instead.
The approach of using a branch target buffer works as follows:
Check the PC to see if the instruction being fetched is a branch.
The branch target address is stored in the branch target buffer, which is accessed in the IF stage.
If the branch is predicted taken,
then “next PC = branch target fetched from branch target buffer”
else “next PC = PC + 4”
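The steps above can be sketched in software as a small direct-mapped, tagged table. This is only a minimal model under stated assumptions: the class and method names (`BranchTargetBuffer`, `next_pc`, `update`), the 16-entry size, the 4-byte instruction alignment, and the single prediction bit per entry are all hypothetical choices for illustration, not details given in the slides.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class BTBEntry:
    tag: int             # upper PC bits that identify the branch
    target: int          # cached branch target address
    predict_taken: bool  # prediction bit (a single bit is assumed here)


class BranchTargetBuffer:
    """Hypothetical direct-mapped BTB model, looked up during IF."""

    def __init__(self, num_entries: int = 16) -> None:
        self.num_entries = num_entries
        self.entries: List[Optional[BTBEntry]] = [None] * num_entries

    def _index_and_tag(self, pc: int) -> Tuple[int, int]:
        word = pc >> 2  # assumes 4-byte aligned instructions
        return word % self.num_entries, word // self.num_entries

    def next_pc(self, pc: int) -> int:
        index, tag = self._index_and_tag(pc)
        entry = self.entries[index]
        # A tag match means the fetched instruction is a known branch.
        if entry is not None and entry.tag == tag and entry.predict_taken:
            return entry.target  # next PC = target from the BTB
        return pc + 4            # next PC = PC + 4

    def update(self, pc: int, target: int, taken: bool) -> None:
        # Called once the branch resolves; overwrites the indexed entry.
        index, tag = self._index_and_tag(pc)
        self.entries[index] = BTBEntry(tag, target, taken)
```

The tag comparison is what distinguishes this from an untagged table: a different instruction that maps to the same index does not pick up the stored target and simply falls through to PC + 4.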
The prediction bits indicate whether a branch is predicted to be taken or not taken.
They are dynamically determined by the hardware.
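One common way for hardware to set prediction bits dynamically is a 2-bit saturating counter. The slides do not say how many bits are used or how they are updated, so the sketch below is an assumption for illustration; the class name `TwoBitPredictor` and its initial state are hypothetical.

```python
class TwoBitPredictor:
    """Assumed 2-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, state: int = 1) -> None:
        self.state = state  # starts as weakly not taken (assumed)

    def predict_taken(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        # Move toward 3 on a taken branch and toward 0 otherwise, saturating at the ends.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)
```

With two bits, a single mispredicted outcome from the strongly taken state does not flip the prediction, which helps with loop branches that are taken many times and fall through once.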