Slide 14.18: Zero-delay branches

Zero-Delay Branches

Both the predict-taken and the predict-not-taken schemes require the branch target to be calculated. This calculation takes one cycle, so taken branches incur a 1-cycle penalty. In theory, delayed branches have zero delay, but they have the following disadvantages:

Branch delay can increase to multiple delay slots in deeper pipelines.

Branch delay slots must be filled with useful instructions or nops.

Another way to achieve zero delay is to use a branch target buffer, a structure that caches the destination program counter or the destination instruction for a branch. The figure shows the structure of the branch target buffer and of the branch prediction buffer, which will be explained in the next slide.

The branch target buffer is usually organized as a cache with tags. This makes it more costly than a simple prediction buffer, which uses a small memory instead.
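The difference between the two structures can be sketched in a few lines of Python. This is only an illustrative model, not the design from the figure: the entry counts, the direct-mapped organization, the 1-bit prediction, and all names (`BranchTargetBuffer`, `BranchPredictionBuffer`, `index_of`, `tag_of`) are assumptions made for the example.

```python
# Hypothetical sizes chosen for illustration; real designs vary.
ENTRIES = 16

def index_of(pc, entries=ENTRIES):
    # Assumes 4-byte instructions: drop the low 2 bits, keep the index bits.
    return (pc >> 2) % entries

def tag_of(pc, entries=ENTRIES):
    # The remaining upper bits identify which PC owns an entry.
    return (pc >> 2) // entries

class BranchTargetBuffer:
    """Cache-like and tagged: a hit means 'this PC is a known branch'."""
    def __init__(self, entries=ENTRIES):
        self.entries = entries
        self.tags = [None] * entries
        self.targets = [0] * entries

    def lookup(self, pc):
        i = index_of(pc, self.entries)
        if self.tags[i] == tag_of(pc, self.entries):
            return self.targets[i]   # hit: cached branch target
        return None                  # miss: not a known branch

    def update(self, pc, target):
        i = index_of(pc, self.entries)
        self.tags[i] = tag_of(pc, self.entries)
        self.targets[i] = target

class BranchPredictionBuffer:
    """Small memory without tags: PCs that share an index share a bit."""
    def __init__(self, entries=ENTRIES):
        self.entries = entries
        self.taken = [False] * entries

    def predict(self, pc):
        return self.taken[index_of(pc, self.entries)]

    def update(self, pc, taken):
        self.taken[index_of(pc, self.entries)] = taken
```

In this model, the tag comparison is what lets the branch target buffer tell two PCs with the same index apart. The untagged buffer cannot, so it is cheaper, but two such branches alias to the same prediction bit.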

The approach of using a branch target buffer works as follows:

Check the PC to see if the instruction being fetched is a branch.

The branch target address is stored in the branch target buffer, which is accessed in the IF stage.

If the branch is predicted taken,

then “next PC = branch target fetched from branch target buffer”

else “next PC = PC + 4”

The prediction bits predict whether a branch will be taken or not taken. They are determined dynamically by the hardware.
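The next-PC selection described above can be written as a short Python sketch. It models the branch target buffer and the prediction bits as plain dictionaries keyed by PC, which is a simplification for illustration. The function name `next_pc` and both dictionaries are hypothetical, and a real buffer is a fixed-size hardware table.

```python
def next_pc(pc, branch_targets, predicted_taken):
    """Choose the next fetch address during the IF stage.

    branch_targets:  dict mapping a branch's PC to its cached target
                     (stands in for the branch target buffer).
    predicted_taken: dict mapping a branch's PC to True/False
                     (stands in for the prediction bits).
    Both are illustrative assumptions, not real hardware structures.
    """
    target = branch_targets.get(pc)   # a hit means the instruction is a known branch
    if target is not None and predicted_taken.get(pc, False):
        return target                 # predicted taken: fetch from the cached target
    return pc + 4                     # not a branch, or predicted not taken


# Usage: a branch at 0x100 whose target 0x200 is already cached.
branch_targets = {0x100: 0x200}
predicted_taken = {0x100: True}
fetch_after_branch = next_pc(0x100, branch_targets, predicted_taken)      # 0x200
fetch_after_non_branch = next_pc(0x104, branch_targets, predicted_taken)  # 0x108
```

Because the target comes from the buffer in the same cycle as the fetch, a correctly predicted taken branch needs no extra cycle to compute its target.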