Some microprocessors, such as the Intel 8086, have an instruction queue built into the CPU [INTEL, 1979]. The motivation is that the CPU execution cycle contains essentially four phases: execution, fetch, read, and write. Here read, write, and fetch refer to sending and receiving information from the CPU itself. In particular, write refers to the sending of information to some address that may be associated with a display, but may just be an update of a memory location. The effect of the write depends on the instruction that commands it, and the address. The read phase involves acquiring data from some address, which may be in memory or may be associated with a terminal. Instructions may or may not read and write in this general sense when they are executed, depending upon their function. The term fetch specifically refers to acquiring an instruction from memory. The crucial fact is that the CPU is otherwise idle during the fetch operation, unless it is possible to overlap the phases. With overlap, some of the phases can be done simultaneously by the hardware. The instruction queue that makes this possible is a set of 6 registers, which are storage locations, like memory cells, except that they are in the CPU itself, not in memory. (Register is the more general term; memory cells are registers used in a particular way.) The scheme is a queue, as illustrated in Figure J.1.
The IQ (Instruction Queue) is used to store the next few instructions that are expected to be executed in order. Execution of a branch operation (conditional or not) will cause the next instruction to be taken out of sequence, and so the next instruction may need to be fetched immediately and the instruction queue reloaded from the new point. Without concern for branches, however, fetches into the IQ may take place during execution of the current instruction.
A schematic trace of the overlap can be given using exec, wr, f, and rd as abbreviations for execute, write, fetch, and read, respectively. A simplified example of instructions 1,2,3,4, and 5 during eight units of time might look like Figure J.2.
The amount of time saved by the IQ mechanism depends upon a variety of factors:
The frequency of branch operations. A rough estimate is that 1 instruction in 16 is a branch.
Program PGJ.1 is appropriate at this point.
The fetch of instructions from the program store is an execution bottleneck, and so is the fetch of data, which can be relieved with the use of a cache memory. (Data, from the viewpoint of the CPU, include instructions--whatever comes from memory.) A cache memory is memory that can be accessed much more rapidly than the main store because of the way it is constructed. It costs more, else it would simply replace the main store, and so it is relatively small.
Typically, a cache memory ranges from 1K (1024 bytes) to 64K and is 5 to 10 times as fast as main memory, having perhaps 100 nsec (nanoseconds, 10-9 seconds) for an access time. Circuitry is designed into the machine to keep track of what is in the cache; it is a "cache directory," such as Figure J.3 shows.
Programs for trying it yourself
PGJ.1 Write a program that explores the effect of an IQ. Instructions are to be chosen at random, with:
100 percent requiring an execute phase and a fetch phase, of 6 and 10 units of time, respectively
25 percent requiring an additional read phase of 8 units of time
50 percent requiring an additional write phase of 8 units of time
20 percent are branch operations that empty an IQ if it is being used
PGJ.2 Write a program that explores the effect of a cache memory. A sequence of memory references are chosen to randomly call on segments in the cache memory 85 percent of the time. Two units of time are required for retrieval from the cache and 10 for direct retrieval from memory (which brings a new segment to the cache). The tail segment is twice as likely as the next to be hit; it in turn is twice as likely as the next, and so on. The overhead cost of moving a segment to the tail is one unit of time. There are six segments in the cache. Choose five sequences of 40 memory requests each, and determine the total retrieval time needed for them.
PGJ.3 Combine PGJ.1 and PGJ.2 to determine the effect of an IQ and a cache acting in concert.