How to keep the instruction prefetcher filled up
Clash Royale CLAN TAG#URR8PPP
up vote
9
down vote
favorite
My understanding is that the Intel 8088 has this buffer which reads ahead in the instruction stream whenever it has a spare bus cycle or two, so that when the time comes to execute that instruction, if you're lucky, that instruction doesn't need to be fetched from DRAM because it's already done for you. So for each instruction there are two possibilities:
The instruction has been fetched already and is ready to be decoded
The instruction has not been fetched, and therefore the processor must stall until it has
It's not just luck though, because if there has recently been a spare cycle of the appropriate kind here or there, that instruction is more likely to have been fetched.
So presumably there are/were peephole optimisers that reorder the instruction stream to increase the likelihood that some of the later instructions can be fetched ahead of time. So how does the instruction stream need to be reordered then, to decrease the latency of this instruction prefetch buffer?
instruction-set 8088
 |Â
show 8 more comments
up vote
9
down vote
favorite
My understanding is that the Intel 8088 has this buffer which reads ahead in the instruction stream whenever it has a spare bus cycle or two, so that when the time comes to execute that instruction, if you're lucky, that instruction doesn't need to be fetched from DRAM because it's already done for you. So for each instruction there are two possibilities:
The instruction has been fetched already and is ready to be decoded
The instruction has not been fetched, and therefore the processor must stall until it has
It's not just luck though, because if there has recently been a spare cycle of the appropriate kind here or there, that instruction is more likely to have been fetched.
So presumably there are/were peephole optimisers that reorder the instruction stream to increase the likelihood that some of the later instructions can be fetched ahead of time. So how does the instruction stream need to be reordered then, to decrease the latency of this instruction prefetch buffer?
instruction-set 8088
Does the same go for the 8086?
â Wilson
Aug 31 at 9:59
yes, although the 8086 queue is longer, so underruns are less likely.
â Jules
Aug 31 at 10:46
1
Optimizing for the 8088 is different from optimizing for any other processor in the series; for whatever reason, most of what was written about optimizing in that era--even when PC clones were essentially all 8088-based--would have applied to later processors, but not the 8088. On the 8088, focus on instruction length and memory accesses. If you need the value that's in DX in AX, and don't care if it stays in DX, an "XCHG AX,DX" that supposedly takes 3 cycles (but in practice takes four) will vastly outperform a "MOV AX,DX".(nominally 2 cycles, but in practice eight).
â supercat
Aug 31 at 16:10
2
@Wilson: On the 8086, instruction fetches and 16-bit accesses only take half as long as on the 8088, but the execution unit is the same speed. On the 8088, overall performance is so thoroughly dominated by memory accesses that from an optimization perspective one can often ignore practically everything else other than multiplies and divides. The 8086, however, is much more balanced, and will much more often need to wait on the execution unit, making its performance far more important.
â supercat
Aug 31 at 16:20
1
@supercat You seem to responding to something I didn't say. I was only talking about optimizations that would improve the effectiveness of the instruction prefetcher, specifically what the original poster calls "peephole optimisers that reorder the instruction stream to increase the likelihood that some of the later instructions can be fetched ahead of time." Compilers didn't do this, and near as I can tell neither did assembly language programmers, probably because it would be ineffective for the reasons you gave in your answer.
â Ross Ridge
Sep 1 at 17:52
 |Â
show 8 more comments
up vote
9
down vote
favorite
up vote
9
down vote
favorite
My understanding is that the Intel 8088 has this buffer which reads ahead in the instruction stream whenever it has a spare bus cycle or two, so that when the time comes to execute that instruction, if you're lucky, that instruction doesn't need to be fetched from DRAM because it's already done for you. So for each instruction there are two possibilities:
The instruction has been fetched already and is ready to be decoded
The instruction has not been fetched, and therefore the processor must stall until it has
It's not just luck though, because if there has recently been a spare cycle of the appropriate kind here or there, that instruction is more likely to have been fetched.
So presumably there are/were peephole optimisers that reorder the instruction stream to increase the likelihood that some of the later instructions can be fetched ahead of time. So how does the instruction stream need to be reordered then, to decrease the latency of this instruction prefetch buffer?
instruction-set 8088
My understanding is that the Intel 8088 has this buffer which reads ahead in the instruction stream whenever it has a spare bus cycle or two, so that when the time comes to execute that instruction, if you're lucky, that instruction doesn't need to be fetched from DRAM because it's already done for you. So for each instruction there are two possibilities:
The instruction has been fetched already and is ready to be decoded
The instruction has not been fetched, and therefore the processor must stall until it has
It's not just luck though, because if there has recently been a spare cycle of the appropriate kind here or there, that instruction is more likely to have been fetched.
So presumably there are/were peephole optimisers that reorder the instruction stream to increase the likelihood that some of the later instructions can be fetched ahead of time. So how does the instruction stream need to be reordered then, to decrease the latency of this instruction prefetch buffer?
instruction-set 8088
instruction-set 8088
asked Aug 31 at 9:59
Wilson
8,576437107
8,576437107
Does the same go for the 8086?
â Wilson
Aug 31 at 9:59
yes, although the 8086 queue is longer, so underruns are less likely.
â Jules
Aug 31 at 10:46
1
Optimizing for the 8088 is different from optimizing for any other processor in the series; for whatever reason, most of what was written about optimizing in that era--even when PC clones were essentially all 8088-based--would have applied to later processors, but not the 8088. On the 8088, focus on instruction length and memory accesses. If you need the value that's in DX in AX, and don't care if it stays in DX, an "XCHG AX,DX" that supposedly takes 3 cycles (but in practice takes four) will vastly outperform a "MOV AX,DX".(nominally 2 cycles, but in practice eight).
â supercat
Aug 31 at 16:10
2
@Wilson: On the 8086, instruction fetches and 16-bit accesses only take half as long as on the 8088, but the execution unit is the same speed. On the 8088, overall performance is so thoroughly dominated by memory accesses that from an optimization perspective one can often ignore practically everything else other than multiplies and divides. The 8086, however, is much more balanced, and will much more often need to wait on the execution unit, making its performance far more important.
â supercat
Aug 31 at 16:20
1
@supercat You seem to responding to something I didn't say. I was only talking about optimizations that would improve the effectiveness of the instruction prefetcher, specifically what the original poster calls "peephole optimisers that reorder the instruction stream to increase the likelihood that some of the later instructions can be fetched ahead of time." Compilers didn't do this, and near as I can tell neither did assembly language programmers, probably because it would be ineffective for the reasons you gave in your answer.
â Ross Ridge
Sep 1 at 17:52
 |Â
show 8 more comments
Does the same go for the 8086?
â Wilson
Aug 31 at 9:59
yes, although the 8086 queue is longer, so underruns are less likely.
â Jules
Aug 31 at 10:46
1
Optimizing for the 8088 is different from optimizing for any other processor in the series; for whatever reason, most of what was written about optimizing in that era--even when PC clones were essentially all 8088-based--would have applied to later processors, but not the 8088. On the 8088, focus on instruction length and memory accesses. If you need the value that's in DX in AX, and don't care if it stays in DX, an "XCHG AX,DX" that supposedly takes 3 cycles (but in practice takes four) will vastly outperform a "MOV AX,DX".(nominally 2 cycles, but in practice eight).
â supercat
Aug 31 at 16:10
2
@Wilson: On the 8086, instruction fetches and 16-bit accesses only take half as long as on the 8088, but the execution unit is the same speed. On the 8088, overall performance is so thoroughly dominated by memory accesses that from an optimization perspective one can often ignore practically everything else other than multiplies and divides. The 8086, however, is much more balanced, and will much more often need to wait on the execution unit, making its performance far more important.
â supercat
Aug 31 at 16:20
1
@supercat You seem to responding to something I didn't say. I was only talking about optimizations that would improve the effectiveness of the instruction prefetcher, specifically what the original poster calls "peephole optimisers that reorder the instruction stream to increase the likelihood that some of the later instructions can be fetched ahead of time." Compilers didn't do this, and near as I can tell neither did assembly language programmers, probably because it would be ineffective for the reasons you gave in your answer.
â Ross Ridge
Sep 1 at 17:52
Does the same go for the 8086?
â Wilson
Aug 31 at 9:59
Does the same go for the 8086?
â Wilson
Aug 31 at 9:59
yes, although the 8086 queue is longer, so underruns are less likely.
â Jules
Aug 31 at 10:46
yes, although the 8086 queue is longer, so underruns are less likely.
â Jules
Aug 31 at 10:46
1
1
Optimizing for the 8088 is different from optimizing for any other processor in the series; for whatever reason, most of what was written about optimizing in that era--even when PC clones were essentially all 8088-based--would have applied to later processors, but not the 8088. On the 8088, focus on instruction length and memory accesses. If you need the value that's in DX in AX, and don't care if it stays in DX, an "XCHG AX,DX" that supposedly takes 3 cycles (but in practice takes four) will vastly outperform a "MOV AX,DX".(nominally 2 cycles, but in practice eight).
â supercat
Aug 31 at 16:10
Optimizing for the 8088 is different from optimizing for any other processor in the series; for whatever reason, most of what was written about optimizing in that era--even when PC clones were essentially all 8088-based--would have applied to later processors, but not the 8088. On the 8088, focus on instruction length and memory accesses. If you need the value that's in DX in AX, and don't care if it stays in DX, an "XCHG AX,DX" that supposedly takes 3 cycles (but in practice takes four) will vastly outperform a "MOV AX,DX".(nominally 2 cycles, but in practice eight).
â supercat
Aug 31 at 16:10
2
2
@Wilson: On the 8086, instruction fetches and 16-bit accesses only take half as long as on the 8088, but the execution unit is the same speed. On the 8088, overall performance is so thoroughly dominated by memory accesses that from an optimization perspective one can often ignore practically everything else other than multiplies and divides. The 8086, however, is much more balanced, and will much more often need to wait on the execution unit, making its performance far more important.
â supercat
Aug 31 at 16:20
@Wilson: On the 8086, instruction fetches and 16-bit accesses only take half as long as on the 8088, but the execution unit is the same speed. On the 8088, overall performance is so thoroughly dominated by memory accesses that from an optimization perspective one can often ignore practically everything else other than multiplies and divides. The 8086, however, is much more balanced, and will much more often need to wait on the execution unit, making its performance far more important.
â supercat
Aug 31 at 16:20
1
1
@supercat You seem to responding to something I didn't say. I was only talking about optimizations that would improve the effectiveness of the instruction prefetcher, specifically what the original poster calls "peephole optimisers that reorder the instruction stream to increase the likelihood that some of the later instructions can be fetched ahead of time." Compilers didn't do this, and near as I can tell neither did assembly language programmers, probably because it would be ineffective for the reasons you gave in your answer.
â Ross Ridge
Sep 1 at 17:52
@supercat You seem to responding to something I didn't say. I was only talking about optimizations that would improve the effectiveness of the instruction prefetcher, specifically what the original poster calls "peephole optimisers that reorder the instruction stream to increase the likelihood that some of the later instructions can be fetched ahead of time." Compilers didn't do this, and near as I can tell neither did assembly language programmers, probably because it would be ineffective for the reasons you gave in your answer.
â Ross Ridge
Sep 1 at 17:52
 |Â
show 8 more comments
3 Answers
3
active
oldest
votes
up vote
12
down vote
My understanding is that the Intel 8088 has this buffer which reads ahead in the instruction stream whenever it has a spare bus cycle or two, so that when the time comes to execute that instruction, if you're lucky, that instruction doesn't need to be fetched from DRAM because it's already done for you.
The buffer isn't some kind of optional read ahead,it is the instruction buffer. Before an instruction can be executed, it has to reside fully within the buffer. Only then it can be transfered, as a whole, to the execution unit.
What differentiates it from other machines with a similar fetch logic is that it doesn't start fetching when a new instruction is needed, but whenever there is empty storage within and there is a memory cycle without an outstanding data read or write request.
The instruction has been fetched already and is ready to be decoded
It will be already partially decoded by the BIU (Bus Interface Unit), for example to decide about its length.
The instruction has not been fetched, and therefore the processor must stall until it has
And the third one is a partially fetched instruction. Just it isn't a stall, it only means that now instruction fetches get priority over outstanding data fetches and the EU (Execution Unit) waits for the next one to be fetched in full.
It's not just luck though, because if there has recently been a spare cycle of the appropriate kind here or there, that instruction is more likely to have been fetched.
And there are many such cycles where the EU is busy but there is no outstanding data request to be handled.
So presumably there are/were peephole optimisers that reorder the instruction stream to increase the likelihood that some of the later instructions can be fetched ahead of time.
No need for such bloaty software as the likelihood for such 'holes' is rather large. Just keep in mind, every address calculation on the x86 takes 6-12 clock cycles (T-states). So as soon as there is a memory operand (i.e. not immediate or register), already one fetch can be made. Plus, of course, the cycle time for the instruction itself, which is anywhere between 2 (register-to-register move) and 133 (16-bit multiplication) clocks.
Having the EU wait for the BIU isn't anything special or a problem and by no means a fault of the BIU or software. It just means that the memory bus is saturated by the combined need of transfer for instructions and data into and out of the CPU, beyond the amount of cycles needed to compute. In fact, this is the desired state. Useful bus utilization is at maximum. In comparison, even the much acclaimed pipelining of the 6502 results in up to 25% waste cycles where the bus stalls (*1).
So, unless the instruction stream is made up of all register-to-register instructions of the simple kind, there will be plenty of room for instruction fetch. And at least on a 8086, even then it rarely runs empty, as these instructions are (mostly) 2 or 3 cycles with 1 or 2 bytes encoding. 1-Byte encoding with two-cycle execution means the bus is fully occupied with instruction fetch (2 bytes at once) while the EU is fully occupied with executing them - two (single byte) instructions per two byte fetch. It needs a stream of several two-byte register-to-register instructions to have it run low. Assuming a filled buffer in the beginning, it will need a sequence of 5 two byte instructions until the EU has to wait for one cycle. Since even a simple register-to-memory compare (like in a search loop) is already 9 cycles, there isn't much to fear.
Now on the 8088, the situation is worse due the limited bus size delivering only half the bandwidth. Here a sequence of several register-to-register instructions will make the EU wait for new instructions, as its 5-byte buffer will run empty soon. Again, this is not something that happens a lot in average code, as memory address calculation (and more complex instructions than such simple) will leave enough space for fetching; even with just half the bandwidth the effective throughput does not fall down to 50% (of an 8086) but still delivers 60-70%. Not bad for requiring only half the RAM and ROM chips :))
It's always helpful to keep in mind that this is just a buffer for one instruction at maximum length, not some kind of cache or the like. All it does is improve the performance and allow more complex instructions while decreasing chip complexity to handle them. Instead of having fixed 'points' marked within an instruction to fetch (parts of) the next instruction, it's an abstraction layer, so a mechanic independent of the execution can handle fetch.
Bottom line, a 8086/88 EU waiting for fetches is just one utilizing the bus at maximum, thus achieving already the highest possible speed.
So how does the instruction stream need to be reordered then, to decrease the latency of this instruction prefetch buffer?
Mix in some memory access, at best with way complicated addressing, and there will be plenty of room to keep the buffer filled ahead of time. That's also why compiler code - which is usually way more memory focused than assembly - rarely runs into this barrier. Of course, this means in turn that it is not using the full memory bandwidth like hand written Assembly does :))
Does the same go for the 8086?
(From a comment by Wilson)
Yes, as described above. On the 8086 it's just even harder to get it to stall. The buffer on a 8086 is 6 bytes instead of 5 for the 8088. The maximum instruction in original 8086/88 code is 5 bytes long (*2), thus the buffer only needs to be 5 bytes to deliver a whole instruction. Since the 8086 operates on a 16 bit bus, fetches will always be two bytes at once, thus filling in a full instruction can result in 6 bytes fetched.
*1 - Most of them due the second cycle of a single byte instruction - much the same way as with an 8086, except for completely different reasons.
*2 - No, except for REP, no prefix is part of the instruction when the buffer is considered, as segment or lock prefixes are instructions for the BIU, not the EU.
So does the buffer always hold exactly one instruction?
â Wilson
Aug 31 at 18:02
@Wilson: The buffer holds as many instructions as have been fetched and will fit.
â supercat
Aug 31 at 20:34
1
@Wilson No. It holds as many asfit in up to 5(6) bytes. Its size is made to hold even the longest possible instruction. And an instruction gets dispatched to the EU as soon as it is complete (and the EU can accept one). But it will not stop fetching when an instruction is complete, but use up otehrwise unused cycles until the buffer is full. This smoothes out combinations of fast and slow long and short instructions.
â Raffzahn
Aug 31 at 21:43
@Raffzahn: I think the prefetch limit on the 8088 is four bytes, rather than five. Segment prefixes aren't really treated as parts of the succeeding instruction, but more like special instructions that set a "use alternate segment" flag and don't allow interrupts to fire until after the next instruction executes. I think this changed in later processors, but IIRC on the 8088 if code executes e.g.SS:REP MOVSB
the source operand will come from the stack segment unless or until an interrupt occurs, but after execution resumes the instruction will resume processing asREP MOVSB
.
â supercat
Sep 1 at 17:02
1
@supercat - the 80C88 (which I believe is functionally identical) definitely specifies 5 bytes in its data sheet. It also states that the prefetch algorithm of the 8086 (which only fetches when there are 2 bytes free in the queue) is modified to always fetch whenever possible, so it should be able to fill all 5 bytes. As to the execution of 6 byte instructions, my understanding is that the immediate operand isn't required in the buffer, as it is always fetched as part of execution (which is why timings for instructions with an immediate operand are longer than register-register ops).
â Jules
Sep 2 at 23:00
 |Â
show 2 more comments
up vote
6
down vote
The 8088 and 8086 are microcoded CPUs, and need multiple cycles to execute each instruction. The fastest instructions take at least 2 cycles to execute, and most take much longer. Any instruction that accesses memory takes at least 8 cycles, and often more like 15-20. On the other hand memory accesses occupy the bus for 4 cycles.
It therefore follows that the only instructions that deplete the queue without chance to refill it are those short 2-3 cycle instructions. These are basically only register-to-register operations, excluding multiplies and divides. Register/immediate instructions take 4 cycles, but also fetch an extra instruction byte, so also deplete the queue.
As long as you disperse reasonable numbers of longer operations (including anything that involves a memory access, or multiply/divide operations) in the instruction stream, it should be unlikely that the instruction buffer underruns other than immediately after a jump.
1
Instructions that access memory tie up the memory bus with the accesses they perform. The time for something like "ADD [1234],AX" is listed as 24+EA, with EA listed as 6, for a total of 30. That seems like that should be enough time to fill the prefetch queue, but the instruction takes four bytes and performs four other memory accesses, tying up the bus interface unit for 32 cycles.
â supercat
Aug 31 at 16:37
add a comment |Â
up vote
5
down vote
On the 8088 (which was used in the original IBM PC and early clones thereof), but not on any of the other processors used in subsequent machines, a typical two-byte instruction will take eight cycles to fetch and 2-3 cycles to execute. Because of this disparity between instruction-fetch time and execution time, the memory bus will be almost never be idle except when executing long instructions like multiply/divide. Thus, in many cases the simplest and most accurate way of the estimating execution time for a piece of code is to simply count the memory operations and multiply by four. There are some cases that require more precise analysis of the timing relationship between the execution and bus-interface units, but in most of those the simplest way to determine execution time is to feed the code to an 8088 and measure it.
Interestingly, many books from the era when the 8088 was the dominant processor ignore instruction-fetch times. Consequently, they give optimization advice which would tend to be reasonable on future machines using the 8086 and 80286, but is in many cases just plain wrong on the 8088. For example, some such books suggest that if one needs to move DX to AX or vice versa, and don't care about what happens to the other register, using "XCHG AX,DX" will be smaller but slower than using "MOV AX,DX" or "MOV DX,AX". While that may have be a speed/space trade-off on some later CPUs, on the 8088 used in the PC the "slower" form is in practice almost always faster.
Note that even though effective address computation times may appear long, they are rarely sufficient to give the bus unit any idle time. An instruction like "add [bx+1234],ax" is four bytes long, which means it will take 16 cycles to fetch, and it will have to perform four bus operations (tying up the memory unit another 16 cycles) when it executes. One might look at the execution-unit time and think it's so huge that the bus unit would catch up, but since the instruction requires performing 8 bus operations, the time required for the effective address calculation really doesn't matter. If one did nothing but memory operations using short instructions, the bus interface unit might catch up with the execution unit (many short instructions that access memory have execution times that are a cycle or two longer than the time required to perform all the memory accesses) but most two-cycle instructions that don't access memory give the execution unit a 5-6 cycle advantage, so for most typical instruction mixes the memory-access times will dominate.
2
It's probably worth mentioning that the 8088 was the CPU in the original IBM PC, and there were far more PCs with 8088s than 8086s.
â Ross Ridge
Aug 31 at 19:35
add a comment |Â
3 Answers
3
active
oldest
votes
3 Answers
3
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
12
down vote
My understanding is that the Intel 8088 has this buffer which reads ahead in the instruction stream whenever it has a spare bus cycle or two, so that when the time comes to execute that instruction, if you're lucky, that instruction doesn't need to be fetched from DRAM because it's already done for you.
The buffer isn't some kind of optional read ahead,it is the instruction buffer. Before an instruction can be executed, it has to reside fully within the buffer. Only then it can be transfered, as a whole, to the execution unit.
What differentiates it from other machines with a similar fetch logic is that it doesn't start fetching when a new instruction is needed, but whenever there is empty storage within and there is a memory cycle without an outstanding data read or write request.
The instruction has been fetched already and is ready to be decoded
It will be already partially decoded by the BIU (Bus Interface Unit), for example to decide about its length.
The instruction has not been fetched, and therefore the processor must stall until it has
And the third one is a partially fetched instruction. Just it isn't a stall, it only means that now instruction fetches get priority over outstanding data fetches and the EU (Execution Unit) waits for the next one to be fetched in full.
It's not just luck though, because if there has recently been a spare cycle of the appropriate kind here or there, that instruction is more likely to have been fetched.
And there are many such cycles where the EU is busy but there is no outstanding data request to be handled.
So presumably there are/were peephole optimisers that reorder the instruction stream to increase the likelihood that some of the later instructions can be fetched ahead of time.
No need for such bloaty software as the likelihood for such 'holes' is rather large. Just keep in mind, every address calculation on the x86 takes 6-12 clock cycles (T-states). So as soon as there is a memory operand (i.e. not immediate or register), already one fetch can be made. Plus, of course, the cycle time for the instruction itself, which is anywhere between 2 (register-to-register move) and 133 (16-bit multiplication) clocks.
Having the EU wait for the BIU isn't anything special or a problem and by no means a fault of the BIU or software. It just means that the memory bus is saturated by the combined need of transfer for instructions and data into and out of the CPU, beyond the amount of cycles needed to compute. In fact, this is the desired state. Useful bus utilization is at maximum. In comparison, even the much acclaimed pipelining of the 6502 results in up to 25% waste cycles where the bus stalls (*1).
So, unless the instruction stream is made up of all register-to-register instructions of the simple kind, there will be plenty of room for instruction fetch. And at least on a 8086, even then it rarely runs empty, as these instructions are (mostly) 2 or 3 cycles with 1 or 2 bytes encoding. 1-Byte encoding with two-cycle execution means the bus is fully occupied with instruction fetch (2 bytes at once) while the EU is fully occupied with executing them - two (single byte) instructions per two byte fetch. It needs a stream of several two-byte register-to-register instructions to have it run low. Assuming a filled buffer in the beginning, it will need a sequence of 5 two byte instructions until the EU has to wait for one cycle. Since even a simple register-to-memory compare (like in a search loop) is already 9 cycles, there isn't much to fear.
Now on the 8088, the situation is worse due the limited bus size delivering only half the bandwidth. Here a sequence of several register-to-register instructions will make the EU wait for new instructions, as its 5-byte buffer will run empty soon. Again, this is not something that happens a lot in average code, as memory address calculation (and more complex instructions than such simple) will leave enough space for fetching; even with just half the bandwidth the effective throughput does not fall down to 50% (of an 8086) but still delivers 60-70%. Not bad for requiring only half the RAM and ROM chips :))
It's always helpful to keep in mind that this is just a buffer for one instruction at maximum length, not some kind of cache or the like. All it does is improve the performance and allow more complex instructions while decreasing chip complexity to handle them. Instead of having fixed 'points' marked within an instruction to fetch (parts of) the next instruction, it's an abstraction layer, so a mechanic independent of the execution can handle fetch.
Bottom line, a 8086/88 EU waiting for fetches is just one utilizing the bus at maximum, thus achieving already the highest possible speed.
So how does the instruction stream need to be reordered then, to decrease the latency of this instruction prefetch buffer?
Mix in some memory access, at best with way complicated addressing, and there will be plenty of room to keep the buffer filled ahead of time. That's also why compiler code - which is usually way more memory focused than assembly - rarely runs into this barrier. Of course, this means in turn that it is not using the full memory bandwidth like hand written Assembly does :))
Does the same go for the 8086?
(From a comment by Wilson)
Yes, as described above. On the 8086 it's just even harder to get it to stall. The buffer on a 8086 is 6 bytes instead of 5 for the 8088. The maximum instruction in original 8086/88 code is 5 bytes long (*2), thus the buffer only needs to be 5 bytes to deliver a whole instruction. Since the 8086 operates on a 16 bit bus, fetches will always be two bytes at once, thus filling in a full instruction can result in 6 bytes fetched.
*1 - Most of them due the second cycle of a single byte instruction - much the same way as with an 8086, except for completely different reasons.
*2 - No, except for REP, no prefix is part of the instruction when the buffer is considered, as segment or lock prefixes are instructions for the BIU, not the EU.
So does the buffer always hold exactly one instruction?
â Wilson
Aug 31 at 18:02
@Wilson: The buffer holds as many instructions as have been fetched and will fit.
â supercat
Aug 31 at 20:34
1
@Wilson No. It holds as many asfit in up to 5(6) bytes. Its size is made to hold even the longest possible instruction. And an instruction gets dispatched to the EU as soon as it is complete (and the EU can accept one). But it will not stop fetching when an instruction is complete, but use up otehrwise unused cycles until the buffer is full. This smoothes out combinations of fast and slow long and short instructions.
â Raffzahn
Aug 31 at 21:43
@Raffzahn: I think the prefetch limit on the 8088 is four bytes, rather than five. Segment prefixes aren't really treated as parts of the succeeding instruction, but more like special instructions that set a "use alternate segment" flag and don't allow interrupts to fire until after the next instruction executes. I think this changed in later processors, but IIRC on the 8088 if code executes e.g.SS:REP MOVSB
the source operand will come from the stack segment unless or until an interrupt occurs, but after execution resumes the instruction will resume processing asREP MOVSB
.
â supercat
Sep 1 at 17:02
1
@supercat - the 80C88 (which I believe is functionally identical) definitely specifies 5 bytes in its data sheet. It also states that the prefetch algorithm of the 8086 (which only fetches when there are 2 bytes free in the queue) is modified to always fetch whenever possible, so it should be able to fill all 5 bytes. As to the execution of 6 byte instructions, my understanding is that the immediate operand isn't required in the buffer, as it is always fetched as part of execution (which is why timings for instructions with an immediate operand are longer than register-register ops).
â Jules
Sep 2 at 23:00
 |Â
show 2 more comments
up vote
12
down vote
My understanding is that the Intel 8088 has this buffer which reads ahead in the instruction stream whenever it has a spare bus cycle or two, so that when the time comes to execute that instruction, if you're lucky, that instruction doesn't need to be fetched from DRAM because it's already done for you.
The buffer isn't some kind of optional read ahead,it is the instruction buffer. Before an instruction can be executed, it has to reside fully within the buffer. Only then it can be transfered, as a whole, to the execution unit.
What differentiates it from other machines with a similar fetch logic is that it doesn't start fetching when a new instruction is needed, but whenever there is empty storage within and there is a memory cycle without an outstanding data read or write request.
The instruction has been fetched already and is ready to be decoded
It will be already partially decoded by the BIU (Bus Interface Unit), for example to decide about its length.
The instruction has not been fetched, and therefore the processor must stall until it has
And the third one is a partially fetched instruction. Just it isn't a stall, it only means that now instruction fetches get priority over outstanding data fetches and the EU (Execution Unit) waits for the next one to be fetched in full.
It's not just luck though, because if there has recently been a spare cycle of the appropriate kind here or there, that instruction is more likely to have been fetched.
And there are many such cycles where the EU is busy but there is no outstanding data request to be handled.
So presumably there are/were peephole optimisers that reorder the instruction stream to increase the likelihood that some of the later instructions can be fetched ahead of time.
No need for such bloaty software as the likelihood for such 'holes' is rather large. Just keep in mind, every address calculation on the x86 takes 6-12 clock cycles (T-states). So as soon as there is a memory operand (i.e. not immediate or register), already one fetch can be made. Plus, of course, the cycle time for the instruction itself, which is anywhere between 2 (register-to-register move) and 133 (16-bit multiplication) clocks.
Having the EU wait for the BIU isn't anything special or a problem and by no means a fault of the BIU or software. It just means that the memory bus is saturated by the combined need of transfer for instructions and data into and out of the CPU, beyond the amount of cycles needed to compute. In fact, this is the desired state. Useful bus utilization is at maximum. In comparison, even the much acclaimed pipelining of the 6502 results in up to 25% waste cycles where the bus stalls (*1).
So, unless the instruction stream is made up of all register-to-register instructions of the simple kind, there will be plenty of room for instruction fetch. And at least on a 8086, even then it rarely runs empty, as these instructions are (mostly) 2 or 3 cycles with 1 or 2 bytes encoding. 1-Byte encoding with two-cycle execution means the bus is fully occupied with instruction fetch (2 bytes at once) while the EU is fully occupied with executing them - two (single byte) instructions per two byte fetch. It needs a stream of several two-byte register-to-register instructions to have it run low. Assuming a filled buffer in the beginning, it will need a sequence of 5 two byte instructions until the EU has to wait for one cycle. Since even a simple register-to-memory compare (like in a search loop) is already 9 cycles, there isn't much to fear.
Now on the 8088, the situation is worse due the limited bus size delivering only half the bandwidth. Here a sequence of several register-to-register instructions will make the EU wait for new instructions, as its 5-byte buffer will run empty soon. Again, this is not something that happens a lot in average code, as memory address calculation (and more complex instructions than such simple) will leave enough space for fetching; even with just half the bandwidth the effective throughput does not fall down to 50% (of an 8086) but still delivers 60-70%. Not bad for requiring only half the RAM and ROM chips :))
It's always helpful to keep in mind that this is just a buffer for one instruction at maximum length, not some kind of cache or the like. All it does is improve the performance and allow more complex instructions while decreasing chip complexity to handle them. Instead of having fixed 'points' marked within an instruction to fetch (parts of) the next instruction, it's an abstraction layer, so a mechanic independent of the execution can handle fetch.
Bottom line, a 8086/88 EU waiting for fetches is just one utilizing the bus at maximum, thus achieving already the highest possible speed.
So how does the instruction stream need to be reordered then, to decrease the latency of this instruction prefetch buffer?
Mix in some memory access, at best with way complicated addressing, and there will be plenty of room to keep the buffer filled ahead of time. That's also why compiler code - which is usually way more memory focused than assembly - rarely runs into this barrier. Of course, this means in turn that it is not using the full memory bandwidth like hand written Assembly does :))
Does the same go for the 8086?
(From a comment by Wilson)
Yes, as described above. On the 8086 it's just even harder to get it to stall. The buffer on a 8086 is 6 bytes instead of 5 for the 8088. The maximum instruction in original 8086/88 code is 5 bytes long (*2), thus the buffer only needs to be 5 bytes to deliver a whole instruction. Since the 8086 operates on a 16 bit bus, fetches will always be two bytes at once, thus filling in a full instruction can result in 6 bytes fetched.
*1 - Most of them due the second cycle of a single byte instruction - much the same way as with an 8086, except for completely different reasons.
*2 - No, except for REP, no prefix is part of the instruction when the buffer is considered, as segment or lock prefixes are instructions for the BIU, not the EU.
So does the buffer always hold exactly one instruction?
â Wilson
Aug 31 at 18:02
@Wilson: The buffer holds as many instructions as have been fetched and will fit.
â supercat
Aug 31 at 20:34
1
@Wilson No. It holds as many asfit in up to 5(6) bytes. Its size is made to hold even the longest possible instruction. And an instruction gets dispatched to the EU as soon as it is complete (and the EU can accept one). But it will not stop fetching when an instruction is complete, but use up otehrwise unused cycles until the buffer is full. This smoothes out combinations of fast and slow long and short instructions.
â Raffzahn
Aug 31 at 21:43
@Raffzahn: I think the prefetch limit on the 8088 is four bytes, rather than five. Segment prefixes aren't really treated as parts of the succeeding instruction, but more like special instructions that set a "use alternate segment" flag and don't allow interrupts to fire until after the next instruction executes. I think this changed in later processors, but IIRC on the 8088 if code executes e.g.SS:REP MOVSB
the source operand will come from the stack segment unless or until an interrupt occurs, but after execution resumes the instruction will resume processing asREP MOVSB
.
â supercat
Sep 1 at 17:02
1
@supercat - the 80C88 (which I believe is functionally identical) definitely specifies 5 bytes in its data sheet. It also states that the prefetch algorithm of the 8086 (which only fetches when there are 2 bytes free in the queue) is modified to always fetch whenever possible, so it should be able to fill all 5 bytes. As to the execution of 6 byte instructions, my understanding is that the immediate operand isn't required in the buffer, as it is always fetched as part of execution (which is why timings for instructions with an immediate operand are longer than register-register ops).
â Jules
Sep 2 at 23:00
 |Â
show 2 more comments
up vote
12
down vote
up vote
12
down vote
My understanding is that the Intel 8088 has this buffer which reads ahead in the instruction stream whenever it has a spare bus cycle or two, so that when the time comes to execute that instruction, if you're lucky, that instruction doesn't need to be fetched from DRAM because it's already done for you.
The buffer isn't some kind of optional read ahead,it is the instruction buffer. Before an instruction can be executed, it has to reside fully within the buffer. Only then it can be transfered, as a whole, to the execution unit.
What differentiates it from other machines with a similar fetch logic is that it doesn't start fetching when a new instruction is needed, but whenever there is empty storage within and there is a memory cycle without an outstanding data read or write request.
The instruction has been fetched already and is ready to be decoded
It will be already partially decoded by the BIU (Bus Interface Unit), for example to decide about its length.
The instruction has not been fetched, and therefore the processor must stall until it has
And the third one is a partially fetched instruction. Just it isn't a stall, it only means that now instruction fetches get priority over outstanding data fetches and the EU (Execution Unit) waits for the next one to be fetched in full.
It's not just luck though, because if there has recently been a spare cycle of the appropriate kind here or there, that instruction is more likely to have been fetched.
And there are many such cycles where the EU is busy but there is no outstanding data request to be handled.
So presumably there are/were peephole optimisers that reorder the instruction stream to increase the likelihood that some of the later instructions can be fetched ahead of time.
No need for such bloaty software as the likelihood for such 'holes' is rather large. Just keep in mind, every address calculation on the x86 takes 6-12 clock cycles (T-states). So as soon as there is a memory operand (i.e. not immediate or register), already one fetch can be made. Plus, of course, the cycle time for the instruction itself, which is anywhere between 2 (register-to-register move) and 133 (16-bit multiplication) clocks.
Having the EU wait for the BIU isn't anything special or a problem and by no means a fault of the BIU or software. It just means that the memory bus is saturated by the combined need of transfer for instructions and data into and out of the CPU, beyond the amount of cycles needed to compute. In fact, this is the desired state. Useful bus utilization is at maximum. In comparison, even the much acclaimed pipelining of the 6502 results in up to 25% waste cycles where the bus stalls (*1).
So, unless the instruction stream is made up of all register-to-register instructions of the simple kind, there will be plenty of room for instruction fetch. And at least on a 8086, even then it rarely runs empty, as these instructions are (mostly) 2 or 3 cycles with 1 or 2 bytes encoding. 1-Byte encoding with two-cycle execution means the bus is fully occupied with instruction fetch (2 bytes at once) while the EU is fully occupied with executing them - two (single byte) instructions per two byte fetch. It needs a stream of several two-byte register-to-register instructions to have it run low. Assuming a filled buffer in the beginning, it will need a sequence of 5 two byte instructions until the EU has to wait for one cycle. Since even a simple register-to-memory compare (like in a search loop) is already 9 cycles, there isn't much to fear.
Now on the 8088, the situation is worse due the limited bus size delivering only half the bandwidth. Here a sequence of several register-to-register instructions will make the EU wait for new instructions, as its 5-byte buffer will run empty soon. Again, this is not something that happens a lot in average code, as memory address calculation (and more complex instructions than such simple) will leave enough space for fetching; even with just half the bandwidth the effective throughput does not fall down to 50% (of an 8086) but still delivers 60-70%. Not bad for requiring only half the RAM and ROM chips :))
It's always helpful to keep in mind that this is just a buffer for one instruction at maximum length, not some kind of cache or the like. All it does is improve the performance and allow more complex instructions while decreasing chip complexity to handle them. Instead of having fixed 'points' marked within an instruction to fetch (parts of) the next instruction, it's an abstraction layer, so a mechanic independent of the execution can handle fetch.
Bottom line, a 8086/88 EU waiting for fetches is just one utilizing the bus at maximum, thus achieving already the highest possible speed.
So how does the instruction stream need to be reordered then, to decrease the latency of this instruction prefetch buffer?
Mix in some memory access, at best with way complicated addressing, and there will be plenty of room to keep the buffer filled ahead of time. That's also why compiler code - which is usually way more memory focused than assembly - rarely runs into this barrier. Of course, this means in turn that it is not using the full memory bandwidth like hand written Assembly does :))
Does the same go for the 8086?
(From a comment by Wilson)
Yes, as described above. On the 8086 it's just even harder to get it to stall. The buffer on a 8086 is 6 bytes instead of 5 for the 8088. The maximum instruction in original 8086/88 code is 5 bytes long (*2), thus the buffer only needs to be 5 bytes to deliver a whole instruction. Since the 8086 operates on a 16 bit bus, fetches will always be two bytes at once, thus filling in a full instruction can result in 6 bytes fetched.
*1 - Most of them due the second cycle of a single byte instruction - much the same way as with an 8086, except for completely different reasons.
*2 - No, except for REP, no prefix is part of the instruction when the buffer is considered, as segment or lock prefixes are instructions for the BIU, not the EU.
My understanding is that the Intel 8088 has this buffer which reads ahead in the instruction stream whenever it has a spare bus cycle or two, so that when the time comes to execute that instruction, if you're lucky, that instruction doesn't need to be fetched from DRAM because it's already done for you.
The buffer isn't some kind of optional read ahead,it is the instruction buffer. Before an instruction can be executed, it has to reside fully within the buffer. Only then it can be transfered, as a whole, to the execution unit.
What differentiates it from other machines with a similar fetch logic is that it doesn't start fetching when a new instruction is needed, but whenever there is empty storage within and there is a memory cycle without an outstanding data read or write request.
The instruction has been fetched already and is ready to be decoded
It will be already partially decoded by the BIU (Bus Interface Unit), for example to decide about its length.
The instruction has not been fetched, and therefore the processor must stall until it has
And the third one is a partially fetched instruction. Just it isn't a stall, it only means that now instruction fetches get priority over outstanding data fetches and the EU (Execution Unit) waits for the next one to be fetched in full.
It's not just luck though, because if there has recently been a spare cycle of the appropriate kind here or there, that instruction is more likely to have been fetched.
And there are many such cycles where the EU is busy but there is no outstanding data request to be handled.
So presumably there are/were peephole optimisers that reorder the instruction stream to increase the likelihood that some of the later instructions can be fetched ahead of time.
No need for such bloaty software as the likelihood for such 'holes' is rather large. Just keep in mind, every address calculation on the x86 takes 6-12 clock cycles (T-states). So as soon as there is a memory operand (i.e. not immediate or register), already one fetch can be made. Plus, of course, the cycle time for the instruction itself, which is anywhere between 2 (register-to-register move) and 133 (16-bit multiplication) clocks.
Having the EU wait for the BIU isn't anything special or a problem and by no means a fault of the BIU or software. It just means that the memory bus is saturated by the combined need of transfer for instructions and data into and out of the CPU, beyond the amount of cycles needed to compute. In fact, this is the desired state. Useful bus utilization is at maximum. In comparison, even the much acclaimed pipelining of the 6502 results in up to 25% waste cycles where the bus stalls (*1).
So, unless the instruction stream is made up of all register-to-register instructions of the simple kind, there will be plenty of room for instruction fetch. And at least on a 8086, even then it rarely runs empty, as these instructions are (mostly) 2 or 3 cycles with 1 or 2 bytes encoding. 1-Byte encoding with two-cycle execution means the bus is fully occupied with instruction fetch (2 bytes at once) while the EU is fully occupied with executing them - two (single byte) instructions per two byte fetch. It needs a stream of several two-byte register-to-register instructions to have it run low. Assuming a filled buffer in the beginning, it will need a sequence of 5 two byte instructions until the EU has to wait for one cycle. Since even a simple register-to-memory compare (like in a search loop) is already 9 cycles, there isn't much to fear.
Now on the 8088, the situation is worse due the limited bus size delivering only half the bandwidth. Here a sequence of several register-to-register instructions will make the EU wait for new instructions, as its 5-byte buffer will run empty soon. Again, this is not something that happens a lot in average code, as memory address calculation (and more complex instructions than such simple) will leave enough space for fetching; even with just half the bandwidth the effective throughput does not fall down to 50% (of an 8086) but still delivers 60-70%. Not bad for requiring only half the RAM and ROM chips :))
It's always helpful to keep in mind that this is just a buffer for one instruction at maximum length, not some kind of cache or the like. All it does is improve the performance and allow more complex instructions while decreasing chip complexity to handle them. Instead of having fixed 'points' marked within an instruction to fetch (parts of) the next instruction, it's an abstraction layer, so a mechanic independent of the execution can handle fetch.
Bottom line, a 8086/88 EU waiting for fetches is just one utilizing the bus at maximum, thus achieving already the highest possible speed.
So how does the instruction stream need to be reordered then, to decrease the latency of this instruction prefetch buffer?
Mix in some memory access, at best with way complicated addressing, and there will be plenty of room to keep the buffer filled ahead of time. That's also why compiler code - which is usually way more memory focused than assembly - rarely runs into this barrier. Of course, this means in turn that it is not using the full memory bandwidth like hand written Assembly does :))
Does the same go for the 8086?
(From a comment by Wilson)
Yes, as described above. On the 8086 it's just even harder to get it to stall. The buffer on a 8086 is 6 bytes instead of 5 for the 8088. The maximum instruction in original 8086/88 code is 5 bytes long (*2), thus the buffer only needs to be 5 bytes to deliver a whole instruction. Since the 8086 operates on a 16 bit bus, fetches will always be two bytes at once, thus filling in a full instruction can result in 6 bytes fetched.
*1 - Most of them due the second cycle of a single byte instruction - much the same way as with an 8086, except for completely different reasons.
*2 - No, except for REP, no prefix is part of the instruction when the buffer is considered, as segment or lock prefixes are instructions for the BIU, not the EU.
edited Aug 31 at 16:32
Wilson
8,576437107
8,576437107
answered Aug 31 at 11:39
Raffzahn
35.2k478139
35.2k478139
So does the buffer always hold exactly one instruction?
â Wilson
Aug 31 at 18:02
@Wilson: The buffer holds as many instructions as have been fetched and will fit.
â supercat
Aug 31 at 20:34
1
@Wilson No. It holds as many asfit in up to 5(6) bytes. Its size is made to hold even the longest possible instruction. And an instruction gets dispatched to the EU as soon as it is complete (and the EU can accept one). But it will not stop fetching when an instruction is complete, but use up otehrwise unused cycles until the buffer is full. This smoothes out combinations of fast and slow long and short instructions.
â Raffzahn
Aug 31 at 21:43
@Raffzahn: I think the prefetch limit on the 8088 is four bytes, rather than five. Segment prefixes aren't really treated as parts of the succeeding instruction, but more like special instructions that set a "use alternate segment" flag and don't allow interrupts to fire until after the next instruction executes. I think this changed in later processors, but IIRC on the 8088 if code executes e.g.SS:REP MOVSB
the source operand will come from the stack segment unless or until an interrupt occurs, but after execution resumes the instruction will resume processing asREP MOVSB
.
â supercat
Sep 1 at 17:02
1
@supercat - the 80C88 (which I believe is functionally identical) definitely specifies 5 bytes in its data sheet. It also states that the prefetch algorithm of the 8086 (which only fetches when there are 2 bytes free in the queue) is modified to always fetch whenever possible, so it should be able to fill all 5 bytes. As to the execution of 6 byte instructions, my understanding is that the immediate operand isn't required in the buffer, as it is always fetched as part of execution (which is why timings for instructions with an immediate operand are longer than register-register ops).
â Jules
Sep 2 at 23:00
 |Â
show 2 more comments
So does the buffer always hold exactly one instruction?
â Wilson
Aug 31 at 18:02
@Wilson: The buffer holds as many instructions as have been fetched and will fit.
â supercat
Aug 31 at 20:34
1
@Wilson No. It holds as many asfit in up to 5(6) bytes. Its size is made to hold even the longest possible instruction. And an instruction gets dispatched to the EU as soon as it is complete (and the EU can accept one). But it will not stop fetching when an instruction is complete, but use up otehrwise unused cycles until the buffer is full. This smoothes out combinations of fast and slow long and short instructions.
â Raffzahn
Aug 31 at 21:43
@Raffzahn: I think the prefetch limit on the 8088 is four bytes, rather than five. Segment prefixes aren't really treated as parts of the succeeding instruction, but more like special instructions that set a "use alternate segment" flag and don't allow interrupts to fire until after the next instruction executes. I think this changed in later processors, but IIRC on the 8088 if code executes e.g.SS:REP MOVSB
the source operand will come from the stack segment unless or until an interrupt occurs, but after execution resumes the instruction will resume processing asREP MOVSB
.
â supercat
Sep 1 at 17:02
1
@supercat - the 80C88 (which I believe is functionally identical) definitely specifies 5 bytes in its data sheet. It also states that the prefetch algorithm of the 8086 (which only fetches when there are 2 bytes free in the queue) is modified to always fetch whenever possible, so it should be able to fill all 5 bytes. As to the execution of 6 byte instructions, my understanding is that the immediate operand isn't required in the buffer, as it is always fetched as part of execution (which is why timings for instructions with an immediate operand are longer than register-register ops).
â Jules
Sep 2 at 23:00
So does the buffer always hold exactly one instruction?
â Wilson
Aug 31 at 18:02
So does the buffer always hold exactly one instruction?
â Wilson
Aug 31 at 18:02
@Wilson: The buffer holds as many instructions as have been fetched and will fit.
â supercat
Aug 31 at 20:34
@Wilson: The buffer holds as many instructions as have been fetched and will fit.
â supercat
Aug 31 at 20:34
1
1
@Wilson No. It holds as many asfit in up to 5(6) bytes. Its size is made to hold even the longest possible instruction. And an instruction gets dispatched to the EU as soon as it is complete (and the EU can accept one). But it will not stop fetching when an instruction is complete, but use up otehrwise unused cycles until the buffer is full. This smoothes out combinations of fast and slow long and short instructions.
â Raffzahn
Aug 31 at 21:43
@Wilson No. It holds as many asfit in up to 5(6) bytes. Its size is made to hold even the longest possible instruction. And an instruction gets dispatched to the EU as soon as it is complete (and the EU can accept one). But it will not stop fetching when an instruction is complete, but use up otehrwise unused cycles until the buffer is full. This smoothes out combinations of fast and slow long and short instructions.
â Raffzahn
Aug 31 at 21:43
@Raffzahn: I think the prefetch limit on the 8088 is four bytes, rather than five. Segment prefixes aren't really treated as parts of the succeeding instruction, but more like special instructions that set a "use alternate segment" flag and don't allow interrupts to fire until after the next instruction executes. I think this changed in later processors, but IIRC on the 8088 if code executes e.g.
SS:REP MOVSB
the source operand will come from the stack segment unless or until an interrupt occurs, but after execution resumes the instruction will resume processing as REP MOVSB
.â supercat
Sep 1 at 17:02
@Raffzahn: I think the prefetch limit on the 8088 is four bytes, rather than five. Segment prefixes aren't really treated as parts of the succeeding instruction, but more like special instructions that set a "use alternate segment" flag and don't allow interrupts to fire until after the next instruction executes. I think this changed in later processors, but IIRC on the 8088 if code executes e.g.
SS:REP MOVSB
the source operand will come from the stack segment unless or until an interrupt occurs, but after execution resumes the instruction will resume processing as REP MOVSB
.â supercat
Sep 1 at 17:02
1
1
@supercat - the 80C88 (which I believe is functionally identical) definitely specifies 5 bytes in its data sheet. It also states that the prefetch algorithm of the 8086 (which only fetches when there are 2 bytes free in the queue) is modified to always fetch whenever possible, so it should be able to fill all 5 bytes. As to the execution of 6 byte instructions, my understanding is that the immediate operand isn't required in the buffer, as it is always fetched as part of execution (which is why timings for instructions with an immediate operand are longer than register-register ops).
â Jules
Sep 2 at 23:00
@supercat - the 80C88 (which I believe is functionally identical) definitely specifies 5 bytes in its data sheet. It also states that the prefetch algorithm of the 8086 (which only fetches when there are 2 bytes free in the queue) is modified to always fetch whenever possible, so it should be able to fill all 5 bytes. As to the execution of 6 byte instructions, my understanding is that the immediate operand isn't required in the buffer, as it is always fetched as part of execution (which is why timings for instructions with an immediate operand are longer than register-register ops).
â Jules
Sep 2 at 23:00
 |Â
show 2 more comments
up vote
6
down vote
The 8088 and 8086 are microcoded CPUs, and need multiple cycles to execute each instruction. The fastest instructions take at least 2 cycles to execute, and most take much longer. Any instruction that accesses memory takes at least 8 cycles, and often more like 15-20. On the other hand memory accesses occupy the bus for 4 cycles.
It therefore follows that the only instructions that deplete the queue without chance to refill it are those short 2-3 cycle instructions. These are basically only register-to-register operations, excluding multiplies and divides. Register/immediate instructions take 4 cycles, but also fetch an extra instruction byte, so also deplete the queue.
As long as you disperse reasonable numbers of longer operations (including anything that involves a memory access, or multiply/divide operations) in the instruction stream, it should be unlikely that the instruction buffer underruns other than immediately after a jump.
1
Instructions that access memory tie up the memory bus with the accesses they perform. The time for something like "ADD [1234],AX" is listed as 24+EA, with EA listed as 6, for a total of 30. That seems like that should be enough time to fill the prefetch queue, but the instruction takes four bytes and performs four other memory accesses, tying up the bus interface unit for 32 cycles.
â supercat
Aug 31 at 16:37
add a comment |Â
up vote
6
down vote
The 8088 and 8086 are microcoded CPUs, and need multiple cycles to execute each instruction. The fastest instructions take at least 2 cycles to execute, and most take much longer. Any instruction that accesses memory takes at least 8 cycles, and often more like 15-20. On the other hand memory accesses occupy the bus for 4 cycles.
It therefore follows that the only instructions that deplete the queue without chance to refill it are those short 2-3 cycle instructions. These are basically only register-to-register operations, excluding multiplies and divides. Register/immediate instructions take 4 cycles, but also fetch an extra instruction byte, so also deplete the queue.
As long as you disperse reasonable numbers of longer operations (including anything that involves a memory access, or multiply/divide operations) in the instruction stream, it should be unlikely that the instruction buffer underruns other than immediately after a jump.
1
Instructions that access memory tie up the memory bus with the accesses they perform. The time for something like "ADD [1234],AX" is listed as 24+EA, with EA listed as 6, for a total of 30. That seems like that should be enough time to fill the prefetch queue, but the instruction takes four bytes and performs four other memory accesses, tying up the bus interface unit for 32 cycles.
â supercat
Aug 31 at 16:37
add a comment |Â
up vote
6
down vote
up vote
6
down vote
The 8088 and 8086 are microcoded CPUs, and need multiple cycles to execute each instruction. The fastest instructions take at least 2 cycles to execute, and most take much longer. Any instruction that accesses memory takes at least 8 cycles, and often more like 15-20. On the other hand memory accesses occupy the bus for 4 cycles.
It therefore follows that the only instructions that deplete the queue without chance to refill it are those short 2-3 cycle instructions. These are basically only register-to-register operations, excluding multiplies and divides. Register/immediate instructions take 4 cycles, but also fetch an extra instruction byte, so also deplete the queue.
As long as you disperse reasonable numbers of longer operations (including anything that involves a memory access, or multiply/divide operations) in the instruction stream, it should be unlikely that the instruction buffer underruns other than immediately after a jump.
The 8088 and 8086 are microcoded CPUs, and need multiple cycles to execute each instruction. The fastest instructions take at least 2 cycles to execute, and most take much longer. Any instruction that accesses memory takes at least 8 cycles, and often more like 15-20. On the other hand memory accesses occupy the bus for 4 cycles.
It therefore follows that the only instructions that deplete the queue without chance to refill it are those short 2-3 cycle instructions. These are basically only register-to-register operations, excluding multiplies and divides. Register/immediate instructions take 4 cycles, but also fetch an extra instruction byte, so also deplete the queue.
As long as you disperse reasonable numbers of longer operations (including anything that involves a memory access, or multiply/divide operations) in the instruction stream, it should be unlikely that the instruction buffer underruns other than immediately after a jump.
answered Aug 31 at 11:13
Jules
7,61812139
7,61812139
1
Instructions that access memory tie up the memory bus with the accesses they perform. The time for something like "ADD [1234],AX" is listed as 24+EA, with EA listed as 6, for a total of 30. That seems like that should be enough time to fill the prefetch queue, but the instruction takes four bytes and performs four other memory accesses, tying up the bus interface unit for 32 cycles.
â supercat
Aug 31 at 16:37
add a comment |Â
1
Instructions that access memory tie up the memory bus with the accesses they perform. The time for something like "ADD [1234],AX" is listed as 24+EA, with EA listed as 6, for a total of 30. That seems like that should be enough time to fill the prefetch queue, but the instruction takes four bytes and performs four other memory accesses, tying up the bus interface unit for 32 cycles.
â supercat
Aug 31 at 16:37
1
1
Instructions that access memory tie up the memory bus with the accesses they perform. The time for something like "ADD [1234],AX" is listed as 24+EA, with EA listed as 6, for a total of 30. That seems like that should be enough time to fill the prefetch queue, but the instruction takes four bytes and performs four other memory accesses, tying up the bus interface unit for 32 cycles.
â supercat
Aug 31 at 16:37
Instructions that access memory tie up the memory bus with the accesses they perform. The time for something like "ADD [1234],AX" is listed as 24+EA, with EA listed as 6, for a total of 30. That seems like that should be enough time to fill the prefetch queue, but the instruction takes four bytes and performs four other memory accesses, tying up the bus interface unit for 32 cycles.
â supercat
Aug 31 at 16:37
add a comment |Â
up vote
5
down vote
On the 8088 (which was used in the original IBM PC and early clones thereof), but not on any of the other processors used in subsequent machines, a typical two-byte instruction will take eight cycles to fetch and 2-3 cycles to execute. Because of this disparity between instruction-fetch time and execution time, the memory bus will be almost never be idle except when executing long instructions like multiply/divide. Thus, in many cases the simplest and most accurate way of the estimating execution time for a piece of code is to simply count the memory operations and multiply by four. There are some cases that require more precise analysis of the timing relationship between the execution and bus-interface units, but in most of those the simplest way to determine execution time is to feed the code to an 8088 and measure it.
Interestingly, many books from the era when the 8088 was the dominant processor ignore instruction-fetch times. Consequently, they give optimization advice which would tend to be reasonable on future machines using the 8086 and 80286, but is in many cases just plain wrong on the 8088. For example, some such books suggest that if one needs to move DX to AX or vice versa, and don't care about what happens to the other register, using "XCHG AX,DX" will be smaller but slower than using "MOV AX,DX" or "MOV DX,AX". While that may have be a speed/space trade-off on some later CPUs, on the 8088 used in the PC the "slower" form is in practice almost always faster.
Note that even though effective address computation times may appear long, they are rarely sufficient to give the bus unit any idle time. An instruction like "add [bx+1234],ax" is four bytes long, which means it will take 16 cycles to fetch, and it will have to perform four bus operations (tying up the memory unit another 16 cycles) when it executes. One might look at the execution-unit time and think it's so huge that the bus unit would catch up, but since the instruction requires performing 8 bus operations, the time required for the effective address calculation really doesn't matter. If one did nothing but memory operations using short instructions, the bus interface unit might catch up with the execution unit (many short instructions that access memory have execution times that are a cycle or two longer than the time required to perform all the memory accesses) but most two-cycle instructions that don't access memory give the execution unit a 5-6 cycle advantage, so for most typical instruction mixes the memory-access times will dominate.
2
It's probably worth mentioning that the 8088 was the CPU in the original IBM PC, and there were far more PCs with 8088s than 8086s.
â Ross Ridge
Aug 31 at 19:35
add a comment |Â
up vote
5
down vote
On the 8088 (which was used in the original IBM PC and early clones thereof), but not on any of the other processors used in subsequent machines, a typical two-byte instruction will take eight cycles to fetch and 2-3 cycles to execute. Because of this disparity between instruction-fetch time and execution time, the memory bus will be almost never be idle except when executing long instructions like multiply/divide. Thus, in many cases the simplest and most accurate way of the estimating execution time for a piece of code is to simply count the memory operations and multiply by four. There are some cases that require more precise analysis of the timing relationship between the execution and bus-interface units, but in most of those the simplest way to determine execution time is to feed the code to an 8088 and measure it.
Interestingly, many books from the era when the 8088 was the dominant processor ignore instruction-fetch times. Consequently, they give optimization advice which would tend to be reasonable on future machines using the 8086 and 80286, but is in many cases just plain wrong on the 8088. For example, some such books suggest that if one needs to move DX to AX or vice versa, and don't care about what happens to the other register, using "XCHG AX,DX" will be smaller but slower than using "MOV AX,DX" or "MOV DX,AX". While that may have be a speed/space trade-off on some later CPUs, on the 8088 used in the PC the "slower" form is in practice almost always faster.
Note that even though effective address computation times may appear long, they are rarely sufficient to give the bus unit any idle time. An instruction like "add [bx+1234],ax" is four bytes long, which means it will take 16 cycles to fetch, and it will have to perform four bus operations (tying up the memory unit another 16 cycles) when it executes. One might look at the execution-unit time and think it's so huge that the bus unit would catch up, but since the instruction requires performing 8 bus operations, the time required for the effective address calculation really doesn't matter. If one did nothing but memory operations using short instructions, the bus interface unit might catch up with the execution unit (many short instructions that access memory have execution times that are a cycle or two longer than the time required to perform all the memory accesses) but most two-cycle instructions that don't access memory give the execution unit a 5-6 cycle advantage, so for most typical instruction mixes the memory-access times will dominate.
2
It's probably worth mentioning that the 8088 was the CPU in the original IBM PC, and there were far more PCs with 8088s than 8086s.
â Ross Ridge
Aug 31 at 19:35
add a comment |Â
up vote
5
down vote
up vote
5
down vote
On the 8088 (which was used in the original IBM PC and early clones thereof), but not on any of the other processors used in subsequent machines, a typical two-byte instruction will take eight cycles to fetch and 2-3 cycles to execute. Because of this disparity between instruction-fetch time and execution time, the memory bus will be almost never be idle except when executing long instructions like multiply/divide. Thus, in many cases the simplest and most accurate way of the estimating execution time for a piece of code is to simply count the memory operations and multiply by four. There are some cases that require more precise analysis of the timing relationship between the execution and bus-interface units, but in most of those the simplest way to determine execution time is to feed the code to an 8088 and measure it.
Interestingly, many books from the era when the 8088 was the dominant processor ignore instruction-fetch times. Consequently, they give optimization advice which would tend to be reasonable on future machines using the 8086 and 80286, but is in many cases just plain wrong on the 8088. For example, some such books suggest that if one needs to move DX to AX or vice versa, and don't care about what happens to the other register, using "XCHG AX,DX" will be smaller but slower than using "MOV AX,DX" or "MOV DX,AX". While that may have be a speed/space trade-off on some later CPUs, on the 8088 used in the PC the "slower" form is in practice almost always faster.
Note that even though effective address computation times may appear long, they are rarely sufficient to give the bus unit any idle time. An instruction like "add [bx+1234],ax" is four bytes long, which means it will take 16 cycles to fetch, and it will have to perform four bus operations (tying up the memory unit another 16 cycles) when it executes. One might look at the execution-unit time and think it's so huge that the bus unit would catch up, but since the instruction requires performing 8 bus operations, the time required for the effective address calculation really doesn't matter. If one did nothing but memory operations using short instructions, the bus interface unit might catch up with the execution unit (many short instructions that access memory have execution times that are a cycle or two longer than the time required to perform all the memory accesses) but most two-cycle instructions that don't access memory give the execution unit a 5-6 cycle advantage, so for most typical instruction mixes the memory-access times will dominate.
On the 8088 (which was used in the original IBM PC and early clones thereof), but not on any of the other processors used in subsequent machines, a typical two-byte instruction will take eight cycles to fetch and 2-3 cycles to execute. Because of this disparity between instruction-fetch time and execution time, the memory bus will be almost never be idle except when executing long instructions like multiply/divide. Thus, in many cases the simplest and most accurate way of the estimating execution time for a piece of code is to simply count the memory operations and multiply by four. There are some cases that require more precise analysis of the timing relationship between the execution and bus-interface units, but in most of those the simplest way to determine execution time is to feed the code to an 8088 and measure it.
Interestingly, many books from the era when the 8088 was the dominant processor ignore instruction-fetch times. Consequently, they give optimization advice which would tend to be reasonable on future machines using the 8086 and 80286, but is in many cases just plain wrong on the 8088. For example, some such books suggest that if one needs to move DX to AX or vice versa, and don't care about what happens to the other register, using "XCHG AX,DX" will be smaller but slower than using "MOV AX,DX" or "MOV DX,AX". While that may have be a speed/space trade-off on some later CPUs, on the 8088 used in the PC the "slower" form is in practice almost always faster.
Note that even though effective address computation times may appear long, they are rarely sufficient to give the bus unit any idle time. An instruction like "add [bx+1234],ax" is four bytes long, which means it will take 16 cycles to fetch, and it will have to perform four bus operations (tying up the memory unit another 16 cycles) when it executes. One might look at the execution-unit time and think it's so huge that the bus unit would catch up, but since the instruction requires performing 8 bus operations, the time required for the effective address calculation really doesn't matter. If one did nothing but memory operations using short instructions, the bus interface unit might catch up with the execution unit (many short instructions that access memory have execution times that are a cycle or two longer than the time required to perform all the memory accesses) but most two-cycle instructions that don't access memory give the execution unit a 5-6 cycle advantage, so for most typical instruction mixes the memory-access times will dominate.
edited Aug 31 at 19:50
answered Aug 31 at 16:02
supercat
5,448532
5,448532
2
It's probably worth mentioning that the 8088 was the CPU in the original IBM PC, and there were far more PCs with 8088s than 8086s.
â Ross Ridge
Aug 31 at 19:35
add a comment |Â
2
It's probably worth mentioning that the 8088 was the CPU in the original IBM PC, and there were far more PCs with 8088s than 8086s.
â Ross Ridge
Aug 31 at 19:35
2
2
It's probably worth mentioning that the 8088 was the CPU in the original IBM PC, and there were far more PCs with 8088s than 8086s.
â Ross Ridge
Aug 31 at 19:35
It's probably worth mentioning that the 8088 was the CPU in the original IBM PC, and there were far more PCs with 8088s than 8086s.
â Ross Ridge
Aug 31 at 19:35
add a comment |Â
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fretrocomputing.stackexchange.com%2fquestions%2f7440%2fhow-to-keep-the-instruction-prefetcher-filled-up%23new-answer', 'question_page');
);
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Does the same go for the 8086?
â Wilson
Aug 31 at 9:59
yes, although the 8086 queue is longer, so underruns are less likely.
â Jules
Aug 31 at 10:46
1
Optimizing for the 8088 is different from optimizing for any other processor in the series; for whatever reason, most of what was written about optimizing in that era--even when PC clones were essentially all 8088-based--would have applied to later processors, but not the 8088. On the 8088, focus on instruction length and memory accesses. If you need the value that's in DX in AX, and don't care if it stays in DX, an "XCHG AX,DX" that supposedly takes 3 cycles (but in practice takes four) will vastly outperform a "MOV AX,DX".(nominally 2 cycles, but in practice eight).
â supercat
Aug 31 at 16:10
2
@Wilson: On the 8086, instruction fetches and 16-bit accesses only take half as long as on the 8088, but the execution unit is the same speed. On the 8088, overall performance is so thoroughly dominated by memory accesses that from an optimization perspective one can often ignore practically everything else other than multiplies and divides. The 8086, however, is much more balanced, and will much more often need to wait on the execution unit, making its performance far more important.
â supercat
Aug 31 at 16:20
1
@supercat You seem to responding to something I didn't say. I was only talking about optimizations that would improve the effectiveness of the instruction prefetcher, specifically what the original poster calls "peephole optimisers that reorder the instruction stream to increase the likelihood that some of the later instructions can be fetched ahead of time." Compilers didn't do this, and near as I can tell neither did assembly language programmers, probably because it would be ineffective for the reasons you gave in your answer.
â Ross Ridge
Sep 1 at 17:52