Rick Grehan
The Pentium currently outperforms the P6 when running 16-bit programs under Windows 3.1. Several factors combine to produce this result, among them the design of the P6 and the hangover of legacy DOS and Windows code.
As described in "Intel's P6" (April BYTE), instructions passed to the P6 are converted into equivalent microoperations that are loaded into a 40-element circular buffer. The microoperations in the buffer pass to the execution units, which process between three and five of them simultaneously, provided the data each one needs is available.
If instruction B references a particular register, and instruction A, which precedes B in program flow, writes to that register, B must wait for A to complete. Therefore, the fewer the dependencies, the faster the instructions can be delivered to the execution units.
To conserve on the P6's transistor count, Intel decided to shadow (i.e., allow multiple independent instances) the "true" registers as full 32-bit entities only. The result is that any instruction that alters any part of a register will hold up a following instruction that uses any part of the same register, even if the instructions are logically independent. An ADD AL,6 holds up a MOV BX,AX.
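The rule just described can be sketched as a toy model. This is only an illustration of the behavior as the article states it, not Intel's actual scheduling logic; the table FULL_REGISTER and the function must_wait are hypothetical names introduced here, and only the four general registers with byte-addressable halves are covered.

```python
# Toy model (an assumption for illustration, not the P6's real logic):
# every partial register name maps to the full 32-bit register that
# contains it, because only full 32-bit registers are shadowed.
FULL_REGISTER = {
    "AL": "EAX", "AH": "EAX", "AX": "EAX", "EAX": "EAX",
    "BL": "EBX", "BH": "EBX", "BX": "EBX", "EBX": "EBX",
    "CL": "ECX", "CH": "ECX", "CX": "ECX", "ECX": "ECX",
    "DL": "EDX", "DH": "EDX", "DX": "EDX", "EDX": "EDX",
}


def must_wait(earlier_writes, later_reads):
    """Return True if the later instruction is held up by the earlier one.

    earlier_writes: register names written by the earlier instruction.
    later_reads:    register names read by the later instruction.
    Any overlap at the level of the full 32-bit register counts as a
    dependency, even when the parts touched do not overlap.
    """
    written = {FULL_REGISTER[name] for name in earlier_writes}
    read = {FULL_REGISTER[name] for name in later_reads}
    return bool(written & read)


# ADD AL,6 writes AL; MOV BX,AX reads AX: both live in EAX.
print(must_wait(["AL"], ["AX"]))
# ADD AL,6 followed by MOV BX,CX: different full registers.
print(must_wait(["AL"], ["CX"]))
# A write to AL followed by a read of AH: logically independent bytes,
# but the same full register, so the model still reports a wait.
print(must_wait(["AL"], ["AH"]))
```

The third case is the costly one for legacy code: the two byte registers never share data, yet under this model the second instruction still waits.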
If this were a completely 32-bit world (as Intel's engineers had hoped it would be by now), any instruction referencing a register would be held up by, at most, one preceding instruction, and the P6 would "fire on all cylinders." Similarly, if all programs manipulated the CPU registers only 16 bits at a time, the P6 would perform well. Unfortunately, a great deal of code, especially in the DOS and Windows world, manipulates registers as 8-bit entities here, 16-bit entities there, and sometimes 32-bit entities. This "mixing" of data sizes bogs the P6 down, because it has to spend so much time "piecing" the 32-bit registers together from 8- and 16-bit subunits.
Another source of friction for the P6 arises from the ever-dreaded segment registers often manipulated in 16-bit DOS and Windows programs. Again, to skirt what would have been a tremendous multiplication of complexity, the P6 engineers elected not to virtualize the segment registers. So, whereas general CPU registers can be shadowed, only one global instance exists for each segment register. The result is that the arrival of a segment register load instruction "serializes" the CPU: No other instructions can proceed until the load completes.
Furthermore, any instructions that had already been started but appear in the program flow after the segment register load instruction must be dumped and restarted. The "tear it up and start from scratch" tactic is necessary because the source for all instructions and data following the segment load is in question.
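The "dump and restart" behavior can be pictured with a small sketch. It assumes a simplified view in which each instruction is identified by its position in program order and a set records which ones have already been started out of order; the function split_on_segment_load and the sample program are hypothetical and do not describe the P6's internal mechanism.

```python
def split_on_segment_load(load_index, started):
    """Partition already-started instructions around a segment load.

    load_index: program-order position of the segment register load.
    started:    set of program-order positions already started.
    Returns (kept, restarted): work that precedes the load survives,
    while work that follows it in program order is dumped and must be
    started again once the load completes.
    """
    kept = sorted(i for i in started if i < load_index)
    restarted = sorted(i for i in started if i > load_index)
    return kept, restarted


# Hypothetical 16-bit fragment; MOV DS,DX is the segment register load.
program = ["MOV AX,[BX]", "ADD CX,1", "MOV DS,DX", "MOV SI,[DI]", "INC BP"]
started = {0, 1, 3, 4}  # assume these were started out of order

kept, restarted = split_on_segment_load(2, started)
print("kept:     ", [program[i] for i in kept])
print("restarted:", [program[i] for i in restarted])
```

In this sketch, the two instructions after the load are thrown away even though they were already under way, which is the wasted work the article attributes to frequent segment register loads in 16-bit code.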
Ironically, none of this would be of any significance if the designers of the P6 hadn't made a few excusable miscalculations. In one of the larger mispredicted branches we've ever seen, the P6 engineers in 1990 estimated that most code today would be 32 bits, and that the standard for chip technology, including the Pentium, would be at 0.6 micron running at around 100 MHz. However, hardware again outpaced software. Today's typical PC runs a mixture of 16-bit code on 32-bit OSes. Meanwhile, the latest Pentium is produced on a 0.35-micron process and soon will run at 150 MHz.
The first P6 will not be manufactured on a 0.35-micron process, however. Instead, Intel says it will make the first P6 chips on a more conservative 0.6-micron process. Once it has worked the bugs out at 0.6 microns, Intel says it will move to a more aggressive 0.35-micron process. The company estimates there will be an eight-month period when a similarly clocked Pentium will outpace the P6 in the special circumstances we've described. But once Intel moves to 0.35-micron manufacturing, the P6 will race ahead.