esn't even do branch prediction -- the first x86 chip without that feature since 1993. At first glance, it resembles a 1980s-vintage 486.
Stranger still, the IDT-C6 is the debut product from an unknown start-up company. Centaur is a new subsidiary of Integrated Device Technology (IDT), which is a well-known manufacturer of static RAM (SRAM) chips and Rx000-series RISC processors under license from Silicon Graphics/Mips. However, IDT has not had any previous experience with the x86 architecture.
Internally, the IDT-C6 has little in common with other fifth- and sixth-generation x86 processors. Yet according to Centaur, it closely matches the performance of a multimedia extensions (MMX) Pentium when running the Winstone 97 business benchmark (37.7 versus 37.5 Winstones at 200 MHz). And as the table "Processors Compared" indicates, it has a much smaller die size than a Pentium, which means it should cost significantly less.
However, at this writing, Centaur had not yet announced prices, and BYTE was unable to verify the performance claims by running the BYTEmark suite or Bapco's Sysmarks. Although Centaur was showing samples of the IDT-C6 in May and June, final-production silicon wasn't expected until mid-August -- too late to benchmark for this issue.
When BYTE does test a production chip, the IDT-C6 will likely finish behind an identically clocked Pentium on the BYTEmarks. Although BYTEmark programs use real-world algorithms, they are still CPU-intensive synthetic benchmarks. Centaur agrees that its chip will do better with application-level benchmarks, such as the Winstone or Sysmark suites.
The reason for this is the processor's ascetic design. The IDT-C6 sacrifices raw core throughput to gain other advantages: large internal caches (32 KB each for instructions and data), high clock speeds (150, 180, and 200 MHz to start, with 225 and 240 MHz likely this fall), low power consumption (14 W maximum at 200 MHz for the desktop chip, and 7.1 to 10.6 W for the mobile chips), a tiny die size (88 square millimeters), and rapid upgrades (Centaur hopes to deliver improved versions every six to 12 months).
One at a Time
The idea of a streamlined x86 processor has been cooking for years in the mind of Glenn Henry, Centaur's president. He is a former IBM Fellow and RISC pioneer who came to IDT by way of Dell and Mips. At his last job, Henry worked on a hybrid RISC/CISC processor that could execute both the Rx000 and x86 instruction sets.
That project fizzled, but Henry took his ideas to IDT. In April 1995, Henry and his first three engineers sat down at his kitchen table in Austin, Texas, to sketch out the IDT-C6. They conceived a chip that had a single six-stage instruction pipeline. That alone was heresy. Virtually all of today's processors -- both CISC and RISC -- are superscalar devices. This means they have multiple pipelines that execute two or more instructions at once. The exceptions are low-cost embedded processors.
The decision to have only a single pipeline immediately saved millions of transistors (and the associated complexity). Superscalar processors need complex logic to control the flow of instructions through their parallel pipes. The latest CPUs -- such as Intel's Pentium II and Pentium Pro, AMD's K6, and Cyrix's 6x86MX -- can also execute multiple instructions out of order before retiring the results in original program order.
Centaur's chip is obviously a strict in-order machine, because it executes only one instruction at a time. That saves even more transistors, because it doesn't need a reorder buffer, rename registers, or the extra control logic to manage all that instruction shuffling.
Because of these design decisions, the IDT-C6 requires significantly less testing than a more complex CPU. "Trying to design and verify an out-of-order superscalar processor is a real problem for everybody, especially for an x86," notes Henry. "Only two years later, we're sampling our Pentium-class processor."
That's about half the time it takes to design and verify most other CPUs. NexGen labored for eight years on its first x86 chip. Intel is spending about five years on Merced.
The Branch Not Taken
Raising even more eyebrows among the digerati, Henry decided to omit branch prediction, too. Although this decision eliminates a branch target buffer and other related circuitry, it appears to be an odd trade-off. Branches are so common in modern code (about one for every five instructions) that it seems as if a little extra complexity could significantly boost throughput.
To understand why the company made this decision, take a closer look at the chip's pipeline, as shown in the figure "A Straightforward Pipeline." It's similar to a 486 pipeline (fetch, decode, address calculation, execute, writeback) except for an additional translate stage (stage 2). During that stage, the IDT-C6 translates x86 instructions into simpler, 33-bit-long microinstructions or retrieves microcode from its internal ROM, much as other x86 chips do. In stage 3, the chip fully decodes the instruction and accesses the registers. In stage 4, it evaluates branches.
If the program doesn't branch at this point, stage 4 takes only 1 clock cycle, so instructions keep flowing and life is beautiful. However, if the program does branch, the CPU must fetch the target instruction from the cache and herd it through the pipeline, which consumes 4 clock cycles. Most branches aren't taken, so the IDT-C6 averages about 2.5 clock cycles per branch.
By comparison, a Pentium needs only 1 clock cycle per branch if it correctly predicts the outcome. However, if a Pentium guesses wrong, it needs 4 or 5 clock cycles to recover. Henry calculates that a Pentium averages about 1.8 clock cycles per branch. In his judgment, the Pentium's extra complexity buys only a little more efficiency.
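The arithmetic behind those averages is easy to check. The sketch below is only a back-of-the-envelope model, not Centaur's or Intel's own analysis: it assumes every branch costs either a fast or a slow number of cycles and nothing in between. The function names are hypothetical; the cycle counts, the averages, and the one-branch-in-five frequency come from the figures quoted above.

```python
def expected_cycles(slow_fraction, fast_cycles, slow_cycles):
    """Average cycles per branch when slow_fraction of branches take the slow path."""
    return (1.0 - slow_fraction) * fast_cycles + slow_fraction * slow_cycles


def implied_slow_fraction(average, fast_cycles, slow_cycles):
    """Solve the same weighted average for the slow-path fraction."""
    return (average - fast_cycles) / (slow_cycles - fast_cycles)


# IDT-C6: 1 cycle if the branch is not taken, 4 cycles if it is, 2.5 on average.
c6_taken_fraction = implied_slow_fraction(2.5, 1, 4)

# Pentium: 1 cycle if predicted correctly, 4 or 5 cycles if not, 1.8 on average.
pentium_miss_if_4 = implied_slow_fraction(1.8, 1, 4)
pentium_miss_if_5 = implied_slow_fraction(1.8, 1, 5)

# With about one branch for every five instructions, the gap between the two
# chips works out to this many extra cycles per instruction.
BRANCH_FREQUENCY = 1 / 5
extra_cycles_per_instruction = BRANCH_FREQUENCY * (2.5 - 1.8)

print(f"IDT-C6 implied taken fraction: {c6_taken_fraction:.2f}")
print(f"Pentium implied mispredict rate: {pentium_miss_if_5:.2f} to {pentium_miss_if_4:.2f}")
print(f"Extra cycles per instruction: {extra_cycles_per_instruction:.2f}")
```

Under this simple model, a 2.5-cycle average corresponds to about half of all branches being taken, and the Pentium's 1.8-cycle average corresponds to a misprediction rate somewhere between 20 and 27 percent. Spread across all instructions, the 0.7-cycle difference per branch shrinks to roughly 0.14 cycles per instruction.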
For all its simplicity, the IDT-C6 still has a few tricks to speed execution. The IDT-C6 has an eight-entry call-return stack. When a program calls a subroutine, the CPU pushes the return address onto this internal stack. Most other CPUs would store and retrieve the address from memory. Centaur predicts that the IDT-C6 will save a slow memory access by pulling the address off the return stack about 90 percent of the time.
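To make the idea concrete, here is a minimal software sketch of an eight-entry return stack. It is an illustration only, not a description of Centaur's circuitry: the class and function names are hypothetical, and the overflow policy (silently dropping the oldest entry when a ninth call arrives) is an assumption, since the article doesn't say what the hardware does in that case.

```python
from collections import deque


class ReturnStack:
    """Fixed-depth stack of return addresses (hypothetical model)."""

    def __init__(self, depth=8):
        # Assumption: when the stack is full, the oldest entry is discarded.
        self._entries = deque(maxlen=depth)

    def push(self, return_address):
        """Record a return address when a call is made."""
        self._entries.append(return_address)

    def pop(self):
        """Return the most recent address, or None if it must come from memory."""
        return self._entries.pop() if self._entries else None


def count_stack_hits(call_depth, stack_depth=8):
    """Nest call_depth calls, unwind them all, and count returns served on-chip."""
    stack = ReturnStack(stack_depth)
    for level in range(call_depth):
        stack.push(0x1000 + level)  # made-up return addresses
    hits = 0
    for level in reversed(range(call_depth)):
        if stack.pop() == 0x1000 + level:
            hits += 1
    return hits


shallow_hits = count_stack_hits(5)   # every return is served from the stack
deep_hits = count_stack_hits(10)     # the two outermost returns fall back to memory
```

In this model, call chains up to eight levels deep never touch memory for a return address; only deeper nesting forces a fallback, which is consistent with a high but not perfect hit rate.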
Another special feature is a cache that holds eight entries from the page-directory table, a lookup table that x86 processors use to access memory. About 90 percent of the time, the IDT-C6 finds the pointer it needs in the cache instead of looking in the table, which saves yet another memory access. And to keep complex instructions from paralyzing the chip's lone pipeline, the IDT-C6 also has a special queue incorporated into stage 2 that lets it fetch and translate up to three instructions while executing another instruction.
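A small software model shows why such a tiny cache can pay off. The sketch below is an assumption-laden illustration, not Centaur's design: the class name, the least-recently-used replacement policy, and the sample addresses are all hypothetical. The one architectural fact it relies on is that, with standard 32-bit x86 paging, the top 10 bits of a linear address select the page-directory entry.

```python
from collections import OrderedDict


class PageDirectoryCache:
    """Eight-entry cache of page-directory entries (hypothetical LRU model)."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def lookup(self, linear_address, read_directory_entry):
        """Return the directory entry, reading memory only on a miss."""
        index = (linear_address >> 22) & 0x3FF  # top 10 bits of the address
        if index in self._entries:
            self._entries.move_to_end(index)  # mark as most recently used
            self.hits += 1
            return self._entries[index]
        self.misses += 1
        entry = read_directory_entry(index)  # stands in for the slow memory access
        self._entries[index] = entry
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the least recently used
        return entry


def fake_directory_read(index):
    """Stand-in for the in-memory page directory; the values are made up."""
    return index * 0x1000


cache = PageDirectoryCache()
# Ten accesses to consecutive 4-KB pages that all share one directory entry.
for page in range(10):
    cache.lookup(0x00400000 + page * 4096, fake_directory_read)
```

Because each directory entry covers 4 MB of address space, a program that stays within a few such regions keeps hitting the same handful of entries. In this toy trace, only the first access misses.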
In other words, the IDT-C6 isn't as primitive as it first appears. It's not just a recycled 486 chip with MMX tacked on. Rather, it's a bold attempt to quickly produce an x86 processor that offers competitive performance at an affordable price.
"We're going to get hit by all the technical journals because we don't have superscalar pipelines and out-of-order execution and all that other stuff," says Henry. "But microprocessors ought to be commodities. Our theme was to develop a chip for the common masses. This project was my labor of love."