
AMD's 29030 Microprocessor


January 1995 / Core Technologies / AMD's 29030 Microprocessor

Based on the proven 29K RISC core, this processor is ideal for embedded applications

Floyd Goodrich

You can find Advanced Micro Devices' family of 29000 RISC microprocessors and integrated RISC microcontrollers throughout several large segments of the office automation market. They are used in printers (both color and black-and-white) and function as both the printers' controllers and rasterizers. In telecommunications work, they act as central office switches. They can be found in networks, controlling routers and hubs and managing RAID arrays for storage applications. The high performance of these 29000 RISC processors, especially their ability to manipulate and transfer data at high speeds, makes them an ideal fit in these markets.

These 29000 processors, better known as the 29K family, comprise three product lines. First are the general-purpose RISC processors, which use a modified Harvard architecture and have separate buses for instruction and data accesses. Next is a series of microprocessors that have on-chip caches and thus require only a single bus for instruction and data fetches. The final group is a series of integrated RISC microcontrollers. Before I delve into the details of a specific processor, I'll provide a basic description of the 29K microarchitecture. Later, I'll use that information to expand on the feature set of the AMD 29030, a RISC microprocessor tailored for demanding embedded applications.

29K Roots

All 29K family members use the same 32-bit core microarchitecture and compatible object code. The microarchitecture's instruction set consists of fixed-length 32-bit instructions. The 29K core supports many standard RISC features, such as pipelining, load overlapping and forwarding, and architectural parallelism. But the 29K core also includes several unique high-performance features, such as a large register file and fast interrupt handling.

Among the 29K microarchitecture's standard RISC features is a four-stage (fetch, decode, execute, and write back) pipeline that can process, on average, an instruction every 1.26 clock cycles for typical application code. Like all RISC processors, the 29K relies on a load/store architecture. This lets a compiler or a programmer minimize or eliminate pipeline stalls by scheduling memory accesses within the instruction stream.
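As a rough worked example of what the 1.26-cycle figure implies, the following Python sketch converts a clock rate and an average cycles-per-instruction value into a native instruction rate. The function name is illustrative, and the clock rates are the 20-, 25-, and 33-MHz grades quoted for the 29030 elsewhere in this article. The result is plain arithmetic on those numbers, not a measured benchmark.

```python
def native_mips(clock_mhz, cycles_per_instruction):
    """Average millions of native instructions per second for a clock and CPI."""
    if clock_mhz <= 0 or cycles_per_instruction <= 0:
        raise ValueError("clock rate and CPI must be positive")
    return clock_mhz / cycles_per_instruction


CPI_TYPICAL = 1.26  # average cycles per instruction quoted for typical code

for mhz in (20, 25, 33):
    print(f"{mhz} MHz -> {native_mips(mhz, CPI_TYPICAL):.1f} native MIPS")
```

At 33 MHz this works out to roughly 26 million native instructions per second, assuming the quoted average holds for the code being run.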

An on-chip, four-instruction prefetch buffer decouples the pipeline execution speed from the memory speed. Most instruction accesses come from the instruction cache, but when the processor must fetch instructions from memory, the prefetch buffer minimizes delays. By sending the first instruction fetched from memory directly to the pipeline instead of waiting for all four instructions within a cache block to arrive, the prefetch buffer minimizes the latency associated with instruction-cache misses.

In addition, the 29K core has methods for allowing instructions to continue executing while memory accesses occur. Appropriately scheduling load and store operations is critical for high-performance operation, because over a quarter of all instructions involve memory accesses. To promote efficiency, the 29K core uses overlapping and forwarding techniques to eliminate pipeline stalls that would otherwise arise when the pipeline has to wait for data. With proper scheduling, the 29K core can typically maintain single-cycle execution through load instructions, provided the results of the load operation are not immediately required. Interlocks between pipeline stages ensure that this parallelism cannot result in the incorrect operation of the 29K core. In addition, the pipeline receives the results of a load instruction as soon as the bus interface latches the data; it need not wait for an entire buffer to fill before the critical instruction is sent to the pipeline.
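The effect of load scheduling can be illustrated with a small Python model of an in-order pipeline with interlocks. This is a simplified sketch, not a cycle-accurate model of the 29K: the two-cycle load latency, the register names, and the instruction tuples are all assumptions chosen for illustration. ALU results are assumed to be forwarded so that they are usable by the very next instruction.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    op: str       # "load" or any ALU mnemonic
    dest: str     # destination register
    srcs: tuple   # source registers


LOAD_LATENCY = 2  # assumed: loaded data is usable 2 cycles after the load issues


def count_cycles(program, load_latency=LOAD_LATENCY):
    """Count issue cycles for an in-order, single-issue pipeline with interlocks."""
    ready_at = {}  # register -> first cycle in which its value can be read
    cycle = 0
    for ins in program:
        # Interlock: hold the instruction until every source operand is ready.
        issue = max([cycle] + [ready_at.get(r, 0) for r in ins.srcs])
        latency = load_latency if ins.op == "load" else 1
        ready_at[ins.dest] = issue + latency
        cycle = issue + 1
    return cycle


# The add needs r1 immediately after the load, so the interlock stalls it.
unscheduled = [
    Instr("load", "r1", ("r10",)),
    Instr("add", "r2", ("r1", "r3")),
    Instr("sub", "r4", ("r5", "r6")),
]

# Moving the independent sub between the load and the add hides the latency.
scheduled = [unscheduled[0], unscheduled[2], unscheduled[1]]

print(count_cycles(unscheduled), count_cycles(scheduled))
```

Under these assumptions the unscheduled sequence takes four cycles and the scheduled one takes three. Both orderings produce the same results, which is the role the interlocks play.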

Finally, the 29K core uses three-operand instructions: Its instructions use two registers as sources, and a third register serves as the destination of the operation. This differs from a two-operand instruction architecture, in which numerous register-to-register move instructions must be added to a RISC program simply to preserve one of the source operands.

Because the source operands are preserved, fewer instructions are required to combat data destruction, and 29K processor programs are more compact. Also, the three-operand instruction format closely matches compiler-generated data structures, making it a more natural fit for compiler-generated code.
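To make the instruction-count difference concrete, here is a minimal Python sketch that lowers the same operations into a three-operand form and a two-operand form. The mnemonics and register names are generic placeholders, not actual 29K assembly syntax, and the two-operand lowering is deliberately naive.

```python
def lower_three_operand(ops):
    """Each (op, dest, src1, src2) tuple maps to exactly one instruction."""
    return [f"{op} {dest}, {s1}, {s2}" for op, dest, s1, s2 in ops]


def lower_two_operand(ops):
    """In two-operand form the destination doubles as the first source,
    so src1 must first be copied into dest to keep src1 intact."""
    out = []
    for op, dest, s1, s2 in ops:
        if dest == s2 and dest != s1:
            # Copying src1 into dest would destroy src2; a temporary register
            # would be needed, which this sketch does not model.
            raise ValueError("dest must differ from src2 in this sketch")
        if dest != s1:
            out.append(f"mov {dest}, {s1}")
        out.append(f"{op} {dest}, {s2}")
    return out


# d = a + b and e = a - b, with a and b still needed afterward
ops = [("add", "d", "a", "b"), ("sub", "e", "a", "b")]

print(lower_three_operand(ops))
print(lower_two_operand(ops))
```

For this pair of operations the three-operand form needs two instructions, while the two-operand form needs four because each result requires an extra move.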

Unique Core Details

The characteristics mentioned thus far are standard fare that all RISC processors use to improve throughput. But the 29K core has several unique features that boost performance. First, it has a huge register file. Other RISC processors might have a register file with 32 or 64 entries, but the 29K has a 192-entry register file. This enables a compiler to assign all of a procedure's local variables to registers, avoiding the penalty of using load/store operations to store these variables in RAM. To maximize performance, this register file is triple-ported, which allows it to supply two source operands and receive one destination operand at the same time and which makes access faster than it would be from an on-chip cache. In addition, the register file is available to an earlier pipeline stage (the decode stage) than a cache would be, which shaves the fetch stage from the pipeline. But the greatest benefit of the large register file is the elimination of save-and-restore code on procedure calls. Removing this type of code can improve procedure-call performance by as much as a factor of 10.

The 29K core does not save the state of a machine when an interrupt or exception occurs, which makes interrupt-handling routines extremely fast. A systems programmer can decide to write code to save the state of the machine or elect not to preserve the machine's state and offer fast interrupt service.

The Am29030

The Am29030 incorporates the 29K core, along with certain performance-enhancing features. The 29030 uses the 29K family's 32-bit architecture and is implemented in CMOS. It has clock speeds of 20, 25, and 33 MHz. The 29030 has an on-chip, 8-KB, two-way set-associative instruction cache; an integrated memory management unit; and scalable clocking that lets you get high performance using low-cost memory. These and other features make the 29030 attractive to the embedded-control market.
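The cache geometry follows from these numbers. The Python sketch below derives the number of sets and shows how a 32-bit address would split into tag, index, and offset fields. It assumes a 16-byte block (four 32-bit instructions, matching the four-instruction cache block mentioned in the prefetch discussion) and a conventional low-order-bit index; the actual indexing scheme of the 29030 is not described in this article, so treat the field layout as illustrative.

```python
CACHE_BYTES = 8 * 1024   # 8-KB instruction cache
WAYS = 2                 # two-way set-associative
BLOCK_BYTES = 4 * 4      # assumed: four 32-bit instructions per block

NUM_SETS = CACHE_BYTES // (WAYS * BLOCK_BYTES)
OFFSET_BITS = BLOCK_BYTES.bit_length() - 1   # log2 of a power of two
INDEX_BITS = NUM_SETS.bit_length() - 1


def split_address(addr):
    """Split a 32-bit byte address into (tag, set index, byte offset)."""
    offset = addr & (BLOCK_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


print(NUM_SETS, OFFSET_BITS, INDEX_BITS)
print(split_address(0x00012345))
```

With these assumptions there are 256 sets, so two addresses that are 4 KB apart fall in the same set, and the two ways let both blocks stay resident at once.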

Traditionally, the embedded-control market has had a fixed set of requirements for success. First is object-code compatibility. The time-to-market requirements of embedded applications place incredible pressure on software engineers to create good, stable code in a minimum of time. Object-code compatibility thus ensures that a new project can reuse field-tested procedures drawn from a stable of reliable and well-understood program code.

Another key to success in the embedded market is restricting the use of peripheral interface hardware. Embedded control applications are generally cost-sensitive--so much so that the design should not demand that additional money be spent on components. A bus interface that is simple yet supports high-speed transfers is highly desirable for these applications.

The 29030 bus supports accesses to 8-, 16-, and 32-bit instruction memory and accesses to 16- or 32-bit data memory. This lets a system designer select the appropriate memory width, given the performance and cost constraints. For example, to achieve the highest performance, a designer might have the 29030 copy a program out of inexpensive 8-bit ROMs into 32-bit memory, then execute the program in RAM.
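The ROM-to-RAM copy amounts to assembling four consecutive bytes into each 32-bit word. The Python sketch below models only that byte assembly; on real hardware the bus interface and boot code do this work, and the function name and the choice of byte order here are assumptions for illustration.

```python
def pack_words(rom_bytes, byteorder="big"):
    """Assemble a byte-wide ROM image into a list of 32-bit words."""
    if len(rom_bytes) % 4:
        raise ValueError("image length must be a multiple of 4 bytes")
    return [
        int.from_bytes(rom_bytes[i:i + 4], byteorder)
        for i in range(0, len(rom_bytes), 4)
    ]


image = bytes([0x12, 0x34, 0x56, 0x78, 0x9A, 0xBC, 0xDE, 0xF0])

print([hex(w) for w in pack_words(image)])            # big-endian assembly
print([hex(w) for w in pack_words(image, "little")])  # little-endian assembly
```

Once the image has been copied, each instruction fetch from 32-bit RAM takes one access instead of the four that byte-wide ROM would need.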

The 29030 supports burst transfers up to 1 KB in length. In these transfers, the processor can achieve single-cycle transfers of 32 bits, to or from memory. This high sustained transfer rate lets you fetch instructions quickly even when using inexpensive paged-mode DRAM, and it also supports fast software-controlled transfers of data to and from inexpensive bursting memory. Because data accesses can be big-endian or little-endian, the 29030 can be connected to a variety of peripherals. The 29030 uses three lines to support conventional and burst transfers. Interface complexity is further reduced by using two synchronous buses: an address bus of 32 lines and a data/instruction bus of 32 lines (see the figure). This reduces the board area required by the processor and the number of bus connections; it also lowers the parts count for the memory subsystem. Because the 29030 bus interface is straightforward, a hardware designer does not need to spend a great amount of money or time on system glue logic.
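The burst figures translate into a peak transfer rate with simple arithmetic, sketched below in Python. The function names are illustrative. The calculation assumes one 32-bit word per bus cycle for the whole burst and ignores the initial access latency, DRAM page misses, and refresh, so it gives an upper bound rather than a sustained system figure.

```python
BYTES_PER_TRANSFER = 4    # one 32-bit word per bus cycle during a burst
MAX_BURST_BYTES = 1024    # 1-KB maximum burst length


def burst_words(burst_bytes=MAX_BURST_BYTES):
    """Number of 32-bit transfers in a burst of the given size."""
    return burst_bytes // BYTES_PER_TRANSFER


def peak_mbytes_per_s(bus_mhz):
    """Peak rate in millions of bytes per second at one word per bus cycle."""
    return bus_mhz * BYTES_PER_TRANSFER


def burst_time_us(bus_mhz, burst_bytes=MAX_BURST_BYTES):
    """Microseconds to move one burst, excluding the first-access latency."""
    return burst_words(burst_bytes) / bus_mhz


print(burst_words())                   # transfers in a maximum-length burst
print(peak_mbytes_per_s(33))           # peak rate with a 33-MHz bus
print(round(burst_time_us(33), 2))     # time for a full 1-KB burst
```

A maximum-length burst is 256 word transfers, and a 33-MHz bus peaks at 132 million bytes per second under these assumptions.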

The scalable clock is an on-chip phase-locked loop that lets the 29030 processor run internally at full speed (say, 33 MHz) while the external bus runs at half speed (16.67 MHz). In this configuration, the processor, with its instruction cache, provides high performance but uses low-cost memory. As with other parts of the 29030, the choice rests with the system designer: use scalable clocking with less expensive memory to contain costs, or use faster memory to obtain maximum performance.
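The benefit for the memory system shows up as a longer bus cycle. The Python sketch below computes the bus clock and the bus cycle time for a given core clock and divisor. The function names are illustrative, the 33.33-MHz value is an assumption consistent with the 16.67-MHz half-speed figure, and the only divisor described in this article is 2.

```python
def bus_clock_mhz(core_mhz, divisor=1):
    """External bus clock when the core clock is divided down."""
    if divisor < 1:
        raise ValueError("divisor must be at least 1")
    return core_mhz / divisor


def bus_cycle_ns(core_mhz, divisor=1):
    """Length of one bus cycle in nanoseconds."""
    return 1000.0 / bus_clock_mhz(core_mhz, divisor)


CORE_MHZ = 33.33  # assumed exact value behind the nominal 33-MHz grade

print(round(bus_cycle_ns(CORE_MHZ, 1), 1))  # full-speed bus cycle
print(round(bus_cycle_ns(CORE_MHZ, 2), 1))  # half-speed bus cycle
```

Halving the bus clock doubles the time available for each memory access, from about 30 ns to about 60 ns, which is what makes slower and cheaper memory usable while the core and its instruction cache keep running at full speed.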

By squeezing high performance from inexpensive memory, the 29030 achieves an attractive system cost-to-performance ratio. It is important to note that the cost-to-performance ratio should be based on the cost of the entire system, not simply on the processor's cost. Processors cheaper than the 29030 are on the market, but they require more expensive system components and memory, which drives up the total cost of the system.

Another unique feature of the 29030 is traceable caching, which lets an emulator or any generic postprocessor reconstruct a real-time code trace that is visible in the cache. Typically, a system engineer must turn off a processor's on-chip cache to force visible transactions on the bus for debugging purposes. This is inadequate for quickly solving complex program interactions involving on-chip caches. The 29030 implements cache tracing by using a second 29030 on an ICE (in-circuit emulator) acting as a slave while the 29030 on the controller board acts as the master. The master processor executes the program and generates all bus transactions for both processors. The slave processor executes the same instructions but uses its address bus to drive all cached branch target addresses. This lets the ICE read these addresses and construct a trace of program flow inside the cache. To ensure prompt visibility of a cache and improve the debugging process, AMD expects to deploy this technology on all 29K products containing on-chip caches.

Future Directions

One further advantage of the 29030 is a simple upgrade path. You can move up to the next-generation microprocessor, the Am29040, without changing the design of a system. The 29040 adds an on-chip data cache and hardware integer multiplier, in addition to running at higher frequencies internally. Because the 29030 and the 29040 are object-code compatible, you can run the same software. The 29040 is also bus-compatible and footprint-compatible with the 29030. It can plug into the same socket as the 29030, and it interfaces to the same logic as the 29030 does. The 29040 provides a clean migration path while limiting development time and maintaining the attractive cost-to-performance ratio that embedded applications demand.


Figure: AMD 29030. The AMD 29030 uses both a high-performance 29K RISC core and a simple bus interface.


Floyd Goodrich is a technical marketing manager in AMD's Embedded Processor Division in Austin, Texas. He has a B.S.E.E. degree from Rice University and did product development and applications engineering at Motorola for eight years. You can reach him on the Internet at floyd.goodrich@amd.com or BIX c/o "editors."
