Two different designs achieved the same goal: a faster 680x0 emulator for the Mac
Tom Thompson
In March, Apple released version 2.0 of MAE (Macintosh Application Environment), a program that hosts the Mac OS in a Unix window on Sun SparcStations or Hewlett-Packard's HP 9000 workstations. MAE 2.0 offers better Mac 680x0 application performance because it uses a faster 680x0 emulator. The Power Mac 9500, introduced this summer, also gets a performance assist from a new 680x0 emulator. What's interesting, and the focus of this column, is that both designs use the same technique--dynamic recompilation--to improve performance.
The Interpretive Emulator
To understand how these new emulators work, we must first explain how the original 68LC040 emulator operates. It consists of a lookup dispatch table and a PowerPC code library. The code library contains functions that implement each 680x0 instruction, and entries in the dispatch table point to these functions. The dispatch table also has entries for 680x0 processor A- and F-line exceptions (or traps). Apple uses the A-line trap as the entry point into its Mac Toolbox routines, and the F-line trap handles certain hardware-specific traps (e.g., address or bus errors). The emulator has a 580-KB footprint in ROM.
The emulator operates by fetching a 16-bit 680x0 instruction. (Instructions can be 32 bits or longer, but the first 16 bits define the instruction's function.) This value acts as an index to an entry in the dispatch table, and each table entry consists of two PowerPC instructions. For a simple 680x0 instruction, the first PowerPC instruction handles the operation in-line, and the second instruction returns execution back to the emulator. For some 680x0 instructions, the second native instruction is a PC-relative branch to a code library function. The function's native instructions complete the operation, and control returns to the emulator (see the figure "The Basic 68LC040 Emulator").
This design interprets one 680x0 instruction at a time, every time the instruction is executed, and is thus known as an interpretive emulator. Interpretive emulation isn't efficient when sections of code are executed frequently (e.g., in tight loops), because the same instructions are decoded again on every pass. However, in Apple's case, it provided the best compatibility with existing 680x0 software.
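The fetch-and-dispatch cycle described above can be sketched in a few lines of Python. This is only an illustration, assuming a toy machine that knows two 680x0 instructions (MOVEQ and the register form of ADD.W) and ignores condition codes. The names `CPU`, `dispatch`, `op_moveq`, and `op_add_w_reg` are hypothetical, and the real emulator's table entries are pairs of PowerPC instructions, not Python functions.

```python
# Toy interpretive emulator: one dispatch-table lookup per 680x0 instruction.
# Handler names and the register model are illustrative assumptions.

class CPU:
    def __init__(self, program):
        self.d = [0] * 8          # data registers D0-D7
        self.pc = 0               # index into program, in 16-bit words
        self.program = program    # list of 16-bit instruction words

def op_illegal(cpu, word):
    raise ValueError(f"illegal opcode {word:#06x}")

def op_moveq(cpu, word):
    # MOVEQ #imm8,Dn : 0111 nnn0 iiiiiiii (immediate is sign-extended)
    imm = word & 0xFF
    if imm & 0x80:
        imm -= 0x100
    cpu.d[(word >> 9) & 7] = imm & 0xFFFFFFFF

def op_add_w_reg(cpu, word):
    # ADD.W Dm,Dn : 1101 nnn 001 000 mmm ; only the low 16 bits of Dn change.
    # Condition codes are not modeled in this sketch.
    dst, src = (word >> 9) & 7, word & 7
    total = (cpu.d[dst] + cpu.d[src]) & 0xFFFF
    cpu.d[dst] = (cpu.d[dst] & 0xFFFF0000) | total

# One entry for every possible 16-bit opcode word.
dispatch = [op_illegal] * 65536
for word in range(65536):
    if word & 0xF100 == 0x7000:
        dispatch[word] = op_moveq
    elif word & 0xF1F8 == 0xD040:
        dispatch[word] = op_add_w_reg

def run(cpu):
    while cpu.pc < len(cpu.program):
        word = cpu.program[cpu.pc]    # fetch the 16-bit opcode word
        cpu.pc += 1
        dispatch[word](cpu, word)     # index the table, run the handler
    return cpu

# MOVEQ #5,D0 ; MOVEQ #7,D1 ; ADD.W D1,D0
cpu = run(CPU([0x7005, 0x7207, 0xD041]))
print(cpu.d[0])   # 12
```

Every pass through a loop pays for the fetch and the table lookup again, which is the cost that dynamic recompilation sets out to remove.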
Dynamic recompilation (or DR) offers better efficiencies during emulation by "recompiling" sections of frequently used 680x0 instructions into chunks of native code. Rather than laboriously interpret each 680x0 instruction inside, say, a loop, the DR emulator hops to a native-code block that performs the looping operation.
The MAE Implementation
The MAE DR emulator is actually an enhancement built onto the proven interpretive emulator. Because it's part of a program running on a workstation, the MAE DR emulator operates differently from the Power Mac DR emulator. The MAE emulator has to implement basic services normally provided through Apple hardware. However, it can also rely on certain low-level support, such as interrupt handling and disk I/O, from the workstation's OS.
The first task the DR emulator performs is to identify frequently used sequences of 680x0 instructions, or hot blocks. Marking a block's starting point is easy: It's the target of an emulated branch instruction. A block's end is determined by a change of program flow to a distant address, and resolving this properly gets tricky.
The emulator looks for three kinds of instruction to discern these flow changes. The first is a return instruction, provided the return address isn't a nearby location, for reasons we'll see. (This return instruction is an unconditional branch under RISC.) The second is a conditional branch instruction, but only if the target address is nonlocal.
One reason that there are no hard-and-fast rules for the first two instances is that high-level-language compilers frequently implement control statements as conditional branch instructions. These instructions test for conditions that, if satisfied, perform short jumps around a branch instruction that might exit a loop. This same situation also explains why an unconditional branch instruction (i.e., return) by itself doesn't guarantee the end of a block.
The third instance that marks a block's end is a complex 680x0 instruction. Recompiling such instructions requires too much overhead and time, so the easiest solution is to end the code block there. Otherwise, for performance reasons, the MAE emulator tries to make the code blocks as large as possible.
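The three block-ending tests above can be expressed as a small scanning routine. The following Python sketch is a guess at the shape of the logic, not Apple's code: the `Insn` record, the `LOCAL_RANGE` cutoff of 64 bytes for a "nearby" target, and the choice to end a block when a destination is unknown are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

LOCAL_RANGE = 64   # assumed cutoff (bytes) for a "nearby" target; not from the article

@dataclass
class Insn:
    addr: int                     # address of this instruction
    kind: str                     # "simple", "cond_branch", "return", or "complex"
    target: Optional[int] = None  # branch or return destination, when known

def ends_block(insn):
    """Apply the three block-ending tests to a single instruction."""
    if insn.kind == "complex":
        return True                       # too costly to recompile
    if insn.kind in ("return", "cond_branch"):
        if insn.target is None:
            return True                   # unknown destination: assume it is distant
        return abs(insn.target - insn.addr) > LOCAL_RANGE
    return False

def find_block_end(insns, start):
    """Return the index of the instruction that closes the block opened at start."""
    for i in range(start, len(insns)):
        if ends_block(insns[i]):
            return i
    return len(insns) - 1

code = [
    Insn(0x1000, "simple"),
    Insn(0x1002, "cond_branch", target=0x1008),   # short hop inside a loop
    Insn(0x1004, "simple"),
    Insn(0x1006, "cond_branch", target=0x4000),   # distant target ends the block
    Insn(0x1008, "simple"),
]
print(find_block_end(code, 0))   # 3
```

The short conditional hop at 0x1002 stays inside the block, which is how the scan keeps blocks as large as possible.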
With the potential hot blocks mapped out, the next step is to flag those blocks that are heavily used. This is done with little overhead by pushing the target addresses of 680x0 branch instructions onto the native stack. A frequency-of-use analysis is performed on the addresses, and those blocks that are executed more than 256 times per tick (i.e., 1/60 second) are recompiled.
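As a rough sketch of the frequency-of-use analysis, assume the logged branch targets for one tick are available as a plain list. The function name `hot_blocks` and the sample addresses are made up; only the threshold of 256 executions per tick comes from the text.

```python
from collections import Counter

HOT_THRESHOLD = 256    # executions per tick, as stated in the text

def hot_blocks(logged_targets, threshold=HOT_THRESHOLD):
    """Return the block start addresses seen more than threshold times in one tick's log."""
    counts = Counter(logged_targets)
    return {addr for addr, n in counts.items() if n > threshold}

# One tick's worth of logged branch targets (made-up addresses).
log = [0x2000] * 300 + [0x2040] * 256 + [0x3000] * 5
print(sorted(hex(a) for a in hot_blocks(log)))   # ['0x2000']
```

The block at 0x2040 ran exactly 256 times, so it does not qualify; the text calls for more than 256.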
Recompilation involves copying the PowerPC instructions out of the emulator's code library one at a time and performing postprocessing on the native representation of the hot block. Such postprocessing involves dead code removal and code optimizations (e.g., embedding data constants). The emulator does the postprocessing code generation rapidly by fetching and modifying data from 680x0 instruction templates. These templates were built in memory when the MAE process was launched.
Finally, the native-code blocks are placed in a cache buffer. This buffer's size is dynamic, usually hovering around 256 KB, but it can expand to 1 MB in an application-intensive environment.
The DR emulator performs sleight of hand so that the interpretive emulator uses the cached code blocks. Recall that a dispatch table routes execution to the appropriate native code. This table contains 2^16 (65,536) entries, one for every possible 680x0 op-code variation, but only 50,000 of them represent valid op codes.
The DR emulator updates some of the table's invalid entries with pointers to code blocks inside the cache buffer. It patches the 680x0 application's image in memory so that the start of each hot block contains an invalid op code. As 680x0 hot blocks are detected and recompiled, and the corresponding locations in the dispatch table and the application are revised, the interpretive emulator starts jumping to the recompiled code blocks (see the figure "The Two DR Implementations").
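The patching trick can be illustrated with a toy model in which the dispatch table is a dictionary and a recompiled block is an ordinary function. Everything here is a hypothetical stand-in: the `Emulator` class, the `saved_words` bookkeeping (the article does not say how the original op code is remembered), and the value 0x4AFB, which is used simply as an example of an op code assumed to be unused.

```python
def op_illegal(state):
    raise ValueError("illegal opcode")

class Emulator:
    def __init__(self, memory, invalid_opcodes):
        self.memory = memory                     # address -> 16-bit opcode word
        self.dispatch = {}                       # opcode word -> handler
        self.free_slots = list(invalid_opcodes)  # invalid opcodes not yet reused
        self.saved_words = {}                    # address -> original word (assumed bookkeeping)

    def install_hot_block(self, start_addr, native_block):
        """Route the block at start_addr to native_block via an invalid opcode."""
        opcode = self.free_slots.pop()
        self.dispatch[opcode] = native_block     # table entry now points into the cache
        self.saved_words[start_addr] = self.memory[start_addr]
        self.memory[start_addr] = opcode         # patch the application's image
        return opcode

    def step(self, addr, state):
        word = self.memory[addr]                 # the interpreter fetches as usual
        handler = self.dispatch.get(word, op_illegal)
        return handler(state)

def native_loop(state):
    # Stand-in for a recompiled block: sums 1..10 in a single call.
    state["d0"] = sum(range(1, 11))
    return state

emu = Emulator(memory={0x1000: 0x5280}, invalid_opcodes=[0x4AFB])
emu.install_hot_block(0x1000, native_loop)
print(emu.step(0x1000, {"d0": 0}))   # {'d0': 55}
```

The interpreter itself is unchanged: it fetches a word and indexes the table, and the patched word happens to lead to recompiled code.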
The Power Mac Implementation
The Power Mac DR emulator differs from the MAE design because it's responsible for running the OS. Like MAE, the Power Mac DR emulator is an add-on to the old emulator. The design was optimized for low overhead and a small footprint. It consists of 30 KB of hand-tuned PowerPC assembly language code.
The DR emulator sorts out frequently used 680x0 code blocks and recompiles them. As before, the start of a block is the target of a branch instruction, but the criteria that determine the block's end differ from the MAE design. The instances that mark a block's end are an unconditional branch or jump instruction, an illegal instruction, and a complex instruction. Also, a block's length is capped at 128 bytes, or 64 two-byte 680x0 instructions. The emulator maintains a small history table to flag hot blocks. For the Power Mac, the frequency-of-execution threshold value is small (typically less than 10) and was determined empirically.
The emulator uses a fast set of algorithms to recompile a hot block. The 680x0 instruction value acts as an index into an array of functions, each of which translates an instruction type (e.g., an ADD.W, where the parameters are word values located in registers, memory, or a combination of both). The function first emits a general native add instruction. Next, it fills in the rest of the fields so that the PowerPC instruction specifies the location and size of its parameters, such as adding one 16-bit register value to another. An add to memory would generate the appropriate load/store instructions required to move the data to and from memory.
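To make the translation step concrete, here is a minimal Python sketch that turns two forms of ADD.W into PowerPC mnemonics. It rests on several assumptions that are not in the article: 680x0 registers D0-D7 are mapped to PowerPC r8-r15 and A0-A7 to r16-r23, r24 serves as a scratch register, condition codes are ignored, and the top four bits of the op code stand in for the article's function-array index. The real recompiler emits binary instructions; mnemonics are used here for readability.

```python
# Assumed register assignment (not from the article):
# 680x0 D0-D7 -> PowerPC r8-r15, A0-A7 -> r16-r23, r24 as scratch.
def dreg(n):
    return f"r{8 + n}"

def areg(n):
    return f"r{16 + n}"

SCRATCH = "r24"

def translate_add_w(word):
    """Translate two ADD.W forms into a list of PowerPC mnemonics."""
    reg = (word >> 9) & 7       # Dn field
    opmode = (word >> 6) & 7    # 001: <ea>+Dn -> Dn, 101: Dn+<ea> -> <ea>
    mode = (word >> 3) & 7      # effective-address mode
    ea_reg = word & 7
    if opmode == 0b001 and mode == 0b000:            # ADD.W Dm,Dn
        return [
            f"add {SCRATCH},{dreg(reg)},{dreg(ea_reg)}",
            f"rlwimi {dreg(reg)},{SCRATCH},0,16,31",   # replace only Dn's low 16 bits
        ]
    if opmode == 0b101 and mode == 0b010:            # ADD.W Dn,(Am)
        return [
            f"lhz {SCRATCH},0({areg(ea_reg)})",        # load the 16-bit operand
            f"add {SCRATCH},{SCRATCH},{dreg(reg)}",
            f"sth {SCRATCH},0({areg(ea_reg)})",        # store the low 16 bits back
        ]
    raise NotImplementedError(f"form {word:#06x} not covered by this sketch")

# Stand-in for the array of translation functions.
translators = {0xD: translate_add_w}

def recompile(words):
    native = []
    for word in words:
        native.extend(translators[word >> 12](word))
    return native

# ADD.W D1,D0 followed by ADD.W D1,(A0)
for line in recompile([0xD041, 0xD350]):
    print(line)
```

The register form needs no memory traffic, while the memory form is bracketed by a load and a store, matching the distinction drawn in the text.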
The recompiler stows the finished native instruction into the cache buffer, fetches another 680x0 instruction, and continues this process until the hot block's translation is complete. For blocks with short backward branches (indicating a loop), the recompiler also adds code that monitors hardware interrupts, because the emulator helps implement the Mac OS on a very low level.
The cache buffer is 256 KB in size. The caching algorithm is starkly efficient: When the buffer fills, it purges all the cached blocks and recompilation begins anew. More complex caching schemes added too much overhead to the design, and the high locality of typical code means that the buffer isn't purged often.
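The purge-everything policy is simple enough to sketch directly. The `CodeCache` class below is a hypothetical model in Python, with native code represented as byte strings; the 256-KB default comes from the text, and the rest is assumed for illustration.

```python
class CodeCache:
    """Fixed-size cache of recompiled blocks that is emptied wholesale when full."""

    def __init__(self, capacity=256 * 1024):
        self.capacity = capacity
        self.used = 0
        self.blocks = {}      # 680x0 start address -> native code bytes
        self.purges = 0

    def insert(self, start_addr, native_code):
        size = len(native_code)
        if size > self.capacity:
            raise ValueError("block larger than the whole cache")
        # Replacing an existing entry frees its space first.
        self.used -= len(self.blocks.pop(start_addr, b""))
        if self.used + size > self.capacity:
            self.blocks.clear()       # no per-block eviction: drop everything
            self.used = 0
            self.purges += 1
        self.blocks[start_addr] = native_code
        self.used += size

    def lookup(self, start_addr):
        return self.blocks.get(start_addr)

cache = CodeCache(capacity=1024)          # small capacity to force a purge
for addr in (0x1000, 0x2000, 0x3000):
    cache.insert(addr, bytes(400))
print(cache.purges, sorted(hex(a) for a in cache.blocks))   # 1 ['0x3000']
```

There is no per-block bookkeeping to maintain, which is the source of the low overhead; the cost is that blocks still in use are thrown away and must be recompiled.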
With the native block cached in the buffer, the DR emulator begins using it by monitoring the 680x0 instruction stream. When the emulator detects a 680x0 branch instruction, it compares the target address (i.e., the potential start of a hot block) with a hashed table of native program counter addresses. If there is a match, compiled code exists and execution hops to the address of the cached code block. If there isn't a match, the history table is updated, and the 680x0 emulator interprets the code.
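The lookup-or-interpret decision made at each branch can be sketched as follows. This Python model is an assumption-laden illustration: a dictionary stands in for the hashed table, the history table is unbounded here although the article describes it as small, and the point at which a block is recompiled (when its count reaches the threshold) is a guess. The names `BranchMonitor` and `on_branch` are hypothetical.

```python
HOT_THRESHOLD = 8   # assumed value; the text says only "typically less than 10"

class BranchMonitor:
    def __init__(self, recompile, threshold=HOT_THRESHOLD):
        self.compiled = {}          # branch target -> native block (stands in for the hashed table)
        self.history = {}           # branch target -> times seen while interpreting
        self.recompile = recompile  # callback: target address -> native block
        self.threshold = threshold

    def on_branch(self, target):
        """Return a native block to run, or None to keep interpreting."""
        block = self.compiled.get(target)
        if block is not None:
            return block                        # match: hop to the cached code
        count = self.history.get(target, 0) + 1
        self.history[target] = count            # no match: update the history table
        if count >= self.threshold:             # the block has become hot
            self.compiled[target] = self.recompile(target)
            del self.history[target]
        return None

monitor = BranchMonitor(recompile=lambda addr: f"native@{addr:#x}", threshold=3)
results = [monitor.on_branch(0x2000) for _ in range(5)]
print(results)   # [None, None, None, 'native@0x2000', 'native@0x2000']
```

The first three branches are interpreted while the history count builds; from the fourth onward the cached block is used.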
Performance Wins
The DR emulators add a level of complexity to the original emulator. Also, caching the translated code produces some side effects that can affect compatibility. When code gets written to memory by a program, or the A5 jump table in a 680x0 Mac application's code segment zero gets modified, the cache buffer's contents can fall out of sync with memory. This causes a crash unless care is taken to notify the emulator of the change. The application must call one of several Toolbox routines that flush the cache, and the DR emulator honors cache-flushing instructions such as CPUSH and CINV. Any application whose code was redesigned for the 68040 should work reliably with these new emulators.
The performance gains outweigh the compatibility pitfalls, however. The MAE emulator boosts an application's performance by an average of 50 percent. Certain compute-intensive operations, such as an Excel spreadsheet recalculation, see a 100 percent improvement or more. For the Power Mac emulator, native applications see a 10 percent to 15 percent improvement, while emulated applications run 20 percent to 30 percent faster. For some compute-intensive tasks, a speed boost of 200 percent has been observed.
Figure "The Basic 68LC040 Emulator": An instruction is interpreted in-line or via a library function.
Figure "The Two DR Implementations": Both DR emulators are enhancements to the existing design.
Tom Thompson is a BYTE senior technical editor at large. You can reach him on AppleLink as T.THOMPSON or on the Internet or BIX at tom_thompson@bix.com.