Often the biggest factor in data cache management is how an algorithm accesses its data. Since maintaining locality in data references (that is, accessing memory whose addresses are adjacent or close together) is critical to maximizing cache performance, structuring program code with the memory system in mind often pays big performance dividends. The programmer can use a couple relatively simple strategies to improve performance.
The first strategy involves data layout. Decisions regarding the organization of aggregate data structures often have a significant impact on data cache utilization. For example, whether you define an array of 100 structures versus a single structure whose members are each 100-element arrays can literally double or triple the attainable bandwidth of the data cache, depending on the data access patterns of the application. As the figure
"Data Organization Boosts Performance"
shows
, an array of 100 structures can scatter data accesses throughout main memory, causing frequent flushing and reloading of the cache. A structure with 100-element arrays has these items occupying adjacent addresses, so any references to them will probably find the items in the cache.
The second strategy is to make use of processor-specific features. Most modern microprocessor instruction sets provide a mechanism for managing the data cache under software control. The PowerPC architecture contains several user-accessible instructions for manipulating the data cache that can significantly improve overall application performance (see the table
"Data Cache Instructions"
).
It's important to define what a
block
is in this context. A block is the fundamental unit of memory that the cache works with. The cache handles all memory loads and stores using blocks. Block size can vary from processor to processor. For example, the PowerPC 601 uses 64-byte blocks, while the 603 and 604 u
se 32-byte blocks.
Each of these instructions operates on a pair of general-purpose register operands, whose sum forms the effective address of the memory location(s) to be affected. The "block touch" (
dcbt
) and "block touch for store" (
dcbtst
) instructions are essentially hints to the processor that the addressed data block is to be loaded, or at least allocated, in the data cache. When placed appropriately ahead of the anticipated need for data from memory, the
dcbt
instruction can be used to bring data from memory (or a secondary cache) into the primary data cache, thus dodging the performance hit of a cache miss. The
dcbtst
instruction behaves in a similar manner except it gives the processor the additional hint that the corresponding memory location is going to be overwritten soon. Since the
dcbt
and
dcbtst
instructions do not actually modify memory (that is, they affect the contents and state of the data cache itself), there are no exceptions tak
en if the address translation fails. If the requested block already resides in the data cache, these instructions are treated as no-ops by the processor.
The "block flush" (
dcbf
) and "block store" (
dcbst
) instructions force modified (or dirty) data out of the cache and back into memory. The primary difference between these two instructions in uniprocessor mode is that
dcbf
not only copies data back to memory (as does
dcbst
), it also marks the corresponding cache block as invalid. Since both of these instructions can change data in memory, exceptions related to address translation are handled the same as they would be for a normal load instruction from the addressed memory location.
Perhaps the most unusual of the data cache manipulation instructions in the PowerPC architecture is "block set to zero" (
dcbz
). The
dcbz
instruction allocates a block of data in the cache and then initializes it to a series of zeroes; thus, it modifies data in the cache.
The
dcbz
instruction can be a very powerful means of boosting performance when zeroing a large block of data. It also can lead to nonportable code if care is not taken. Since not all implementations of the PowerPC have the same data cache block size, the amount of memory actually zeroed using the
dcbz
instruction may vary from one processor to another. Thus, if you wish to zero a fixed-size data array, the number of
dcbz
instructions required varies between processors of differing cache block sizes. However, the cache block size can be determined once per application launch, prior to the first use of the dcbz instruction. A function similar to the pseudocode in the listing
"Measuring Block Size"
measures the data cache block size for PowerPC processors.
Instruction Impact
The instruction cache is often overlooked by programmers as a source of performance improvement, even though every executed instruction passes through this potential
system bottleneck. Traditional compiler code optimizations such as function inlining and loop unrolling are frequently cited as a means of improving performance by exposing additional information to the compiler optimizer. However, if not used carefully, such transformations can have an impact on instruction cache performance by reducing the locality properties in the program code.
One means of getting the benefits of function inlining while avoiding lost instruction cache locality is to use profiling-directed feedback. This feedback guides the compiler in selecting only those performance-critical functions to be transformed into inline code. Profiling-directed feedback involves a two-step compilation process in which an application is first built with special instrumentation code that keeps track of where dynamic execution time is spent. Executing the instrumented application automatically gathers profiling statistics for one or more invocations of the application. This profiling data is fed back into th
e compiler during a second compilation of the application, where it inlines key functions at only those sites where they are called frequently.
The profiling data also permits the compiler to rearrange basic blocks so that infrequently executed code gets moved to a different range of addresses than the most frequently executed code. For example, consider error-checking code: It is necessary to ensure an application's robust operation, yet it is rarely executed. Such code is often offset from the main flow of execution by a conditional language construct, such as a C
if
statement. The compiler can relocate such basic block(s) representing the body of the error-checking code elsewhere so that the main body of code can occupy the instruction cache.
The benefits of such an approach are applicable to a wide range of programs, particularly those in which execution time is not spent primarily within a few small program loops. Consider the speedups obtained using a combination of basic block restruct
uring and cross-file function inlining for some well-known C applications that are part of the SPEC95 benchmark suite. The Gnu C compiler (gcc) and a Lisp interpreter (li) obtained speedups of greater than 20 percent when profiling-directed feedback was used to guide function inlining and basic block placement. Two other applications, an instruction-set simulator (m88ksim) and an object-oriented database transaction benchmark (vortex), each obtained speedups better than 10 percent.
dcbf rA,rB data cache block flush
dcbst rA,rB data cache block store
dcbt rA,rB data cache block touch
dcbtst rA,rB data cache block touch for store
dcbz rA,rB data cache block set to zero
/*
* Assumes mem[] is aligned on 256-byte boundary within a page,
* and that 4 <= block_size <= 256 bytes as a mult
iple of 4 bytes.
*/
/* Set a 256 byte block of memory to all 1's */
for ( i = 0; i < 64; i++ )
mem[i] = 0xffffffff;
/* Execute a dcbz instruction at location mem[0] */
__dcbz (mem, 0);
/* Place a "guard word" of 1's at end of 256-byte block */
mem[64] = 0xffffffff;
/* Look for first non-zeroed word */
for ( i = 0; mem[i] == 0; i++ )
/* empty loop body */ ;
block_size = i * 4;
illustration_link (19 Kbytes)

An improper data arrangement (
at left
) has the processor flushing and reloading its on-chip caches.
Mike Phil
lip is manager of PowerPC Compiler and Tools Development for Motorola. He can be reached at
phillip@risc.sps.mot.com
.