Archives
 
 
 
  Special
 
 
 
  About Us
 
 
 

Newsletter
Free E-mail Newsletter from BYTE.com

 
    
           
Visit the home page Browse the four-year online archive Download platform-neutral CPU/FPU benchmarks Find information for advertisers, authors, vendors, subscribers Request free information on products written about or advertised in BYTE Submit a press release, or scan recent announcements Talk with BYTE's staff and readers about products and technologies

ArticlesZen and the Art of Cache Maintenance


March 1997 / Core Technologies / Zen and the Art of Cache Maintenance

Proper understanding and use of the PowerPC's on-chip caches can improve an application's performance.

Mike Phillip

The growing disparity in clock speeds between the processor and DRAM, combined with an increasing degree of processor superscalability, has made effective cache management a critical factor in obtaining optimal application performance. With main memory access now taking anywhere from tens to hundreds of CPU clock cycles, the difference between finding desired code or data in the on-chip cache instead of in main memory can completely dominate the effective speed of an application, or even an entire system. As a developer, there are a variety of techniques available that let you more effectively make use of the processor cache s.

Dealing with Data

Often the biggest factor in data cache management is how an algorithm accesses its data. Since maintaining locality in data references (that is, accessing memory whose addresses are adjacent or close together) is critical to maximizing cache performance, structuring program code with the memory system in mind often pays big performance dividends. The programmer can use a couple relatively simple strategies to improve performance.

The first strategy involves data layout. Decisions regarding the organization of aggregate data structures often have a significant impact on data cache utilization. For example, whether you define an array of 100 structures versus a single structure whose members are each 100-element arrays can literally double or triple the attainable bandwidth of the data cache, depending on the data access patterns of the application. As the figure "Data Organization Boosts Performance" shows , an array of 100 structures can scatter data accesses throughout main memory, causing frequent flushing and reloading of the cache. A structure with 100-element arrays has these items occupying adjacent addresses, so any references to them will probably find the items in the cache.

The second strategy is to make use of processor-specific features. Most modern microprocessor instruction sets provide a mechanism for managing the data cache under software control. The PowerPC architecture contains several user-accessible instructions for manipulating the data cache that can significantly improve overall application performance (see the table "Data Cache Instructions" ).

It's important to define what a block is in this context. A block is the fundamental unit of memory that the cache works with. The cache handles all memory loads and stores using blocks. Block size can vary from processor to processor. For example, the PowerPC 601 uses 64-byte blocks, while the 603 and 604 u se 32-byte blocks.

Each of these instructions operates on a pair of general-purpose register operands, whose sum forms the effective address of the memory location(s) to be affected. The "block touch" ( dcbt ) and "block touch for store" ( dcbtst ) instructions are essentially hints to the processor that the addressed data block is to be loaded, or at least allocated, in the data cache. When placed appropriately ahead of the anticipated need for data from memory, the dcbt instruction can be used to bring data from memory (or a secondary cache) into the primary data cache, thus dodging the performance hit of a cache miss. The dcbtst instruction behaves in a similar manner except it gives the processor the additional hint that the corresponding memory location is going to be overwritten soon. Since the dcbt and dcbtst instructions do not actually modify memory (that is, they affect the contents and state of the data cache itself), there are no exceptions tak en if the address translation fails. If the requested block already resides in the data cache, these instructions are treated as no-ops by the processor.

The "block flush" ( dcbf ) and "block store" ( dcbst ) instructions force modified (or dirty) data out of the cache and back into memory. The primary difference between these two instructions in uniprocessor mode is that dcbf not only copies data back to memory (as does dcbst ), it also marks the corresponding cache block as invalid. Since both of these instructions can change data in memory, exceptions related to address translation are handled the same as they would be for a normal load instruction from the addressed memory location.

Perhaps the most unusual of the data cache manipulation instructions in the PowerPC architecture is "block set to zero" ( dcbz ). The dcbz instruction allocates a block of data in the cache and then initializes it to a series of zeroes; thus, it modifies data in the cache. The dcbz instruction can be a very powerful means of boosting performance when zeroing a large block of data. It also can lead to nonportable code if care is not taken. Since not all implementations of the PowerPC have the same data cache block size, the amount of memory actually zeroed using the dcbz instruction may vary from one processor to another. Thus, if you wish to zero a fixed-size data array, the number of dcbz instructions required varies between processors of differing cache block sizes. However, the cache block size can be determined once per application launch, prior to the first use of the dcbz instruction. A function similar to the pseudocode in the listing "Measuring Block Size" measures the data cache block size for PowerPC processors.

Instruction Impact

The instruction cache is often overlooked by programmers as a source of performance improvement, even though every executed instruction passes through this potential system bottleneck. Traditional compiler code optimizations such as function inlining and loop unrolling are frequently cited as a means of improving performance by exposing additional information to the compiler optimizer. However, if not used carefully, such transformations can have an impact on instruction cache performance by reducing the locality properties in the program code.

One means of getting the benefits of function inlining while avoiding lost instruction cache locality is to use profiling-directed feedback. This feedback guides the compiler in selecting only those performance-critical functions to be transformed into inline code. Profiling-directed feedback involves a two-step compilation process in which an application is first built with special instrumentation code that keeps track of where dynamic execution time is spent. Executing the instrumented application automatically gathers profiling statistics for one or more invocations of the application. This profiling data is fed back into th e compiler during a second compilation of the application, where it inlines key functions at only those sites where they are called frequently.

The profiling data also permits the compiler to rearrange basic blocks so that infrequently executed code gets moved to a different range of addresses than the most frequently executed code. For example, consider error-checking code: It is necessary to ensure an application's robust operation, yet it is rarely executed. Such code is often offset from the main flow of execution by a conditional language construct, such as a C if statement. The compiler can relocate such basic block(s) representing the body of the error-checking code elsewhere so that the main body of code can occupy the instruction cache.

The benefits of such an approach are applicable to a wide range of programs, particularly those in which execution time is not spent primarily within a few small program loops. Consider the speedups obtained using a combination of basic block restruct uring and cross-file function inlining for some well-known C applications that are part of the SPEC95 benchmark suite. The Gnu C compiler (gcc) and a Lisp interpreter (li) obtained speedups of greater than 20 percent when profiling-directed feedback was used to guide function inlining and basic block placement. Two other applications, an instruction-set simulator (m88ksim) and an object-oriented database transaction benchmark (vortex), each obtained speedups better than 10 percent.


Data Cache Instructions

dcbf          rA,rB      data cache block flush

dcbst         rA,rB      data cache block store

dcbt          rA,rB      data cache block touch

dcbtst        rA,rB      data cache block touch for store

dcbz          rA,rB      data cache block set to zero




Measuring Block Size

/* 
* Assumes mem[] is aligned on 256-byte boundary within a page, 
* and that 4 <= block_size <= 256 bytes as a mult
iple of 4 bytes.
*/
/* Set a 256 byte block of memory to all 1's */
    for ( i = 0; i < 64; i++ )
       mem[i] = 0xffffffff;
/* Execute a dcbz instruction at location mem[0] */
    __dcbz (mem, 0);
/* Place a "guard word" of 1's at end of 256-byte block */
    mem[64] = 0xffffffff;
/* Look for first non-zeroed word */
    for ( i = 0; mem[i] == 0; i++ )
       /* empty loop body */ ;
    block_size = i * 4;




Data Organization Boosts Performance

illustration_link (19 Kbytes)

An improper data arrangement ( at left ) has the processor flushing and reloading its on-chip caches.


Mike Phil lip is manager of PowerPC Compiler and Tools Development for Motorola. He can be reached at phillip@risc.sps.mot.com .

Up to the Core Technologies section contentsGo to previous article: Go to next article: Enhancing Netscape with Plug-InsSearchSend a comment on this articleSubscribe to BYTE or BYTE on CD-ROM  
Flexible C++
Matthew Wilson
My approach to software engineering is far more pragmatic than it is theoretical--and no language better exemplifies this than C++.

more...

BYTE Digest

BYTE Digest editors every month analyze and evaluate the best articles from Information Week, EE Times, Dr. Dobb's Journal, Network Computing, Sys Admin, and dozens of other CMP publications—bringing you critical news and information about wireless communication, computer security, software development, embedded systems, and more!

Find out more

BYTE.com Store

BYTE CD-ROM
NOW, on one CD-ROM, you can instantly access more than 8 years of BYTE.
 
The Best of BYTE Volume 1: Programming Languages
The Best of BYTE
Volume 1: Programming Languages
In this issue of Best of BYTE, we bring together some of the leading programming language designers and implementors...

Copyright © 2005 CMP Media LLC, Privacy Policy, Your California Privacy rights, Terms of Service
Site comments: webmaster@byte.com
SDMG Web Sites: BYTE.com, C/C++ Users Journal, Dr. Dobb's Journal, MSDN Magazine, New Architect, SD Expo, SD Magazine, Sys Admin, The Perl Journal, UnixReview.com, Windows Developer Network