Archives
 
 
 
  Special
 
 
 
  About Us
 
 
 

Newsletter
Free E-mail Newsletter from BYTE.com

 
    
           
Visit the home page Browse the four-year online archive Download platform-neutral CPU/FPU benchmarks Find information for advertisers, authors, vendors, subscribers Request free information on products written about or advertised in BYTE Submit a press release, or scan recent announcements Talk with BYTE's staff and readers about products and technologies

ArticlesOptimizing For Today's CPUs


February 1994 / State Of The Art / Optimizing For Today's CPUs

Modern compilers employ a full bag of tricks to make programs execute efficiently and to take full advantage of the special abilities of the new superscalar processors

Alex Lane

Traditionally, optimizing code has been straightforward. Compilers improved software efficiency by reducing the number of instructions executed, making better use of the CPU's instruction set, maximizing the use of registers, minimizing memory references, and eliminating unused or redundant code. Currently, developers are creating optimization techniques to take maximum advantage of new processor features, such as on-chip caches, pipelining, and superscalar chip architecture.

The newest round of advanced microprocessors presents special challenges to compiler developers. For processor architectu res built around two integer pipelines--such as Intel's Pentium, DEC's Alpha AXP, and Hewlett-Packard's PA-RISC--the compiler must schedule instructions to keep both pipelines full. Mips Technologies' Mips III (as implemented on the R4000 and R4400 chips) uses a number of greater parallel operations, so compilers for it must keep the superpipelines from stalling. With the IBM/Apple/Motorola PowerPC, the principal challenge is to get the maximum number of free instructions per clock cycle, and with the PowerPC 601 in particular, to give branch-prediction hints.

Goals of Optimization

Compilers generate code in a fairly mechanical way; a specific syntax generates a specific series of instructions. Unfortunately, this mechanical mapping process frequently generates sequences of assembly language instructions that contain redundant instructions or that perform simple operations in an efficient and time-consuming way.

Enterprising writers of early compilers incorporated so-called peephole routines that examined generated code as it was written to an output file, replacing awkward sequences with more efficient instructions. These techniques are known as optimizations.

While the goal of optimization is to make code more compact, run faster, or both, optimization techniques can do only so much in the face of poorly organized programs or inherently slow algorithms. To address these problems, you can use a profiler utility that shows which parts of the code the CPU spends the most time executing. This identifies coding bottlenecks that you can often fix by using a different task organization, coding approach, or algorithm. Beginning developers often wonder why they need to use a profiler utility if they use a code optimizer, and vice versa. In fact, the two tools perform different and complementary functions.

A beneficial side effect of optimization is the ability to write more readable code. Before the widespread use of optimizing compilers, experienced programmers often adopted coding styles that favored better machine code generation over source code readability and maintainability. For example, multiplying a variable by a power of 2 in C might be coded as var less than less than = 3 instead of var * 8. Alternatively, you could manually insert values for constants and expressions containing constants or use additional temporary variables to take calculations out of loops. Optimizers make such tactics unnecessary, freeing the developer to write more natural, readable code.

Optimization is an analytical process. After being parsed and analyzed, code is examined to extract the maximum amount of information, which can then be used to improve the code while preserving the programmer's original intent. Analysis answers the following questions: Does an expression inside a loop ever change during the loop's lifetime? Are there particular sections of code that will never be executed? Does this variable's value ever change? Is it ever used?

In typical cases, the greater the scope of optimiz ation, the longer a compiler will take to generate code. Peephole optimizations use relatively little information and operate on a statement-by-statement level. You can get more information by examining blocks of code and identifying relationships between variables and expressions, although this takes longer. Global optimization looks at procedures and loops, and takes even more time.

The complexity of the optimization also affects overall code-generation speed. In any particular situation, it takes far less time to determine where to use arithmetic simplification--a technique in which certain expressions are replaced by simpler equivalents--than to analyze whether advanced optimizations, such as interchanging nested loops or reordering statements, can be used.

Some optimizations can be used with any processor, while others are tailored to specific chips. Machine-independent optimizations focus on generic code improvements, such as eliminating common subexpressions, making loops more efficient, and getting rid of dead code and uneeded variables. These generic optimization techniques need no knowledge of the hardware. Nearly all industrial-strength compilers today offer a full suite of such optimizations.

Machine-Dependent Optimizations

Unlike the optimizations discussed so far, machine-dependent optimizations take advantage of processor-specific knowledge to improve execution time and reduce overhead. They are particularly important for the new generation of processors, where new architectures require specific new approaches for optimization. For the most part, these kinds of optimizations represent improvements that are beyond the control of a developer who codes in a high-level language.

One key to generating good code is keeping CPU registers filled with needed values, as opposed to shuttling the values between memory and the CPU. This is the idea behind global register allocation, which is a size and speed optimization that figures out what values should be stored in registers. The longer you can keep values in registers and prevent them from being written out to on-chip cache memory or, worse, to external memory, the better performance you get.

Register optimizations are particularly valuable in RISC architectures, which typically provide a large number of registers in the CPU. A similar optimization technique is register parameter passing, which bypasses the need to push function parameters onto the stack before the function call is made, by making sure that all needed parameters are preloaded into CPU registers.

Unfortunately, these techniques are of limited value in processors that have only a few registers. In the case of register parameter passing, if the function call does not result in direct execution of the function (a typical situation in environments like Windows), the technique cannot be used.

Loop compaction on most 80x86 processors is a machine-dependent optimization that takes advantage of highly efficient string-move instructions. A typical exam ple with both C source code and the resulting assembly language instructions is shown in the listing "Loop Compaction." Surprisingly, the stosw instruction that improves 386 code will execute significantly more slowly on 486 and Pentium processors.

In pipelined and superscalar processors, techniques that keep instructions executing while minimizing memory paging and register swapping operations are important. For example, consider a loop in C written as

int x[ROW][COL];

for (j=0;j less than ROW;i++)

for (i=0;i less than COL;j++)

x[i][j]=foo(bar);

For sufficiently large values of ROW and COL, this suffers from widely separated addresses--called a stride in some circles--for successive values of x[i][j]. Running this code will likely result in a lot of memory paging. However, just by interchanging these nested loops, successive values of x[i][j] will be near one another and will reduce the number of paging operations necessary to run the code.

Another commonly used op timization is statement reordering. Here, computations are done in a different sequence than specified in the source code. When applied to floating-point calculations, this optimization is likely to produce slightly different results than the originally specified code. This is due to the accumulation of round-off errors that differ from those you would obtain using the original code.

An interesting technique called strip mining isn't an optimization in itself, but it can be used with other techniques to boost performance. The term is taken from a technique used with supercomputers.

Strip mining is particularly useful for optimizing matrix multiplication on cache-based, superscalar processors such as the Alpha, Pentium, PowerPC, or PA-RISC chips. The key lies in transforming the ponderous row-column algorithm, which (if matrix C is an n ´ n product of matrices) looks like this:

for (i=0; i less than n; i++)

for (j=0; j less than n; j++)

for (k=0,C[i,j]=0; k less than n; k++)

C[i][j] += A[i][k]*B[k][j];

into one that effectively multiplies square subblocks of the matrices. The listing "Strip Mining" shows what the optimized code might look like.

In the sample code, blk_factor is a processor-dependent number selected to make best use of the CPU's cache that, in effect, causes the matrices to be multiplied in blocks of blk_factor size. (The variable init_strip_sz is used to multiply a smaller block if the size of the matrices being multiplied is not evenly divisible by blk_factor.)

In creating this code, the optimizer strip-mined each of the original loops to create block loops and strip loops. The block loops, which cause blocks of memory to be paged into and out of the cache, were interchanged to move them to the "outside" of the nest of loops. The strip loops, which perform calculations within each block, were moved to the inside of the nested loops.

Arranged in this way, the computation makes best use of the processor and memory system, since you can use cached matrix blocks repeatedly before discarding them and intermediate results can be enregistered longer than otherwise possible. The result is nearly an order of magnitude improvement in performance for n greater than 150.

However, the optimized code is obviously much more complicated than the original. Further, the size of blk_factor and the ordering of the loops can vary from processor to processor. (In this example, the transformed code was generated for an IBM RS/6000, which has a 64-KB cache and four-way set associativity, using KAP/C from Kuck and Associates of Champaign, Illinois.) For this reason--and from a readability standpoint--developers should write code as shown in the original and let the optimizer take care of the details.

Another important optimization for pipelined architectures is instruction scheduling. This helps prevent processor pipelines from stalling. In Pentium and 486 processors, for example, if an address or register needed by a particular instruction is not a vailable because a previous instruction has not finished processing, an AGI (Address Generation Interlock) occurs. In the following code example, an AGI occurs after the second instruction because the first has not completed execution:

inc eax

mov ebx,[eax] ; stalled owing

to AGI

inc ecx

A good instruction scheduler would rearrange the instructions as follows, to allow uninterrupted execution:

inc eax

inc ecx

mov ebx,[eax]

AGIs are particularly troublesome for Pentium processors, since two instructions can execute in parallel in the final three stages of the integer pipeline, in separate U- and V-pipes, creating more stalling opportunities in longer instruction sequences. In the following example,

;U-pipe V-pipe

inc ebx inc ecx

inc edx mov eax,[ebx]

;stalled owing to AGI

mov esi,[mem]

an AGI occurs after the second instruction is encountered in the V-pipe. This halts execution in both pipes for one clock cycle unt il the first instruction completes. Properly scheduled, the code would look as follows:

;U-pipe V-pipe

inc ebx inc ecx

inc edx mov esi,[mem]

mov eax,[ebx]

To Optimize or Not to Optimize?

There are two good reasons why you might not want to always run the optimizer when compiling code. The first is time. It's not unusual for compilers with aggressive optimizers to increase code generation time by 200 percent to 300 percent, and this figure may rise for compilers that implement advanced, machine-dependent optimizations.

The second reason concerns debugging. Traditional source-level debuggers rely on a correspondence between the source code and the object code. When you introduce moved code, compacted (or worse, interchanged) loops, eliminated variables, inlining, and so on, optimized code often bears little resemblance to the source code from which it came. This makes debugging optimized code nearly impossible, except at the assembly language level.

Vari ous approaches to debugging optimized code have been tried, such as noting what line of original source code generated particular instructions, and others are under development. Currently, though, the standard approach to software development is to debug using unoptimized code, and only then to run the optimizer to generate the final executable. Any bugs that are introduced during optimization must be chased down at the assembly language level.

As the complexity of processors increases, today's optimizing compilers are becoming more complex. In addition to providing the traditional machine-independent optimizations that mimic what a determined developer might do to make code faster or smaller, the compilers are beginning to perform machine-dependent optimizations that promise quantum boosts in performance.


Listing: Loop Compaction



; for (i=0;i less than 10;i++)
;  x[i]=j+k;
;
mov cx, 10                    ;load cx w/ no. of iterations
lea di,word ptr [bp-108]      ;loa
d di with adr of x[0]
push ds
pop es                        ;load es with data segment
mov ax,word ptr [bp-6]        ;get value of n
add ax,word ptr [bp-8]        ;add value of m
rep stosw                     ;ZOOM!
mov word ptr [bp-2],10        ;set i to 10




Listing: Strip Mining



int i_blk, i_strip_sz, i_strip_lb, i_strip_ub;
int j_blk, j_strip_sz, j_strip_lb, j_strip_ub;
int k_blk, k_strip_sz, k_strip_lb, k_strip_ub;


i_strip_lb = 0;
i_strip_sz = init_strip_sz;


for (i_blk = 1; i_blk less than = n; i_blk += blk_factor ) {
   i_strip_ub = i_strip_ub + i_strip_sz - 1;
   k_strip_lb = 0;
   k_strip_sz = init_strip_sz;
   for (k_blk = 1; k_blk less than = n; k_blk += blk_factor ) {
      k_strip_ub = k_strip_ub + k_strip_sz - 1;
      j_strip_lb = 0;
      j_strip_sz = init_strip_sz;
      for (j_blk = 1; j_blk less than = n; j_blk += blk_factor ) {
         j_strip_ub = j_strip_ub + j_strip_sz - 1;
         for (i = i_strip_lb; i less than
 = i_strip_ub; i++)
            for (j = j_strip_lb; j less than = j_strip_ub; j++)
               for (k = k_strip_lb; k less than = k_strip_ub; k++)
                  C[i][j] += A[i][k] * B[k][j];
         j_strip_lb += j_strip_sz;
         j_strip_sz = blk_factor;
      }
      k_strip_lb += k_strip_sz;
      k_strip_sz = blk_factor;
   }
   i_strip_lb += i_strip_sz;
   i_strip_sz = blk_factor;
}


Optimization Techniques



Machine-dependent
-- Global register allocation
-- Register parameter passing
-- Loop compaction
-- Statement reordering
-- Strip mining
-- Instruction scheduling


Machine-independent
-- Constant propagation (also known as copy propagation)
-- Common subexpression elimination
-- Invariant code motion
-- Dead code elimination
-- Dead store elimination
-- Redundant load suppression
-- Variable induction
-- Loop jamming (often called loop fusion)
-- Loop unrolling
-- Loop rerolling
-- Inlining
-- Loop peeling




Alex Lane is a Colorado-based writer, speaker, and consultant. He can be reached on the Internet or BIX at a.lane@bix.com .

Up to the State Of The Art section contentsGo to previous article: Compiler Benchmarks: How Useful?Go to next article: Optimizing with Pre- and Post-CompilersSearchSend a comment on this articleSubscribe to BYTE or BYTE on CD-ROM  
Flexible C++
Matthew Wilson
My approach to software engineering is far more pragmatic than it is theoretical--and no language better exemplifies this than C++.

more...

BYTE Digest

BYTE Digest editors every month analyze and evaluate the best articles from Information Week, EE Times, Dr. Dobb's Journal, Network Computing, Sys Admin, and dozens of other CMP publications—bringing you critical news and information about wireless communication, computer security, software development, embedded systems, and more!

Find out more

BYTE.com Store

BYTE CD-ROM
NOW, on one CD-ROM, you can instantly access more than 8 years of BYTE.
 
The Best of BYTE Volume 1: Programming Languages
The Best of BYTE
Volume 1: Programming Languages
In this issue of Best of BYTE, we bring together some of the leading programming language designers and implementors...

Copyright © 2005 CMP Media LLC, Privacy Policy, Your California Privacy rights, Terms of Service
Site comments: webmaster@byte.com
SDMG Web Sites: BYTE.com, C/C++ Users Journal, Dr. Dobb's Journal, MSDN Magazine, New Architect, SD Expo, SD Magazine, Sys Admin, The Perl Journal, UnixReview.com, Windows Developer Network