Programming Strategies for Intel's MMX


August 1996 / Core Technologies / Programming Strategies for Intel's MMX

A guide to using the Pentium's new multimedia instructions.

Jonathan Khazam and Bev Bachmayer

Developing fast applications for Intel x86 processors is, in general, not difficult. However, an understanding of the processor's architecture makes the difference between a fast application and a slow one.

Intel processors that offer MMX technology add a new dimension to code development. The MMX technology is a set of highly optimized instructions for multimedia tasks that's included in Pentium processors scheduled to ship later this year. (For more on MMX, see "x86 Enters the Multimedia Era," July BYTE.) Software development cycles being what they are, developers need to start considering now where and how these MMX instructions can boost the performance of their applications.

Planning Considerations

Before changing a line of code, the first thing you should do is profile your application. Profiling is the process of linking special libraries into your program or using system utilities to measure where your program spends most of its execution time.
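
Where a full profiler is not at hand, a rough first measurement can be taken by timing a suspect routine directly. The sketch below is only an illustration of that idea, not a substitute for a real profiling tool. The names sum_samples and time_sum_samples are hypothetical, chosen for this example, and the code relies on nothing beyond the standard C clock() function.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical hot loop: sums 16-bit samples, the kind of small,
   repetitive integer loop that a profile typically flags. */
static long sum_samples(const short *samples, size_t count)
{
    long total = 0;
    size_t i;

    for (i = 0; i < count; i++)
        total += samples[i];
    return total;
}

/* Returns the CPU seconds spent calling sum_samples() `repeats` times.
   The accumulated result is stored through `sink` so that the work
   being timed has a visible effect. */
static double time_sum_samples(const short *samples, size_t count,
                               int repeats, long *sink)
{
    clock_t start, stop;
    long total = 0;
    int r;

    assert(samples != NULL && sink != NULL && repeats > 0);

    start = clock();
    for (r = 0; r < repeats; r++)
        total += sum_samples(samples, count);
    stop = clock();

    *sink = total;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Timing each candidate routine this way, with realistic buffer sizes and enough repeats to exceed the clock's resolution, gives a crude ranking of where the time goes.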

Generally, you want to work on those code segments that are computationally expensive or that take a sizable percentage of the application-processing time. In multimedia and communications applications, such code sections typically include filters and speech-compression algorithms, video-display routines, and rendering routines.

In general, such routines consist of small, repetitive loops that operate on 8- or 16-bit integers. It is these routines that yield the greatest overall performance increase when converted to MMX-optimized code.

Such algorithms need to be analyzed for their fit with MMX instructions. The MMX technology adds 57 new op codes, designed to do high-speed arithmetic, logical, and comparison operations on packed data. As mentioned above, these MMX instructions offer the best support for 8- and 16-bit integer data types.
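
To make "packed data" concrete, the following portable C sketch imitates a packed 16-bit add in software: four 16-bit lanes share one 64-bit word and are added in a single pass, with each lane wrapping around independently. This is an emulation for illustration only, not MMX code. The helper names pack4x16, add4x16, and lane4x16 are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Packs four 16-bit values into one 64-bit word, lane 0 in the low bits. */
static uint64_t pack4x16(const uint16_t lanes[4])
{
    return (uint64_t)lanes[0]
         | ((uint64_t)lanes[1] << 16)
         | ((uint64_t)lanes[2] << 32)
         | ((uint64_t)lanes[3] << 48);
}

/* Adds the four 16-bit lanes of a and b independently. Each lane wraps
   around on overflow, and no carry passes from one lane to the next. */
static uint64_t add4x16(uint64_t a, uint64_t b)
{
    const uint64_t top_bits = 0x8000800080008000ULL;

    /* Add the low 15 bits of every lane; the cleared top bit of each
       lane absorbs the carry, so it cannot reach the neighboring lane. */
    uint64_t low_sum = (a & ~top_bits) + (b & ~top_bits);

    /* Fold each lane's top bits back in with XOR, dropping the carry out. */
    return low_sum ^ ((a ^ b) & top_bits);
}

/* Extracts lane `index` (0 to 3) from a packed word. */
static uint16_t lane4x16(uint64_t packed, int index)
{
    assert(index >= 0 && index < 4);
    return (uint16_t)(packed >> (16 * index));
}
```

An MMX packed add performs the same four-lane operation in one instruction on a 64-bit register, which is why loops over 8- and 16-bit integers benefit most.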

In some cases, it's possible to improve an algorithm's performance by rewriting it to use MMX instructions. For example, suppose a multimedia algorithm uses integer data. The first step is to use a profiling tool to identify which parts of the algorithm consume the most processor cycles. Once such "hot spots" are identified, you rewrite these code sections to use MMX integer instructions.

Floating-Point or Integer?

If an algorithm employs floating-point data, you should determine why it was used. Floating-point math is typically employed for one of two reasons. The first is for performance, since floating-point multiplies are about three times faster than standard integer multiplies in the Pentium. The second reason is that the algorithm in question requires a large range or lots of precision in its results.

If the algorithm uses floating-point math to obtain better performance, then it's certainly a candidate for conversion to MMX integer code. On the other hand, if the algorithm requires the range or precision that floating-point data offers, further investigation must be done. Can the algorithm's data values be converted to integer while maintaining the required range and precision? If so, you might rework the algorithm to take advantage of the MMX instructions.
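
One common way to test whether floating-point values can move to integers is to try a fixed-point representation. The sketch below assumes the data fits in the range [-1.0, 1.0) and uses the Q15 format (a signed 16-bit integer interpreted as value / 32768). The format choice and the names float_to_q15, q15_to_float, and q15_mul are assumptions made for this example. Whether 15 fractional bits are enough depends entirely on the algorithm.

```c
#include <assert.h>
#include <stdint.h>

/* Q15: a signed 16-bit integer that represents value / 32768. */
#define Q15_ONE 32768

/* Converts a floating-point value to Q15, rounding to nearest and
   clamping input outside [-1.0, 1.0) to the representable limits. */
static int16_t float_to_q15(double value)
{
    double scaled;

    assert(value == value);   /* NaN has no fixed-point equivalent */

    scaled = value * Q15_ONE;
    scaled += (scaled >= 0.0) ? 0.5 : -0.5;
    if (scaled > 32767.0)
        return INT16_MAX;
    if (scaled < -32768.0)
        return INT16_MIN;
    return (int16_t)scaled;
}

/* Converts a Q15 value back to floating point. */
static double q15_to_float(int16_t value)
{
    return (double)value / Q15_ONE;
}

/* Multiplies two Q15 values. The 32-bit product carries 30 fractional
   bits, so dividing by 32768 brings it back to Q15 (truncating toward
   zero). The single overflow case, -1.0 * -1.0, is clamped. */
static int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t product = ((int32_t)a * (int32_t)b) / Q15_ONE;

    if (product > INT16_MAX)
        product = INT16_MAX;
    return (int16_t)product;
}
```

Running the original floating-point routine and the fixed-point version over the same test data, then comparing the results, shows quickly whether the loss of range and precision is acceptable.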

When writing MMX code, it's important to keep in mind that the processor aliases the 64-bit MMX registers over the 80-bit floating-point registers, as shown in the figure "Pentium Registers Do Double Duty." This sleight of hand allows the addition of eight 64-bit, directly addressable MMX registers without adding any new processor states or compromising software compatibility.

Because the registers are physically the same, however, you can't store both floating-point data and packed-integer data in the same register at the same time. In addition, there's a small amount of processor overhead (several tens of clocks) when switching between floating-point and MMX instructions. To keep this overhead from sapping application performance, don't intermix floating-point and MMX code at the instruction level. If an application frequently switches between floating-point and MMX instructions, then you should consider extending the period that the application stays in either the MMX instruction stream or the floating-point instruction stream; this procedure will better amortize the switching overhead.

Because floating-point convention specifies that the floating-point stack be cleared after use, it's important to clear the MMX state before issuing a floating-point instruction. The EMMS instruction is designed for just this purpose; it marks all the MMX registers as empty by setting the value of the floating-point tag word to empty (i.e., all 1s). This instruction is the MMX technology's equivalent of popping floating-point values off the stack to leave it empty. The EMMS instruction should be inserted at the end of all MMX code segments to avoid a floating-point stack overflow exception in the floating-point code that follows.

When writing an application that uses both floating-point and MMX instructions, use the following guidelines for best results.


-- Partition the MMX instruction stream and the floating-point instruction stream into separate segments.

-- Exit the MMX code section with the floating-point tag word empty (via the EMMS instruction).

-- Leave the floating-point code section with an empty stack.

-- Don't rely on the contents of the MMX or floating-point registers across context switches.
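
The guidelines above can be sketched in C as two separate phases: an integer phase that does all of its MMX work and then issues a single EMMS, followed by a floating-point phase. This sketch assumes a compiler that supplies the MMX intrinsics header mmintrin.h (with _mm_adds_pi16 for a packed saturating add and _mm_empty for EMMS) and that defines __MMX__ when they are usable. Where that is not the case, the code falls back to a plain scalar loop. The function names add_samples_saturated and average are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__MMX__)
#include <mmintrin.h>   /* assumed: compiler-provided MMX intrinsics */
#endif

/* Integer phase: dst[i] = saturate(dst[i] + src[i]) for 16-bit samples.
   All MMX work is confined to this function, which issues one EMMS
   before it returns. */
static void add_samples_saturated(int16_t *dst, const int16_t *src,
                                  size_t count)
{
    size_t i = 0;

    assert(dst != NULL && src != NULL);

#if defined(__MMX__)
    for (; i + 4 <= count; i += 4) {
        __m64 a, b;

        memcpy(&a, dst + i, sizeof a);
        memcpy(&b, src + i, sizeof b);
        a = _mm_adds_pi16(a, b);        /* four saturating adds at once */
        memcpy(dst + i, &a, sizeof a);
    }
    _mm_empty();                        /* EMMS: leave the tag word empty */
#endif

    /* Scalar path: the leftover tail, or every element without MMX. */
    for (; i < count; i++) {
        int32_t sum = (int32_t)dst[i] + src[i];

        if (sum > INT16_MAX)
            sum = INT16_MAX;
        if (sum < INT16_MIN)
            sum = INT16_MIN;
        dst[i] = (int16_t)sum;
    }
}

/* Floating-point phase: called only after the integer phase is done. */
static double average(const int16_t *samples, size_t count)
{
    double total = 0.0;
    size_t i;

    assert(samples != NULL && count > 0);

    for (i = 0; i < count; i++)
        total += samples[i];
    return total / (double)count;
}
```

Processing a whole buffer per phase, rather than alternating per sample, spreads the cost of the switch over many elements.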

Data Alignment

Data alignment is critical to optimal performance on Intel processors. Misaligned accesses add costly extra clock cycles to data-access times and sap performance. To see why this is so, see the figure "Misaligned Data Wastes Cycles." If, say, a 16-bit integer value straddles a 4-byte boundary, it triples the number of cycles required to access the data.
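
Whether a given access straddles a boundary is simple arithmetic on its address and size. The helper below, a hypothetical crosses_boundary function written for this example, can be dropped into debugging code to flag misaligned accesses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if an access of `size` bytes starting at `address` crosses
   a `boundary`-byte boundary, and 0 otherwise. `boundary` must be a
   power of two, and `size` must be nonzero. */
static int crosses_boundary(uintptr_t address, size_t size, size_t boundary)
{
    assert(boundary != 0 && (boundary & (boundary - 1)) == 0);
    assert(size != 0);

    /* Offset within the current block, plus the access size, must not
       run past the end of that block. */
    return (address & (boundary - 1)) + size > boundary;
}
```

For the case in the text, a 16-bit value at an address ending in 3 gives an offset of 3 and a size of 2 against a 4-byte boundary, so the function reports a crossing.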

This problem is easily solved by simply respecting data alignment. Many compilers let you specify the alignment of variables using compiler controls. If a manual alignment of the variables is required, typically when allocating memory blocks on the fly, you can use the following C algorithm to force alignment. This routine aligns a 64-bit variable on a 64-bit boundary. Once it's aligned, every access to this variable saves three clock cycles (versus an unaligned access) on a Pentium processor.

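/* Allocate one extra element so there is room to move the pointer up. */
if (NULL == (new_ptr = malloc((new_value + 1) * sizeof(var_struct)))) {
    return;    /* allocation failed; this handling is assumed, the original omits it */
}
mem_tmp = (uintptr_t)new_ptr;    /* the address as an integer (uintptr_t from <stdint.h>) */
mem_tmp /= 8;                    /* discard the low three bits of the address */
new_tmp_ptr = (var_struct *)((mem_tmp + 1) * 8);    /* step up to the next 8-byte boundary */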
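
A variant of the same idea rounds the address up with a bit mask, which leaves an already aligned pointer where it is, and hands back the original pointer so the block can be freed later. This is a sketch under assumptions, not the authors' routine. The name alloc_aligned8 and its raw_out parameter are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MMX_ALIGNMENT 8u   /* 64 bits, expressed in bytes */

/* Allocates `size` usable bytes that start on an 8-byte boundary.
   The pointer returned by malloc() is stored in *raw_out; pass that
   pointer, not the aligned one, to free(). Returns NULL on failure. */
static void *alloc_aligned8(size_t size, void **raw_out)
{
    uintptr_t address;
    void *raw;

    assert(raw_out != NULL);
    assert(size <= SIZE_MAX - (MMX_ALIGNMENT - 1));

    /* Over-allocate so there is slack to round the address up. */
    raw = malloc(size + MMX_ALIGNMENT - 1);
    *raw_out = raw;
    if (raw == NULL)
        return NULL;

    /* Round up to the next multiple of 8 (no change if already aligned). */
    address = (uintptr_t)raw;
    address = (address + MMX_ALIGNMENT - 1) & ~(uintptr_t)(MMX_ALIGNMENT - 1);
    return (void *)address;
}
```

A caller would keep both pointers, using the aligned one for data access and the raw one for the eventual free().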

As a matter of convention, compilers allocate anything that's not declared static on the stack. When making use of such stack-allocated 64-bit data elements, it's important to ensure that the stack is aligned. The assembly code in the listing "Maintaining Stack Alignment," when placed in the function's prologue and epilogue, can force stack alignment.
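
Where the compiler offers an alignment control for local variables, the request can be made in C instead of in hand-written prologue code. The sketch below assumes a compiler that supports the C11 alignas specifier from stdalign.h, which postdates the tools discussed in this article. The function names is_aligned8 and sum_via_aligned_local are hypothetical.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* Returns 1 if the pointer sits on an 8-byte (64-bit) boundary. */
static int is_aligned8(const void *p)
{
    return ((uintptr_t)p % 8) == 0;
}

/* Copies four 16-bit samples into a stack buffer that the compiler is
   asked to place on a 64-bit boundary, then sums them. */
static int32_t sum_via_aligned_local(const int16_t samples[4])
{
    alignas(8) int16_t local[4];   /* assumed: C11 alignment specifier */
    int32_t total = 0;
    int i;

    assert(is_aligned8(local));

    for (i = 0; i < 4; i++) {
        local[i] = samples[i];
        total += local[i];
    }
    return total;
}
```

The assertion documents the requirement and catches a compiler or build setting that fails to honor the request.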

As you can see, using MMX technology to speed program execution is fairly straightforward. More on MMX technology, instructions, and coding techniques can be found at the uniform resource locator (URL) http://www.intel.com.


Maintaining Stack Alignment

Prologue:
 push ebp              ;save old frame ptr
 mov  ebp, esp         ;make new frame ptr
 sub  ebp, 4           ;make room for stack ptr
 and  ebp, 0FFFFFFF8h  ;align to 64 bits
 mov  [ebp], esp       ;save old stack ptr
 mov  esp, ebp         ;copy aligned ptr
 sub  esp, FRAMESIZE   ;allocate space

. . . callee saves state, etc.

Epilogue:

. . . callee restores state, etc.
 mov  esp, [ebp]       ;restore old stack ptr
 pop  ebp              ;restore old frame ptr
 ret




Pentium Registers Do Double Duty


Register aliasing prevents you from using MMX and floating-point instructions at the same time.


Misaligned Data Wastes Cycles


Misaligned data can triple the number of processor cycles required to fetch data.


Jonathan Khazam is the program manager for Intel's MMX technology program. Bev Bachmayer is a senior programmer in Intel's code-optimization group. You can contact them at Jonathan_Khazam@ccm.sc.intel.com and at Bev_Bachmayer@ccm.imu.intel.com, respectively.

Copyright © 2005 CMP Media LLC