pment cycles being what they are, developers need to start considering now where and how these MMX instructions can boost the performance of their applications.
Planning Considerations
Before changing a line of code, the first thing you should do is profile your application. Profiling is the process of linking special libraries into your program or using system utilities to measure where your program spends most of its execution time.
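Where no profiling tool is at hand, a rough first measurement can be taken by timing a suspect routine directly. The following is a minimal sketch that uses the standard C clock() function; the routine scale_samples() and the helper time_scale_samples() are hypothetical names invented for this illustration, not part of any profiling library.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical hot-spot candidate: scales 16-bit samples by gain/256. */
static void scale_samples(short *samples, size_t count, int gain)
{
    size_t i;
    for (i = 0; i < count; i++)
        samples[i] = (short)((samples[i] * gain) >> 8);
}

/* Returns the CPU time, in seconds, used by `repeats` calls of scale_samples(). */
static double time_scale_samples(short *samples, size_t count, int gain, int repeats)
{
    clock_t start, stop;
    int r;

    assert(repeats > 0);
    start = clock();
    for (r = 0; r < repeats; r++)
        scale_samples(samples, count, gain);
    stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Calling time_scale_samples() with a large repeat count and comparing the result against the application's total run time gives a first estimate of how much the routine matters. A real profiler remains the better tool, because it measures every routine at once.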
Generally, you want to work on those code segments that are computationally expensive or that take a sizable percentage of the application-processing time. In multimedia and communications applications, such code sections typically include filters and speech-compression algorithms, video-display routines, and rendering routines.
In general, such routines consist of small, repetitive loops that operate on 8- or 16-bit integers. It is these routines that yield the greatest overall performance increase when converted to MMX-optimized code.
Such algorithms need to be analyzed for their fit with MMX instructions. The MMX technology adds 57 new opcodes, designed to do high-speed arithmetic, logical, and comparison operations on packed data. As mentioned above, these MMX instructions offer the best support for 8- and 16-bit integer data types.
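To make "packed data" concrete, the sketch below emulates in portable C what a packed, saturating byte add does: eight unsigned 8-bit values share one 64-bit word, and each one is added and clamped independently. This is only an illustration of the data layout, and the function name is hypothetical. On an MMX processor, the PADDUSB instruction performs the same eight additions in a single instruction.

```c
#include <stdint.h>

/* Adds eight unsigned 8-bit lanes packed into 64-bit words,
   clamping each lane at 255 instead of letting it wrap around. */
static uint64_t packed_add_u8_saturate(uint64_t a, uint64_t b)
{
    uint64_t result = 0;
    int lane;

    for (lane = 0; lane < 8; lane++) {
        unsigned x = (unsigned)((a >> (8 * lane)) & 0xFF);
        unsigned y = (unsigned)((b >> (8 * lane)) & 0xFF);
        unsigned sum = x + y;

        if (sum > 255)
            sum = 255;                      /* saturate this lane */
        result |= (uint64_t)sum << (8 * lane);
    }
    return result;
}
```

The loop makes the cost of the scalar approach visible: eight extract, add, clamp, and insert steps for work that a packed instruction does at once.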
In some cases, it's possible to improve an algorithm's performance by rewriting it to use MMX instructions. For example, suppose a multimedia algorithm uses integer data. The first step is to use a profiling tool to identify which parts of the algorithm consume the most processor cycles. Once such "hot spots" are identified, you rewrite these code sections to use MMX integer instructions.
Floating-Point or Integer?
If an algorithm employs floating-point data, you should determine why it was used. Floating-point math is typically employed for one of two reasons. The first is for performance, since floating-point multiplies are about three times faster than standard integer multiplies in the Pentium. The second reason is that the algorithm in question requires a large range or lots of precision in its results.
If the algorithm uses floating-point math to obtain better performance, then it's certainly a candidate for conversion to MMX integer code. On the other hand, if the algorithm requires the range or precision that floating-point data offers, further investigation must be done. Can the algorithm's data values be converted to integer while maintaining the required range and precision? If so, you might rework the algorithm to take advantage of the MMX instructions.
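One common way to answer that question is to try a fixed-point representation. The sketch below assumes the algorithm's values fit in the range [-1.0, 1.0) and uses the Q15 format (1 sign bit, 15 fraction bits) in a 16-bit integer. The function names are hypothetical, and whether 15 bits of precision is enough depends entirely on the algorithm.

```c
#include <stdint.h>

/* Converts a value to Q15, rounding to nearest and clamping
   anything outside the representable range. */
static int16_t float_to_q15(double x)
{
    double scaled = x * 32768.0;

    if (scaled >= 32767.0)
        return 32767;
    if (scaled <= -32768.0)
        return -32768;
    return (int16_t)(scaled >= 0.0 ? scaled + 0.5 : scaled - 0.5);
}

/* Converts a Q15 value back to floating point. */
static double q15_to_float(int16_t q)
{
    return q / 32768.0;
}

/* Multiplies two Q15 values. The 32-bit product carries 30 fraction
   bits, so dividing by 2^15 returns it to Q15. */
static int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t product = ((int32_t)a * (int32_t)b) / 32768;

    if (product > 32767)        /* only -1.0 * -1.0 can overflow */
        product = 32767;
    return (int16_t)product;
}
```

For example, 0.5 becomes 16384 and 0.25 becomes 8192; their Q15 product is 4096, which converts back to 0.125. Running representative data through both the floating-point and fixed-point versions and comparing the results is a practical test of whether the conversion preserves enough precision.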
When writing MMX code, it's important to keep in mind that the processor aliases the 64-bit MMX registers over the 80-bit floating-point registers, as shown in the figure "Pentium Registers Do Double Duty". This sleight of hand allows the addition of eight 64-bit, directly addressable MMX registers without adding any new processor states or compromising software compatibility.
Because the registers are physically the same, however, you can't store both floating-point data and packed-integer data in the same register at the same time. In addition, there's a small amount of processor overhead (several tens of clocks) when switching between floating-point and MMX instructions. To keep this overhead from sapping application performance, don't intermix floating-point and MMX code at the instruction level. If an application frequently switches between floating-point and MMX instructions, then you should consider extending the period that the application stays in either the MMX instruction stream or the floating-point instruction stream; this procedure will better amortize the switching overhead.
Because floating-point convention specifies that the floating-point stack be cleared after use, it's important to mark the MMX registers as empty before issuing a floating-point instruction. The EMMS instruction is designed for just this purpose; it empties the MMX state by setting the value of the floating-point tag word to empty (i.e., all 1s). This instruction is the MMX technology's equivalent of popping floating-point values off the stack to leave it empty. The EMMS instruction should be inserted at the end of all MMX code segments to avoid a floating-point stack overflow exception in the floating-point code that follows.
When writing an application that uses both floating-point and MMX instructions, use the following guidelines for best results.
-- Partition the MMX instruction stream and the floating-point instruction stream into separate segments.
-- Exit the MMX code section with the floating-point tag word empty (via the EMMS instruction).
-- Leave the floating-point code section with an empty stack.
-- Don't rely on the contents of the MMX or floating-point registers across context switches.
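The guidelines above can be sketched in C using the compiler's MMX intrinsics. This sketch assumes a compiler that supplies the <mmintrin.h> header and defines the __MMX__ macro when MMX is available (as GCC and Clang do on x86); on other compilers or processors it falls back to plain scalar code. The function names are hypothetical. All packed-integer work happens in one function, which issues EMMS (through _mm_empty()) before it returns, and all floating-point work happens in another.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__MMX__)
#include <mmintrin.h>   /* __m64, _mm_adds_pi16, _mm_empty */
#endif

/* Integer segment: dst[i] = a[i] + b[i] with signed 16-bit saturation. */
static void add_samples_saturated(int16_t *dst, const int16_t *a,
                                  const int16_t *b, size_t count)
{
    size_t i = 0;

#if defined(__MMX__)
    for (; i + 4 <= count; i += 4) {        /* four 16-bit lanes per step */
        __m64 va, vb, vsum;
        memcpy(&va, a + i, sizeof va);
        memcpy(&vb, b + i, sizeof vb);
        vsum = _mm_adds_pi16(va, vb);
        memcpy(dst + i, &vsum, sizeof vsum);
    }
    _mm_empty();                            /* EMMS: leave the tag word empty */
#endif
    for (; i < count; i++) {                /* scalar tail, or full fallback */
        int sum = a[i] + b[i];
        if (sum > 32767)  sum = 32767;
        if (sum < -32768) sum = -32768;
        dst[i] = (int16_t)sum;
    }
}

/* Floating-point segment: runs only after the integer segment has finished. */
static double average_samples(const int16_t *samples, size_t count)
{
    double total = 0.0;
    size_t i;

    for (i = 0; i < count; i++)
        total += samples[i];
    return count ? total / (double)count : 0.0;
}
```

Processing a whole buffer per call, rather than switching on every sample, is what spreads the switching overhead over many operations.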
Data Alignment
Data alignment is critical to optimal performance on Intel processors. Misaligned accesses add costly extra clock cycles to data-access times and sap performance. To see why this is so, see the figure "Misaligned Data Wastes Cycles". If, say, a 16-bit integer value straddles a 4-byte boundary, it triples the number of cycles required to access the data.
This problem is easily solved by respecting data alignment. Many compilers let you specify the alignment of variables using compiler controls. If manual alignment of the variables is required, typically when allocating memory blocks on the fly, you can use the following C fragment to force alignment. It allocates one element more than needed and then rounds the returned address up to the next 64-bit boundary, so that a 64-bit variable stored there is aligned. Once it's aligned, every access to this variable saves three clock cycles (versus an unaligned access) on a Pentium processor.
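if (NULL == (new_ptr = malloc((new_value + 1) * sizeof(var_struct)))) {
    /* handle the allocation failure here */
}
/* mem_tmp is an integer assumed wide enough to hold a pointer */
mem_tmp = (unsigned long) new_ptr;
mem_tmp /= 8;                                      /* 8-byte block that contains new_ptr */
new_tmp_ptr = (var_struct *) ((mem_tmp + 1) * 8);  /* start of the next 8-byte block */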
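The same idea can be written as a small self-contained helper. The sketch below is a variation rather than the fragment above: it uses the C99 uintptr_t type and a bit mask instead of divide and multiply, it leaves an address that is already aligned where it is, and it hands back the original pointer so the block can be freed. The function names are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

/* Rounds an address up to the next multiple of 8; an address that is
   already a multiple of 8 is returned unchanged. */
static void *align_up_8(void *p)
{
    uintptr_t addr = (uintptr_t)p;
    return (void *)((addr + 7u) & ~(uintptr_t)7u);
}

/* Allocates room for `count` 64-bit elements starting on an 8-byte
   boundary. *raw_out receives the pointer that must be passed to free(). */
static uint64_t *alloc_aligned_u64(size_t count, void **raw_out)
{
    void *raw = malloc(count * sizeof(uint64_t) + 7);   /* 7 spare bytes */

    *raw_out = raw;
    return raw ? (uint64_t *)align_up_8(raw) : NULL;
}
```

Only the raw pointer may be released with free(); passing the aligned pointer to free() is an error whenever the two differ.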
As a matter of convention, compilers allocate anything that's not declared static on the stack. When making use of such automatic (stack-allocated) 64-bit data elements, it's important to ensure that the stack is aligned. The C code in the listing "Maintaining Stack Alignment", when placed in the function's prologue and epilogue, can force stack alignment.
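With a compiler that supports C11, an alternative to hand-written prologue and epilogue code is to request the alignment of an individual local variable with the standard _Alignas specifier. The sketch below is that alternative, not the listing referred to above, and the function name is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Copies eight bytes through a local buffer that the compiler is asked
   to place on an 8-byte boundary. Returns the buffer's offset from that
   boundary, which should be 0. */
static unsigned copy_via_aligned_local(const unsigned char *src,
                                       unsigned char *dst)
{
    _Alignas(8) unsigned char packed[8];    /* C11 alignment request */

    memcpy(packed, src, sizeof packed);
    memcpy(dst, packed, sizeof packed);
    return (unsigned)((uintptr_t)packed % 8);
}
```

The compiler then takes on the job of keeping the stack slot aligned, which avoids maintaining the alignment by hand in each function.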
As you can see, using MMX technology to speed program execution is fairly straightforward. More on MMX technology, instructions, and coding techniques can be found at the uniform resource locator (URL) http://www.intel.com.