
x86 Enters the Multimedia Era


July 1996 / Core Technologies / x86 Enters the Multimedia Era

Intel's new MMX instructions bring faster multimedia processing to x86-compatible CPUs.

Tom Halfhill

MMX is the most significant revision of the x86 architecture since Intel introduced the 32-bit 386 chip in 1985. Programmers get eight new registers and 57 new instructions that are optimized for multimedia tasks. Users will get better performance with video, graphics, animation, and sound. Yet the new MMX-enabled CPUs will be compatible with existing x86 software and should cost about the same as regular x86 processors without MMX technology.

Intel says it will ship the first MMX Pentium (code-named the P55C) in the fourth quarter. Next year, Intel plans to integrate MMX into all its new x86 chips, including the sixth-generation Pentium Pro. By 1998, MMX will probably be as integral to the x86 architecture as the extended 32-bit instructions that Intel added to the 386 more than a decade ago.

Other x86 vendors are adopting MMX, too. Thanks to a recent cross-licensing agreement, future x86 processors from Advanced Micro Devices and AMD's subsidiary, NexGen, should be compatible with Intel's P55C. Cyrix, another x86 vendor, has not yet licensed MMX. However, Cyrix maintains that its future CPUs will be fully compatible with MMX, either by licensing or by reverse-engineering the Intel technology.

Numerous software companies have announced support for MMX in upcoming versions of their products. These include key development tools such as Microsoft Visual C++, Watcom C++, Macromedia Director, and Criterion RenderWare. Microsoft says it will support MMX in its new Direct3D and ActiveMovie APIs.

Inside MMX

Adding new instructions to a microprocessor is easy: Define the new opcodes and add the necessary logic. But adding new instructions without disrupting software compatibility is another matter. It's a particular challenge with the x86 because backward compatibility isn't just advisable; it's mandatory.

Intel mapped the eight new MMX registers onto the existing stack of floating-point (FP) registers. There are eight general-purpose FP registers in an x86 FPU, and each one is 80 bits wide. FP values use 64 bits for the mantissa and 16 bits for the sign and exponent. MMX instructions use those 80-bit registers as a random-access file (not a push-pop stack) of eight 64-bit registers. In other words, MMX instructions use only the 64-bit mantissa portion of an FP register to store MMX operands.
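The aliasing described above can be pictured with a small C model. This is only a sketch for illustration, not a description of the actual silicon: the type and function names (fp_reg_model, fpu_model, mmx_write, mmx_read) are invented here, and the model ignores the tag bits and the FP stack-top pointer.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one 80-bit FP register (names are illustrative). */
typedef struct {
    uint64_t mantissa;       /* bits 0-63: the only part an MMX operand occupies */
    uint16_t sign_exponent;  /* bits 64-79: sign and exponent of an FP value */
} fp_reg_model;

/* The eight FP registers, which MMX code addresses directly by number. */
typedef struct {
    fp_reg_model reg[8];
} fpu_model;

/* Store a 64-bit MMX operand in register n (random access, no push or pop). */
void mmx_write(fpu_model *fpu, unsigned n, uint64_t value) {
    assert(n < 8);
    fpu->reg[n].mantissa = value;
}

/* Read the 64-bit MMX operand held in register n. */
uint64_t mmx_read(const fpu_model *fpu, unsigned n) {
    assert(n < 8);
    return fpu->reg[n].mantissa;
}
```

In this model, writing MMX register 5 touches only the mantissa field of FP register 5, which is why an OS that already saves the full FP state also saves the MMX state.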

This trick gives programmers the virtual equivalent of eight new registers without radically altering the standard x86 architecture. OS vendors don't have to modify their code to save the state of MMX registers during context switches--MMX registers look like ordinary FP registers to the OS. Clever, eh?

But there's a catch. Programmers can use MMX and FP instructions in the same program, but they'd better not interleave them, because both kinds of instructions need the same registers. When a program finishes a sequence of MMX instructions, it must mark the registers empty with a new instruction (EMMS: Empty MMX state) to make way for subsequent FP instructions. FP instructions do likewise when they pop values off the FP stack and set the registers' tag bits. If a program switches back and forth between FP and MMX instructions, it will pay a performance penalty for these register-level "context switches."

Generally, though, it shouldn't be a problem. Developers of multimedia software should segregate MMX instructions in a subroutine or library that's called only after probing the chip with the CPUID instruction to verify that it supports MMX. It makes sense to group MMX instructions into tight routines, anyway, because multimedia processing typically involves repetitive operations on long sequences of data.
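The probe might look something like the following sketch. It assumes a GCC- or Clang-style compiler that provides the cpuid.h helper header; other toolchains expose CPUID differently. The function names edx_reports_mmx and cpu_has_mmx are made up for this example. The MMX feature flag is bit 23 of the EDX value returned by CPUID function 1.

```c
#include <stdint.h>

#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>  /* GCC/Clang helper header; an assumption about the toolchain */
#endif

/* CPUID function 1 reports MMX support in bit 23 of EDX. */
#define CPUID_EDX_MMX_BIT 23

/* Returns 1 if the given EDX feature word has the MMX bit set, else 0. */
int edx_reports_mmx(uint32_t edx) {
    return (int)((edx >> CPUID_EDX_MMX_BIT) & 1u);
}

/* Returns 1 if the running CPU reports MMX, 0 otherwise (or on non-x86). */
int cpu_has_mmx(void) {
#if defined(__i386__) || defined(__x86_64__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        return 0;  /* CPUID function 1 is not available */
    }
    return edx_reports_mmx(edx);
#else
    return 0;      /* not an x86 build, so no MMX */
#endif
}
```

A program would call cpu_has_mmx() once at start-up and then select either its MMX routines or a plain integer fallback.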

Packed Operands

Even though MMX instructions use FP registers, they're all integer-type instructions. Their 64-bit operands may contain eight packed bytes, four packed 16-bit words, two packed 32-bit doublewords, or a single 64-bit quadword.

Potentially, an MMX instruction could manipulate an 80-bit packed operand if it used a whole FP register. But Intel limited the operands to 64 bits because they match the Pentium's 64-bit external data bus and internal data paths. Also, 80 isn't a power of 2, so it's more troublesome to handle.

As it is, the 64-bit operands are plenty long enough for typical multimedia jobs. Suppose a program is manipulating graphics in 8-bit color, which is often the case in games. An MMX instruction can pack eight pixels into a single operand and process them all at once. An ordinary x86 CPU can shuffle only one pixel at a time. Audio and communications programs often use 16-bit data types, so a single MMX instruction can process four of those values in a single chunk.
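To make the eight-pixels-at-once idea concrete, here is a portable C model of two of the packed byte additions: one that wraps around (as PADDB does) and one that saturates unsigned (as PADDUSB does). It is a sketch of the arithmetic only. The function names are invented, and the loop over lanes stands in for work that an MMX CPU does in one instruction.

```c
#include <stdint.h>

/* Model of PADDB: add eight packed bytes, each lane wrapping modulo 256. */
uint64_t paddb_model(uint64_t a, uint64_t b) {
    uint64_t result = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t x = (uint8_t)(a >> (8 * i));
        uint8_t y = (uint8_t)(b >> (8 * i));
        uint8_t sum = (uint8_t)(x + y);         /* carry out of the lane is dropped */
        result |= (uint64_t)sum << (8 * i);
    }
    return result;
}

/* Model of PADDUSB: add eight packed unsigned bytes, clamping each lane at 255. */
uint64_t paddusb_model(uint64_t a, uint64_t b) {
    uint64_t result = 0;
    for (int i = 0; i < 8; i++) {
        unsigned x = (uint8_t)(a >> (8 * i));
        unsigned y = (uint8_t)(b >> (8 * i));
        unsigned sum = x + y;
        if (sum > 255u) {
            sum = 255u;                         /* saturate instead of wrapping */
        }
        result |= (uint64_t)sum << (8 * i);
    }
    return result;
}
```

Saturation matters for pixels: brightening a nearly white pixel should leave it white, whereas a wraparound add would turn it dark.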

Most MMX instructions follow this pattern of performing a single operation on a series of integer values. This technique is called single instruction, multiple data (SIMD), and it lends itself to the algorithms and data types frequently found in multimedia software. Examples include MPEG compression, wavelet compression, motion compensation, motion estimation, color space conversion, texture mapping, 2-D filtering, matrix multiplication, fast Fourier transforms, discrete cosine transforms, and phoneme matching.
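Several of these algorithms, such as matrix multiplication and filtering, boil down to sums of products. The sketch below models in portable C what the PMADDWD instruction does: it multiplies four pairs of signed 16-bit words and adds adjacent products into two 32-bit results. The helper names (pack4, lane16, pmaddwd_model, dot4_model) are hypothetical, and the signed conversions assume the usual two's-complement behavior.

```c
#include <stdint.h>

/* Pack four signed 16-bit words into one 64-bit operand (w0 in the low lane). */
uint64_t pack4(int16_t w0, int16_t w1, int16_t w2, int16_t w3) {
    return  (uint64_t)(uint16_t)w0
         | ((uint64_t)(uint16_t)w1 << 16)
         | ((uint64_t)(uint16_t)w2 << 32)
         | ((uint64_t)(uint16_t)w3 << 48);
}

/* Extract signed 16-bit lane i (0..3) from a 64-bit operand. */
int16_t lane16(uint64_t v, int i) {
    return (int16_t)(uint16_t)(v >> (16 * i));
}

/* Model of PMADDWD: four 16x16 multiplies, adjacent products summed in pairs. */
uint64_t pmaddwd_model(uint64_t a, uint64_t b) {
    int32_t p[4];
    for (int i = 0; i < 4; i++) {
        p[i] = (int32_t)lane16(a, i) * (int32_t)lane16(b, i);
    }
    /* Unsigned addition so that an overflowing pair wraps instead of trapping. */
    uint32_t lo = (uint32_t)p[0] + (uint32_t)p[1];
    uint32_t hi = (uint32_t)p[2] + (uint32_t)p[3];
    return ((uint64_t)hi << 32) | lo;
}

/* Four-element dot product: one modeled PMADDWD plus a final add of the halves. */
int32_t dot4_model(uint64_t a, uint64_t b) {
    uint64_t r = pmaddwd_model(a, b);
    uint32_t sum = (uint32_t)r + (uint32_t)(r >> 32);
    return (int32_t)sum;
}
```

For example, dot4_model(pack4(1, 2, 3, 4), pack4(5, 6, 7, 8)) computes 1*5 + 2*6 + 3*7 + 4*8.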

Something else these processes have in common is a lot of potential parallelism. It's no coincidence that MMX instructions are integer operations; they're designed to exploit these characteristics. Like most other integer operations in a modern x86, the majority of MMX instructions can execute in a single cycle. MMX multiplication instructions require three cycles to execute, but the CPU can issue a new one every cycle.

Therefore, a superscalar CPU like the Pentium can execute multiple streams of MMX instructions in its parallel integer pipelines. An out-of-order CPU like the Pentium Pro can rearrange MMX instructions for maximum efficiency. The CPU doesn't need a special multimedia execution unit for MMX, so any advances that improve integer performance will benefit MMX performance as well.

One thing you won't find in the MMX instruction set is branch instructions. Branches would disrupt the instruction flow, and mispredicted branches would stall the pipelines--a particular hazard in the superpipelined Pentium Pro. Instead, there are new conditional-select instructions that perform logical operations on multiple operands. By using masks and bitwise comparisons, these instructions can achieve the same results as branches without the delays.
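The mask-and-select idea can be sketched in portable C. The example below models a chromakey step on eight 8-bit pixels: a packed compare (as PCMPEQB does) builds a mask of all ones wherever the foreground pixel equals the key color, and the logical instructions (PAND, PANDN, POR) then merge foreground and background without a per-pixel branch. The function names and the choice of key color are assumptions for illustration. The loop and conditional inside the compare model stand in for what the hardware does across all lanes at once.

```c
#include <stdint.h>

/* Model of PCMPEQB: each byte lane becomes 0xFF if equal, 0x00 if not. */
uint64_t pcmpeqb_model(uint64_t a, uint64_t b) {
    uint64_t mask = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t x = (uint8_t)(a >> (8 * i));
        uint8_t y = (uint8_t)(b >> (8 * i));
        uint64_t lane = (x == y) ? 0xFFu : 0x00u;
        mask |= lane << (8 * i);
    }
    return mask;
}

/* Models of the packed logical instructions. */
uint64_t pand_model(uint64_t dst, uint64_t src)  { return dst & src; }
uint64_t pandn_model(uint64_t dst, uint64_t src) { return ~dst & src; } /* NOT dst, AND src */
uint64_t por_model(uint64_t dst, uint64_t src)   { return dst | src; }

/* Chromakey eight pixels: take bg where fg equals the key color, else keep fg. */
uint64_t chromakey8(uint64_t fg, uint64_t bg, uint8_t key) {
    uint64_t key8    = 0x0101010101010101ULL * key;  /* key color in every lane */
    uint64_t mask    = pcmpeqb_model(fg, key8);      /* 0xFF where fg is the key */
    uint64_t from_bg = pand_model(mask, bg);         /* background shows through */
    uint64_t from_fg = pandn_model(mask, fg);        /* foreground kept elsewhere */
    return por_model(from_bg, from_fg);
}
```

With key color 0x00, foreground 0x1122004400667700, and a background of 0xAA in every lane, the three keyed pixels are replaced and the rest pass through untouched.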

On balance, it appears that Intel has achieved its goal of updating the x86 to meet the demands of modern software without jeopardizing compatibility. Intel could have squeezed out more performance by making more radical changes--for example, by adding new MMX-specific registers instead of aliasing the FP stack--but such changes would slow down the adoption of MMX. The last time Intel extensively revised the x86 architecture was 11 years ago, and most PC users are only now making the transition to 32-bit software. Intel wants MMX to catch on a little faster.


Where to Find


Intel

Santa Clara, CA
Phone:    (408) 765-8080
Internet: http://www.intel.com




What MMX Adds to Intel Instructions



Opcode Type    Mnemonic.............Description

Arithmetic     PADD[B,W,D]..........Packed add with wraparound on [byte, word, doubleword]
               PADDS[B,W]...........Packed add signed with saturation on [byte, word]
               PADDUS[B,W]..........Packed add unsigned with saturation on [byte, word]
               PSUB[B,W,D]..........Packed subtract with wraparound on [byte, word, doubleword]
               PSUBS[B,W]...........Packed subtract signed with saturation on [byte, word]
               PSUBUS[B,W]..........Packed subtract unsigned with saturation on [byte, word]
               PMULHW...............Packed multiply high on words
               PMULLW...............Packed multiply low on words
               PMADDWD..............Packed multiply on words and add resulting pairs

Comparison     PCMPEQ[B,W,D]........Packed compare for equality [byte, word, doubleword]
               PCMPGT[B,W,D]........Packed compare greater than [byte, word, doubleword]

Conversion     PACKUSWB.............Pack words into bytes (unsigned saturation)
               PACKSS[WB,DW]........Pack [words into bytes, doublewords into words] signed with saturation
               PUNPCKH[BW,WD,DQ]....Unpack high-order [bytes, words, doublewords] from MMX register
               PUNPCKL[BW,WD,DQ]....Unpack low-order [bytes, words, doublewords] from MMX register

Logical        PAND.................Packed bitwise AND
               PANDN................Packed bitwise AND NOT
               POR..................Packed bitwise OR
               PXOR.................Packed bitwise XOR

Shift          PSLL[W,D,Q]..........Packed shift left logical [word, doubleword, quadword] by MMX register or immediate value
               PSRL[W,D,Q]..........Packed shift right logical [word, doubleword, quadword] by MMX register or immediate value
               PSRA[W,D]............Packed shift right arithmetic [word, doubleword] by MMX register or immediate value

Data transfer  MOV[D,Q].............Move [doubleword, quadword] to or from MMX register

FP/MMX state   EMMS.................Empty MMX state




How MMX Does Chromakeying Without Branching


Complex multimedia processing can be done without code branches by using MMX instructions.


Tom R. Halfhill is a BYTE senior editor based in San Mateo, California. You can reach him at thalfhill@bix.com.
