The VideoRISC Compression Architecture enables real-time MPEG 1 and 2 video encoding and decoding
Peter Wayner
One of the most challenging feats for any desktop computer is the successful display of digital-video images from sources such as CD-ROM, the airwaves, or a LAN-based video conference. Full-motion video leaves no room for pauses or glaring errors. An operating system may take a few seconds to start up a program or write out a file to disk, but full-motion video needs to hit the screen 30 frames per second, every second.
The newest entry in the mad dash for digital video is a scheme by C-Cube Microsystems (Milpitas, CA) dubbed VideoRISC Compression Architecture. The heart of VideoRISC is the VideoRISC Compression Processor, or VCP. It can compress and decompress video signals fast enough for you to enjoy full-screen, real-time video on your computer. Before, you had to rely on expensive, dedicated hardware for this level of video quality or sacrifice resolution, the number of colors, or the frame rate. Most likely, you'd compromise on all three.
The VCP will allow vendors to scale both the capability and the price of video hardware. For example, it will allow easier implementation of videoconferencing at the high end. At the low end, it will allow CD-ROM drives to display high-quality animations in real time, a feat that their limited bandwidth makes impossible while using uncompressed video. (At 640- by 480-pixel resolution and 24 bits per pixel, you require a bandwidth of over 27 MBps to handle real-time video. Double-speed CD-ROM players deliver 300 KBps.)
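The bandwidth figure above is easy to check. The short Python sketch below multiplies out the numbers quoted in the text; treating 1 KB as 1,000 bytes and the names used are assumptions made for illustration.

```python
# Raw bandwidth for uncompressed video, using the figures quoted in the text.
WIDTH, HEIGHT = 640, 480             # pixels
BYTES_PER_PIXEL = 24 // 8            # 24 bits per pixel
FRAMES_PER_SECOND = 30
CDROM_BYTES_PER_SECOND = 300 * 1000  # double-speed CD-ROM; 1 KB taken as 1,000 bytes (assumption)


def raw_video_rate(width, height, bytes_per_pixel, fps):
    """Bytes per second needed to move uncompressed frames."""
    return width * height * bytes_per_pixel * fps


rate = raw_video_rate(WIDTH, HEIGHT, BYTES_PER_PIXEL, FRAMES_PER_SECOND)
ratio = rate / CDROM_BYTES_PER_SECOND

print(f"raw video: {rate / 1e6:.1f} MBps")
print(f"reduction needed to fit a CD-ROM stream: about {ratio:.0f} to 1")
```

The raw stream works out to 27,648,000 bytes per second, roughly 92 times what a double-speed drive delivers.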
But effective video compression has many other applications as well. With VCP, cable companies can fit 50 times as many channels on their digital networks. Satellites can handle 50 times as many signals. The market for other machines, such as boxes that decompress video signals from your cable company, could be substantially larger. Given the potential size of these markets, it is quite possible that the VCP could become more important than microprocessors such as the 486.
Starting with Standards
The most popular method for compressing video signals is MPEG, a derivation of the popular JPEG standard used to compress and decompress still images. MPEG 1 handles SIF (source input format) resolution signals of 360 by 240 pixels, while MPEG 2 handles broadcast-quality 720- by 480-pixel signals. When linked in parallel, VCPs can encode such signals in real time. It takes two VCPs to encode real-time MPEG 1, eight to encode MPEG 2.
MPEG compresses consecutive frames by making the first frame a reference frame. It then finds the difference between this frame and the rest of the frames and compresses this difference.
MPEG computes the difference by breaking the frame into 8- by 8-pixel blocks and searching for the best match for these pixels in the reference frame. It compresses the difference using a technique called DCT (Discrete Cosine Transform), which is similar to the one used in JPEG. Once computed, the coefficients are then Huffman-coded to produce the final signal that is often one-tenth to one-twentieth the size of the original.
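To make the DCT step concrete, here is a minimal Python sketch of a textbook 2-D DCT-II on one 8- by 8-pixel block. It is a naive reference version written for readability, not the method the VCP uses, and the function name is an assumption.

```python
import math

N = 8  # MPEG and JPEG both transform 8-by-8 blocks


def dct_2d(block):
    """Naive 2-D DCT-II of an N-by-N block (a list of N rows of N numbers)."""
    def scale(k):
        # Orthonormal scaling: the DC term gets a smaller factor.
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):          # vertical frequency
        for v in range(N):      # horizontal frequency
            total = 0.0
            for y in range(N):
                for x in range(N):
                    total += (block[y][x]
                              * math.cos((2 * y + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * x + 1) * v * math.pi / (2 * N)))
            out[u][v] = scale(u) * scale(v) * total
    return out


# A perfectly flat block has no detail, so only the first (DC) coefficient survives.
flat = [[100] * N for _ in range(N)]
coeffs = dct_2d(flat)
```

A block with little detail produces mostly near-zero coefficients, which is what lets the later coding stage shrink the data.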
MPEG includes several important functions that are difficult to implement on a general-purpose CPU. When each 8- by 8-pixel block is compared to the reference frame, the best match may not be in the corresponding location, because objects often move across the screen. To get high-compression ratios, MPEG needs to take advantage of this redundant data even though it has moved in relation to the reference frame. It uses a computationally intensive search procedure to find such redundancies. Unlike general-purpose CPUs, the VCP has a special functional unit devoted to this search. It also has a functional unit dedicated to the Huffman coding that forms the last step in the MPEG algorithm.
Chip Basics
At the core of the VCP is a RISC microprocessor that runs a small, embedded operating system. Even though you could run many different jobs on this processor (including most software for basic machines), the structure is tuned to the MPEG algorithms.
The internal architecture of the RISC core is similar in some respects to that of many of the DSPs (digital signal processors) on the market. DSPs are popular for sound processing--which, like video processing, is an analog encoding/decoding chore--so the similarity should not be surprising. The Fourier transform that DSPs use to generate reverberation or other sonic novelties is similar to DCT.
The VCP chip can function as both a general CPU and a DSP at the same time. The backbone of the chip is the processing pipeline, which forks where the processing path splits into a RISC half and a DSP half. All instructions are preprocessed in a similar way in the first part of the pipeline. After the split, however, standard arithmetic instructions flow down one fork, while DSP-specific instructions flow down the other.
The four initial stages that process all instructions include Fetch 1, where the instruction is retrieved from the cache; Fetch/Jump, where the fetch is completed and a jump is executed if the instruction is a jump; Read/Decode, where the operands from the registers are retrieved and the instruction is decoded; and Execute, where instruction execution begins.
The simple arithmetic instructions (e.g., addition, subtraction, AND, OR, XOR, and arithmetic shifts) complete in the Execute stage and move to a Writeback stage. The more complicated DSP instructions move from the Execute stage to the DSP fork of the pipeline, which uses three stages to complete the instructions.
The branch of the pipeline used for the complicated instructions is where most of the VCP's power lies. The canonical DSP instruction is the MAC (multiply/accumulate) instruction, in which two numbers are multiplied together and the product is added to an accumulator register. MAC operations are frequently used in signal processing, and DSP designers concentrate on making them as fast as possible. In many cases, the small, tight loops of DSP programs repeat MAC instructions many times to find a large sum. The VCP is optimized for these computations.
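The tight loop described above looks like this when written out. The Python sketch below is only an illustration of the multiply/accumulate pattern; the function name and sample values are made up.

```python
def mac_loop(samples, coefficients):
    """Dot product written as repeated multiply/accumulate steps."""
    acc = 0                       # stands in for the accumulator register
    for s, c in zip(samples, coefficients):
        acc += s * c              # one MAC: multiply, then add to the accumulator
    return acc


result = mac_loop([1, 2, 3, 4], [10, 20, 30, 40])
print(result)  # 300
```

On a DSP each pass through the loop body is a single instruction, which is why filters and transforms built from such sums run so quickly there.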
In addition to optimizing a MAC instruction, the VideoRISC includes many functions that are required by the MPEG algorithms but not found in general-purpose DSP chips. For example, one command computes the spatial frequency of 8 bytes by finding the sum of the squares of the differences between pairs of the bytes. This is an integral part of the DCT. A normal processor would be slowed down because splitting the two 32-bit quantities that the memory system delivers into bytes would probably take the same amount of time as the actual computation.
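A rough Python sketch of what such a spatial-frequency command might compute follows. The text does not say which bytes are paired, so pairing each byte with its neighbor is an assumption, as are the function names and the big-endian byte order.

```python
def unpack_bytes(word_hi, word_lo):
    """Split two 32-bit words into 8 unsigned bytes, most significant first."""
    return list(word_hi.to_bytes(4, "big") + word_lo.to_bytes(4, "big"))


def spatial_frequency(word_hi, word_lo):
    """Sum of squared differences between neighboring bytes (assumed pairing)."""
    b = unpack_bytes(word_hi, word_lo)
    return sum((b[i] - b[i + 1]) ** 2 for i in range(len(b) - 1))


flat = spatial_frequency(0x80808080, 0x80808080)   # uniform gray: no detail
busy = spatial_frequency(0x00FF00FF, 0x00FF00FF)   # alternating black and white
```

Note how much of the code is unpacking rather than arithmetic. That overhead is what a dedicated instruction removes.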
Another set of instructions averages two different 32-bit quantities in a variety of ways. One instruction will find the average of two 32-bit numbers; another will split the 32-bit words into half-words and find the average of four 16-bit numbers; and a third will average the 8 bytes. All these extra instructions prove to be very useful in computing the DCT.
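One plausible reading of these averaging instructions is a lane-by-lane average of two 32-bit words, where a lane is 32, 16, or 8 bits wide. The Python sketch below follows that reading; the lane-wise interpretation, the truncating rounding, and all names are assumptions rather than documented VCP behavior.

```python
def split_lanes(word, lane_bits):
    """Split a 32-bit word into unsigned lanes of lane_bits each, high lane first."""
    mask = (1 << lane_bits) - 1
    return [(word >> shift) & mask
            for shift in range(32 - lane_bits, -1, -lane_bits)]


def join_lanes(lanes, lane_bits):
    """Reassemble lanes (high lane first) into a single word."""
    word = 0
    for lane in lanes:
        word = (word << lane_bits) | lane
    return word


def packed_average(a, b, lane_bits):
    """Average two 32-bit words lane by lane, discarding the remainder (assumed rounding)."""
    lanes = [(x + y) // 2
             for x, y in zip(split_lanes(a, lane_bits), split_lanes(b, lane_bits))]
    return join_lanes(lanes, lane_bits)


a, b = 0x00FF00FF, 0x01000100
for bits in (32, 16, 8):
    print(bits, hex(packed_average(a, b, bits)))
```

The same pair of words gives a different answer at each lane width because carries cannot cross a lane boundary, which is why a processor needs separate instructions for each case.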
Although the VCP has many complex computational instructions, it still qualifies as a RISC core because the extra instructions can only access the registers. They can't load information directly from the memory for their operation. This means that a compiler (or the machine-language programmer) can still rearrange the loads and the computations so that there are a minimum of conflicts.
The Motion Estimator
Estimating motion, or changes, from frame to frame is one of the most common bottlenecks in the MPEG compression routines. The algorithm looks for sections of the screen that move from one position to another between frames. This small amount of motion is present whenever a camera pans across a scene or when a person or object moves across the background.
The motion estimator is essentially another processor that runs on its own. Its basic function is to take a rectangle of pixels in one frame of the video and compare it to a reference frame to find the change in horizontal and vertical position that will make the best match. The quality of the match is judged by positioning the rectangle over each possible displacement in the reference frame and summing the differences between the pixels that overlap. If an exact match is found, there will be no difference between the source pixels and the ones in the reference frame, and the sum will be zero.
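In software, the search just described is usually written as an exhaustive block match. The Python sketch below scores each candidate by the sum of absolute differences, which is an assumption about how the differences are summed; the function names and the test pattern are also invented for illustration.

```python
def sad(current, reference, bx, by, rx, ry, size):
    """Sum of absolute differences between a block and a displaced candidate."""
    return sum(abs(current[by + y][bx + x] - reference[ry + y][rx + x])
               for y in range(size) for x in range(size))


def best_displacement(current, reference, bx, by, size, search):
    """Try every displacement within +/-search pixels; return (dx, dy, score)."""
    height, width = len(reference), len(reference[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = bx + dx, by + dy
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue  # candidate falls outside the reference frame
            score = sad(current, reference, bx, by, rx, ry, size)
            if best is None or score < best[2]:
                best = (dx, dy, score)
    return best


# Reference frame with a distinctive pattern; the current frame is the same
# picture moved 2 pixels right and 1 pixel down.
reference = [[x * 3 + y * 50 for x in range(16)] for y in range(16)]
current = [[reference[y - 1][x - 2] if x >= 2 and y >= 1 else 0
            for x in range(16)] for y in range(16)]
match = best_displacement(current, reference, bx=6, by=6, size=8, search=4)
```

Here the best match lies 2 pixels left and 1 pixel up in the reference frame, with a score of zero. Every candidate costs 64 subtractions, so the work grows quickly with the search range.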
The programmer can set the range of this search procedure to a flexible area of the reference frame. The chip can also calculate the best displacement in half-pixel increments, because it has the ability to interpolate between neighboring pixels.
Once the motion estimator receives the coordinates of the two frames and their location in memory, it finds the best displacement estimation. When done, it will raise an interrupt, and the main CPU will be able to get the right solution from the register. The half-pixel interpolation is done in a special part of the motion estimator, not with the averaging functions in the main CPU.
To overcome performance bottlenecks involved in accessing main memory, the motion estimator has its own memory that holds a 16- by 32-pixel subset of the reference frame and a 32- by 8-pixel subset of the frame being compared. The MPEG algorithm itself compares 8- by 8-pixel blocks of data to all possible displacements in a 40- by 24-pixel block of the reference frame. To implement this function, the VCP performs a number of comparisons concurrently. It loads four blocks of the frame being processed into the 32- by 8-pixel memory and eight blocks of the reference frame into the 16- by 32-pixel memory. The four blocks are then compared against the reference frame memory, and the best result is stored in a register.
The search then proceeds as the VCP loads in a new 16- by 32-pixel block of the reference frame and compares the four blocks to this block from the reference frame. If any of the blocks find a better match in this region, the better displacement vectors replace the ones currently in the registers. Half of this block (8 by 32 pixels) is a duplicate of the last block from the reference frame, because the best alignment might lie across the boundary. This process is repeated twice more. At this time, the registers hold the best motion displacement estimate for all four blocks. The motion estimator now generates an interrupt for the main processor.
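The phased search can be modeled in one dimension. The Python sketch below assumes the 16-pixel-wide reference windows step across the 40-pixel side of the search area in 8-pixel moves, which matches the half-block overlap and the four loads described in the text. The function names and the scoring callback are hypothetical.

```python
def phase_windows(search_width, window_width, overlap):
    """Left edges of successive windows that together span search_width."""
    step = window_width - overlap
    return list(range(0, search_width - window_width + 1, step))


def phased_search(score_fn, search_width=40, window_width=16, overlap=8, block=8):
    """Keep the best (score, offset) seen while stepping through the windows."""
    best = None
    for left in phase_windows(search_width, window_width, overlap):
        # Inside one window an 8-pixel block can sit at 9 different offsets.
        for offset in range(left, left + window_width - block + 1):
            score = score_fn(offset)
            if best is None or score < best[0]:
                best = (score, offset)   # a better match replaces the stored one
    return best


windows = phase_windows(40, 16, 8)
# Hypothetical scoring function whose best match sits at offset 19.
best = phased_search(lambda offset: abs(offset - 19))
```

The windows start at 0, 8, 16, and 24, so four loads cover all 40 pixels. Offsets on a shared boundary are scored twice in this model, which is harmless because the stored best only changes on a strictly better score.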
Although the process of doing four searches simultaneously might seem a bit strange, the design optimizes the memory-access strategy. Loading the reference block into on-chip memory makes access fast. This is important, because many parts of the reference block will be compared to all 64 pixels in each 8- by 8-pixel block. Loading four 8- by 8-pixel blocks at once makes sense, because many of these pixels will also be compared against all 64 pixels in each of the four blocks.
How important is the motion estimator? Steve Purcell, C-Cube Fellow and the chief architect of the chip, says that it would take about 2000 MIPS of processing power to duplicate the work done by the motion estimator, roughly the cumulative might of 18 Intel Pentium processors. This is because the chip is able to chain together the work of 32 logical units that are doing part of each comparison in parallel. The computational work is so regular that it is easy to do in parallel.
After motion estimation is complete, the VCP uses special functional units for processing the last layer of encoding. In this layer, the 64 coefficients that the DCT computes for each 8- by 8-pixel block must be compressed one last time by using a variable-length encoding scheme. This method gives common values short codes and rare values the longer ones. The net effect is that the entire transmission shrinks in size.
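The idea behind variable-length coding can be shown with a small Huffman code builder. The Python sketch below builds a code from the symbol counts of one made-up set of 64 coefficients. That is a simplification chosen for illustration, since MPEG encoders normally work from predefined code tables, and the names and sample values are assumptions.

```python
import heapq
from collections import Counter


def huffman_code(symbols):
    """Build a prefix code (symbol -> bit string) from a sequence of symbols."""
    counts = Counter(symbols)
    if len(counts) == 1:                      # degenerate case: one distinct symbol
        return {next(iter(counts)): "0"}
    # Heap entries are (weight, tie_breaker, {symbol: code_so_far}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)       # the two rarest groups...
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))  # ...become one group
        tie += 1
    return heap[0][2]


# 64 made-up coefficients: mostly zeros, a few small values, one outlier.
coefficients = [0] * 40 + [1] * 12 + [-1] * 8 + [5] * 3 + [-17]
code = huffman_code(coefficients)
encoded = "".join(code[c] for c in coefficients)
```

Because zeros dominate, they receive a 1-bit code while the lone outlier gets the longest one, and the whole block takes far fewer bits than a fixed-width encoding would.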
The Final Results
The VCP has two functional units for handling this process, one for compression and the other for decompression. Both act as smart buffers that hold the incoming and outgoing data until it is needed and transform it while it waits. The incoming buffer, for instance, waits until it has the coefficients for an entire frame before passing them on to the main CPU, which assembles the digitized image.
The main CPU could compute this information. Most of the standard compression programs for PCs use some form of Huffman encoding, but it is inefficient to do this 1 bit at a time. Most machines handle variable-length words poorly because they are optimized to load and store values aligned on word boundaries in standard 16- or 32-bit sizes--not arbitrary runs of bits--and the difference is significant enough to merit the additional functional units.
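To see why variable-length output is awkward for a word-oriented processor, consider what software has to do to pack such codes into bytes. The Python sketch below is a generic bit packer, not a description of the VCP's hardware; the class name and the sample codes are invented.

```python
class BitWriter:
    """Pack variable-length codes into bytes, most significant bit first."""

    def __init__(self):
        self.buffer = bytearray()
        self.acc = 0        # bits not yet flushed
        self.nbits = 0      # number of valid bits in acc

    def write(self, value, length):
        """Append the low `length` bits of `value`."""
        self.acc = (self.acc << length) | (value & ((1 << length) - 1))
        self.nbits += length
        while self.nbits >= 8:                  # emit whole bytes as they fill
            self.nbits -= 8
            self.buffer.append((self.acc >> self.nbits) & 0xFF)
        self.acc &= (1 << self.nbits) - 1       # keep only the leftover bits

    def finish(self):
        """Pad the final partial byte with zeros and return the packed bytes."""
        if self.nbits:
            self.buffer.append((self.acc << (8 - self.nbits)) & 0xFF)
            self.acc = self.nbits = 0
        return bytes(self.buffer)


writer = BitWriter()
for value, length in [(0b0, 1), (0b10, 2), (0b110, 3), (0b1111, 4)]:
    writer.write(value, length)
packed = writer.finish()
print(packed.hex())  # 5bc0
```

Every code costs several shifts, masks, and a loop test before a single byte is written, which is the per-symbol overhead a dedicated functional unit avoids.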
The Memory Hierarchy
Most processing chips focus their attention on one stream of instructions that must be done in sequential order. In contrast, the work going on in the center of the VCP is more like a three-ring circus: Different functional units on the chip need to access both the main DRAM holding the images and the video I/O streams. The memory hierarchy is tuned to make it easier for the chip to bring information on and off the chip successfully.
Like most general-purpose CPUs, the VCP uses caching to speed up memory access. It uses an instruction cache and a data cache to handle instruction and data flow to the CPU pipeline. The data flowing in and out of the variable-size compressor and decompressor bypasses this cache, because it is unlikely that any of this information will be used again. Putting the cache between these units and main memory would just fill the cache with nonreusable data and add complexity to the cache circuitry.
Splitting off this data stream also allows the cache to be much more efficient. The VCP cache achieves hit rates of nearly 100 percent, because the programmer can anticipate the needs of the program perfectly. In many cases, the programmer can request data almost 100 cycles before it is needed to give the memory system ample time to fulfill the request.
The memory-access circuitry is also flexible enough to access images stored in different formats. For instance, it is common to store a bit map in row-major order, where each 32-bit word contains 4 bytes that are next to each other on the same row. The VCP, however, often converts bit maps into a format that stores 4 bytes from a 2- by 2-byte grid into one 32-bit word. Some of the special CPU instructions for computing statistics such as spatial frequency use this format. The memory circuitry is designed to read and write blocks of data in either format, so it is possible to import data in row-major order, operate on it in 2- by 2-byte block format, and then rewrite it out in row-major order without doing complicated rewriting. The CPU doesn't need to worry about this, because the memory hardware automatically rearranges the bytes.
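The rearrangement between the two layouts is simple to state in code. The Python sketch below converts a row-major byte image into groups of 4 bytes taken from 2- by 2-byte grids and back again. The byte order inside each group and the function names are assumptions made for illustration.

```python
def row_major_to_2x2(pixels, width, height):
    """Regroup a row-major byte image so each word holds a 2-by-2 block.

    Returns a list of 4-tuples: (top-left, top-right, bottom-left, bottom-right).
    The order of bytes inside each word is an assumption.
    """
    if width % 2 or height % 2:
        raise ValueError("width and height must be even")
    words = []
    for y in range(0, height, 2):
        for x in range(0, width, 2):
            top = y * width + x
            bottom = top + width
            words.append((pixels[top], pixels[top + 1],
                          pixels[bottom], pixels[bottom + 1]))
    return words


def blocks_2x2_to_row_major(words, width, height):
    """Inverse of row_major_to_2x2."""
    pixels = [0] * (width * height)
    it = iter(words)
    for y in range(0, height, 2):
        for x in range(0, width, 2):
            tl, tr, bl, br = next(it)
            top = y * width + x
            bottom = top + width
            pixels[top], pixels[top + 1] = tl, tr
            pixels[bottom], pixels[bottom + 1] = bl, br
    return pixels


image = list(range(16))                  # a 4-by-4 image in row-major order
blocks = row_major_to_2x2(image, 4, 4)
restored = blocks_2x2_to_row_major(blocks, 4, 4)
```

In software this costs an index calculation for every byte. Doing it in the memory hardware means the CPU sees whichever layout an instruction needs at no extra cost.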
Toward Tomorrow
In recent years, the relentless speed improvements of general-purpose RISC chips have made many special-purpose hardware implementations obsolete. The high cost of developing hardware with only a limited market could rarely compete with the ease of using RISC chips developed for larger markets. Video compression and decompression, though, require so many complicated instructions that it is often impossible to do the job in real time without a $100,000 machine.
The VCP represents an excellent fusion of specialized hardware and the ability to perform general mathematical functions. The designers deliberately left extra programmability in each of the functional units to match different MPEG implementations. Because MPEG is not completely specified--it is a combination of a set of guidelines and a final format--it is entirely possible that the MPEG compressors from different companies will generate output with different qualities. Everyone is free to implement the encoding algorithms differently. For instance, the VCP lets you limit the motion estimator to 8- by 8-pixel blocks, because many MPEG implementations work at this level.
This flexibility is important. For example, it lets some companies use a less complicated compression algorithm that is easier for a general-purpose processor to decompress. The algorithm would still need the power of the VCP and its multiple functional units for compression, but it wouldn't need the VCP for decompression. This lets companies offer video systems at different capabilities and price points. That, in turn, hastens the day when video will become a common data format on your system.
Illustration: INSIDE THE VIDEORISC COMPRESSION PROCESSOR (block diagram; labeled units: video interface, instruction cache, data cache, motion estimator, CPU, DSP, variable-length encoder, variable-length decoder)
Illustration: The VCP doesn't store the entire 40- by 24-pixel block internally because such a large block would slow processing throughput; thus, the hardware compares the blocks in four phases.
Illustration: The MPEG algorithm searches a 40- by 24-pixel block surrounding a single 8- by 8-pixel block to find the best alignment between the reference frame and the current one.
Illustration: The motion estimator hardware lets the same 16- by 24-pixel range be used for phases 1, 2, 3, and 4 of four different 8- by 8-pixel blocks concurrently. This limits the number of times that a piece of an image must be loaded into the motion estimator.
Illustration: Although the CPU controls the chip, it is the motion estimator, the memory subsystem, and the variable-length encoders that provide most of the chip's computational power.
Peter Wayner is a BYTE consulting editor. You can reach him on the Internet at pcw@access.digex.com or on BIX as "pwayner."