
Multimedia Powerhouse


June 1994 / Features / Multimedia Powerhouse

TI's new MVP chip brings parallel-processing power to multimedia applications

Karl M. Guttag

More and more, computers and applications are incorporating real-world data types, such as video and voice. To some people, dealing with these data types is a headache; we at Texas Instruments see it as a business opportunity. We designed a DSP (digital signal processor)--the MVP--to bring parallel-processing power to bear upon the problems of multimedia.

The MVP integrates onto a single die five fully programmable processors, a sophisticated DMA controller with an external memory interface, 50 KB of SRAM (static RAM), and video timing control (see the figure "The MVP" on page 58). Of the 50 KB of SRAM, 32 KB can be shared among all five processors to support many different parallel-processing approaches. The MVP chip is targeted at solving the problems that are inherent in multimedia and other applications that require a large amount of processing.

Driven by Design

The MVP did not spring fully formed from the memories of TI's CAD workstations. Three basic algorithmic areas drove the MVP's design definition: image processing and recognition, video and still-image compression, and high-performance computer graphics.

The design of the MVP's signal-processing components was driven by the needs of image-processing, image-recognition, and image-compression algorithms. The latter category includes convolution and frequency-domain transforms that are multiplication-intensive. For example, the JPEG and MPEG standards require DCT (discrete cosine transform) frequency transforms, so TI paid particular attention to DCT performance and precision.
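To see why such transforms are multiplication-intensive, consider a direct 1-D DCT. The sketch below is a textbook orthonormal DCT-II in floating point, not the fixed-point form a JPEG or MPEG codec on the MVP would use; the name `dct_1d` is illustrative only. Computed this way, an N-point transform takes N x N multiplies, or 64 for the 8-point case.

```python
import math

def dct_1d(x):
    """Naive orthonormal DCT-II: every output is a sum of N products."""
    n = len(x)
    out = []
    for k in range(n):
        # Orthonormal scaling: the DC term uses a different factor.
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        total = sum(
            x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i in range(n)
        )
        out.append(scale * total)
    return out

# A flat input puts all of its energy in the DC coefficient.
coeffs = dct_1d([1.0] * 8)
```

Production codecs use fast factorizations that cut the multiply count considerably, but multiplier throughput still dominates.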

While the algorithms drove the design of the signal-processing components, the sheer volume of signal processing that is required by these algorithms prompted the decision to include multiple DSPs on the IC. The design team also discerned that, in general, the primary multimedia algorithms required 16-bit or less fixed-point multiplies with 32-bit accumulates. Higher precision was not required in the signal-processing components.
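The arithmetic pattern described here, 16-bit multiplies feeding a 32-bit accumulator, can be modeled in a few lines. This is a sketch under assumptions: the helper names are hypothetical, and the accumulator is assumed to wrap on overflow, which the article does not specify (a real part might saturate instead).

```python
def to_int16(x):
    """Wrap an integer into the signed 16-bit range."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def to_int32(x):
    """Wrap an integer into the signed 32-bit range."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x & 0x80000000 else x

def mac16(samples, coeffs):
    """Dot product: 16-bit operands, 32-bit wrapping accumulator (assumed)."""
    acc = 0
    for s, c in zip(samples, coeffs):
        acc = to_int32(acc + to_int16(s) * to_int16(c))
    return acc
```

For example, `mac16([1, 2, 3], [4, 5, 6])` evaluates 1*4 + 2*5 + 3*6.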

Historically, DSPs have not been very good at processing the bit-field manipulations used in some compression algorithms or at manipulating multiple-pixel quantities, such as those encountered in graphics block moves. The MVP's signal processors differ from traditional DSPs most markedly in their ability to manipulate bit fields and process multiple pixels in parallel through their data paths. To reflect these differences, we call these components advanced DSPs, or ADSPs. The MVP contains four of them.

One important point about ADSPs is that, although they are optimized for certain types of algorithms, they don't dedicate hardware to any specific algorithm. The goal of the MVP is to support elemental operations that can be used to implement any algorithm. This approach pays dividends when vendors develop new algorithms for current problems and when they use the power and programmability of the MVP to develop completely new applications.

Inside the Advanced DSPs

The MVP's four ADSPs provide most of the chip's raw performance. Each can perform in excess of 10 RISC-like operations per cycle (see the figure "The Advanced DSPs").

To specify the multiple parallel operations that they are able to perform, the ADSPs employ a wide instruction word of 64 bits. This instruction word has fields that independently control the data unit, along with its multiplier and data path, and the two address units. All instructions nominally execute in a single cycle.

Each ADSP has a register file of 44 programmer-visible registers. Any register can be a source for, or a destination of, the ALU data path. This includes the program counter (PC), the address registers, and the loop-control registers. Conditional PC-relative jumps, for example, are performed by conditionally writing to the PC. The register set is broken into files based on register functions. Most of the registers support more than one access per cycle, with the register file in the data unit supporting over 10 accesses in a single cycle.

An ADSP data unit consists of three major elements: the data-unit user registers, the multiplier, and the ALU data path. The instruction set supports independent multiplier and ALU data path operations. The multiplier can perform one 16- by 16-bit multiply or two 8- by 8-bit multiplies in a single cycle. The multiplier also has a rounding option, a direct result of the need to maintain the accuracy specified by the video-compression standards. Whereas the ALU data path can operate on any of the registers, the multiplier is restricted to operating on eight data-file registers.

The ALU data path includes a barrel rotator, a mask generator, a 1-to-n bit expander (which is used for binary-to-color transforms, among other things), and a three-input ALU that can combine the mask or expander output with register data to create over 2000 different processing options. The ALU has a 32-bit data path that performs logical and arithmetic functions, and it can combine these to support masking or merging in a single pass. The ALU can be split into smaller sections to perform multiple 8- or 16-bit operations in parallel.
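Two of these ideas, the split ALU and the 1-to-n bit expander, are easy to model in software. The following is an illustrative sketch rather than a description of the actual ADSP instruction encoding; the function names and the choice of four 8-bit lanes are assumptions made for the example.

```python
def split_add8(a, b):
    """Add four unsigned 8-bit lanes packed in 32-bit words.

    Carries are dropped at lane boundaries, as in a split ALU.
    """
    result = 0
    for lane in range(4):
        shift = 8 * lane
        lane_sum = ((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)
        result |= (lane_sum & 0xFF) << shift
    return result

def expand_bits(bits, count, width):
    """Replicate each of the low `count` bits into a `width`-bit field."""
    field = (1 << width) - 1
    mask = 0
    for i in range(count):
        if (bits >> i) & 1:
            mask |= field << (i * width)
    return mask

def color_expand(bits, count, width, fg, bg):
    """Binary-to-color transform: 1 bits select fg pixels, 0 bits select bg."""
    word_mask = (1 << (count * width)) - 1
    mask = expand_bits(bits, count, width)
    return (fg & mask) | (bg & ~mask & word_mask)
```

Here `color_expand(0b0101, 4, 8, 0xAAAAAAAA, 0x11111111)` turns four monochrome bits into four 8-bit pixels by merging foreground and background words under the expanded mask.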

Normally, ALU operations set four status bits: carry, negative, zero, and overflow. Any or all of these bits can be protected from being modified by the current instruction. The instruction set supports both conditional source selection between a pair of registers and writing of the result based on status.

The two address units are nearly identical, and together they can perform two memory operations per cycle. Each memory operation is a load or a store that can be totally independent of the data-unit operation. The address units add an immediate or register index to an address register to form the address. The result of the address computation can optionally modify the address register to facilitate stepping through a memory array.
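The address computation with optional register update can be sketched as follows. The register name `a0`, the dictionary-based register file, and the 32-bit wraparound are assumptions made for illustration.

```python
def address_op(regs, base, index, modify=False):
    """Form base + index; optionally write the result back to the base register."""
    addr = (regs[base] + index) & 0xFFFFFFFF
    if modify:
        regs[base] = addr  # steps the pointer for the next access
    return addr

regs = {"a0": 0x1000}
# Three accesses that walk through an array of 4-byte words.
walked = [address_op(regs, "a0", 4, modify=True) for _ in range(3)]
# An access that leaves the address register untouched.
peek = address_op(regs, "a0", 8)
```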

Like the ALU data path, the two address units support conditional operations. The source for a store can choose between a pair of registers, and the decision whether or not to load a register can be based on status. The source or destination of a store or a load can be any of the 44 registers. A conditional load of the PC performs a conditional jump, which can free up the ALU data path to perform other operations.

Either or both address units can be used to perform a data operation in place of a memory transfer. In such a case, the result of the address data path is written to the destination register instead of data being fetched from memory. This capability, along with conditional loads of the PC, speeds up functions that are computationally bound or jump-bound rather than memory-access-limited.

Three zero-overhead loop controllers are included in each ADSP. Because each ADSP instruction can do so much in parallel, key loops often require very few instructions. Having three loop controllers even allows for nested loops to have zero loop-control overhead.

Each loop controller has a set of registers that specifies the starting address, ending address, current loop count, and the initial count (for nested looping). Once the loop-control registers are initialized, loop counting and branching are performed with zero overhead in terms of execution time. The loop controllers can be used to perform zero-overhead branches to a run-time patch in code segments. Because the loop-control registers sit in the register file, you can write computational results to a loop-count register to specify whether or not a branch is taken based on a zero result.
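A behavioral model helps show how hardware loop registers remove the branch from the instruction stream. The sketch below is an assumption-laden simplification: the class and function names are hypothetical, addresses are instruction indices, and the exact reload semantics of the real loop controllers are not given in the article.

```python
class LoopController:
    """Models start address, end address, current count, and initial count."""

    def __init__(self, start, end, count):
        self.start, self.end = start, end
        self.initial = count
        self.count = count

    def next_pc(self, pc):
        """Return the PC that follows `pc`, branching back at the loop end."""
        if pc != self.end:
            return pc + 1
        if self.count > 1:
            self.count -= 1
            return self.start
        self.count = self.initial  # re-arm so an enclosing loop can rerun it
        return pc + 1

def trace(controllers, first, last):
    """List the instruction addresses executed from `first` through `last`."""
    pc, visited = first, []
    while pc <= last:
        visited.append(pc)
        nxt = pc + 1
        for lc in controllers:
            nxt = lc.next_pc(pc)
            if nxt != pc + 1:  # a controller took its branch
                break
        pc = nxt
    return visited

inner = LoopController(start=1, end=2, count=2)
outer = LoopController(start=0, end=3, count=2)
path = trace([inner, outer], first=0, last=4)
```

No branch instruction appears in the traced code; the comparison against the end address happens alongside normal execution.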

Instruction prefetch and the instruction cache are controlled from within each ADSP. Instructions are executed in a three-cycle pipeline, with a new instruction starting every cycle, assuming that no stalling condition has occurred. The ADSPs' instruction controllers support interrupts and emulation control. If a cache miss occurs, the cache controller will make a packet request to the TC (Transfer Controller; described later) to get the new cache sub-block transferred.

Beyond DSP

In addition to signal processing and bit-level manipulations, multimedia processing requires many other types of operations, such as 3-D graphics and audio processing. These applications often require high-precision floating-point computations. Because a single FPU was all that could fit on the MVP's die, floating-point capability was not incorporated into the ADSPs but built into a separate processor called the Master Processor, or MP (see the figure "The Master Processor"). The FPU contains a special set of instructions to support 3-D graphics transforms and DSP-like floating-point operations.

The MP is a general-purpose RISC processor that is generally programmed in high-level languages. It performs operations requiring a higher level of precision than is available from the ADSPs.

The MP integer unit has a 32-bit instruction word that performs integer register-to-register or load/store instructions nominally in one cycle. The basic load or store operation adds an index to a register containing the base address to form the memory address. To step an address pointer through memory, the instruction can optionally update the register that's used as the base-address register with the result of the add.

The IEEE-754 FPU is pipelined and runs in parallel with the integer unit. In normal operation, a new floating-point add or multiply can be initiated every cycle. A special set of parallel floating-point operations can initiate a multiply, an addition or subtraction, and a 64-bit load or store with automatic increment addressing every cycle.

The register file contains 31 32-bit registers that are common to both the integer unit and the FPU. The registers are scoreboarded for floating-point results and memory-load operations. The scoreboard allows the MP to continue execution; the MP will stall only if an instruction tries to use a register before the prior operation has loaded its result. As with some other RISC architectures, R0 is a dummy register that is always read as zero.
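The effect of scoreboarding can be illustrated with a small timing model. Everything specific here is hypothetical: the latencies, the register names, and the `run` function are assumptions, and the model tracks only read-after-write dependencies.

```python
def run(program, latency):
    """Count issue cycles for a list of (op, dest, sources) instructions.

    An instruction stalls only while one of its source registers is still
    waiting for an earlier result.
    """
    ready_at = {}  # register -> cycle at which its pending result arrives
    cycle = 0
    for op, dest, sources in program:
        start = max([cycle] + [ready_at.get(r, 0) for r in sources])
        ready_at[dest] = start + latency.get(op, 1)
        cycle = start + 1
    return cycle

latency = {"load": 3, "add": 1}  # assumed latencies, in cycles

independent = [
    ("load", "r1", ["r2"]),
    ("add", "r3", ["r4", "r5"]),
    ("add", "r6", ["r4", "r5"]),
]
dependent = [
    ("load", "r1", ["r2"]),
    ("add", "r3", ["r4", "r5"]),
    ("add", "r6", ["r1", "r3"]),  # needs the loaded value in r1
]
```

With these assumed latencies, the independent sequence issues one instruction per cycle, while the dependent one waits an extra cycle for the load to land.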

Instruction flow and cache management are controlled within the MP. A three-stage pipeline starts a new instruction every cycle, assuming no stalling conditions have occurred. The instruction controller also deals with interrupts and emulation support. The MP has hardware for managing the 4-KB data and 4-KB instruction caches. When a cache miss occurs, the MP's cache controller automatically makes a packet request to the TC to get the necessary data transferred.

Communications Matters

The final important consideration in designing the MVP was the need for high data bandwidth for off-chip communications and interprocessor communication. This requirement is common to signal processing, floating-point processing, and graphics processing. Much of the early architecture definition focused on achieving high bandwidth, making sure that the processors wouldn't have to wait on data, and ensuring that interprocessor communication would not be a bottleneck.

To address internal communications issues, we incorporated 25 small, 64-bit-wide RAMs on the MVP chip. These are accessed by the processors through a crossbar interconnection. To handle external communications, we incorporated on-chip an intelligent DMA controller for handling block data movement: the TC (Transfer Controller), mentioned earlier.

The 50 KB of on-chip memory is physically separated into 25 2-KB RAMs. 18 KB of this memory (nine 2-KB blocks) is dedicated to specific functions. Every ADSP uses one 2-KB block as a hardware-managed instruction cache that is loaded by the TC in the event of a cache miss. The MP uses two 2-KB blocks as an instruction cache and two more as a data cache. Finally, one 2-KB block is reserved as fast RAM and is accessible only by the MP and TC.
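The memory budget above can be checked with simple arithmetic. The figures come from the text; only the variable names are made up for this sketch.

```python
BLOCK_KB = 2
TOTAL_BLOCKS = 25

dedicated_blocks = {
    "ADSP instruction caches": 4 * 1,  # one block per ADSP
    "MP instruction cache": 2,
    "MP data cache": 2,
    "MP/TC fast RAM": 1,
}

total_kb = TOTAL_BLOCKS * BLOCK_KB
dedicated_kb = sum(dedicated_blocks.values()) * BLOCK_KB
shared_kb = total_kb - dedicated_kb
shared_blocks = shared_kb // BLOCK_KB
```

The result is 18 KB dedicated and 32 KB shared, which leaves 16 shared blocks for the crossbar to arbitrate.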

The remaining 32 KB of RAM is shared and can be accessed in chunks of 8, 16, 32, or 64 bits at a time; the large number of individual RAMs supports many parallel accesses. A crossbar-switch network lets the following accesses to shared RAM occur simultaneously: two 32-bit accesses by each ADSP, a 64-bit access by the MP, and a 64-bit access by the TC.

Crossbar-switch connections are determined by the most significant bits of each address on a cycle-by-cycle basis. If more than one access is requested of the same RAM block in a cycle, round-robin-prioritization hardware determines which processor is allowed access and which processor is stalled until the next cycle.
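A toy model of this arbitration is shown below. It assumes a flat address map starting at zero, 2-KB banks selected by the upper address bits, and a simple rotating priority per bank; the article does not describe the real prioritization hardware in that detail, so treat the scheme and all names as assumptions.

```python
BANK_BYTES = 2048  # one 2-KB RAM per bank

def bank_of(addr):
    """Upper address bits pick the RAM block (assumed flat map from 0)."""
    return addr // BANK_BYTES

def arbitrate(requests, last_winner, n_ports):
    """Resolve one cycle of requests given as {port: address}.

    Returns (granted, stalled). `last_winner` maps bank -> previous winner
    and is updated in place so that priority rotates between cycles.
    """
    by_bank = {}
    for port, addr in requests.items():
        by_bank.setdefault(bank_of(addr), []).append(port)

    granted, stalled = [], []
    for bank, ports in by_bank.items():
        prev = last_winner.get(bank, -1)
        # The port just after the previous winner has the highest priority.
        winner = min(ports, key=lambda p: (p - prev - 1) % n_ports)
        last_winner[bank] = winner
        granted.append(winner)
        stalled.extend(p for p in ports if p != winner)
    return sorted(granted), sorted(stalled)

history = {}
cycle1 = arbitrate({0: 0x0000, 1: 0x0004, 2: 0x0800}, history, n_ports=4)
cycle2 = arbitrate({0: 0x0008, 1: 0x0004}, history, n_ports=4)
```

In the first cycle, ports 0 and 1 collide on bank 0 while port 2 proceeds on bank 1; in the second, the port that was stalled wins.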

All the shared RAMs and the one MP/TC 2-KB RAM block reside at fixed addresses and are managed by software. Generally, the processors send packet-request commands to the TC to load data before it is needed for processing and to store results after processing. Because of the number of individual RAMs available, these packet transfers can be set up so that they do not conflict with other accesses and therefore work fully in parallel with other processing.

Crossbar-shared memory is the most flexible multiprocessor memory architecture because it puts the fewest restrictions on how data must be organized. While the crossbar involves nearly 1000 data and address lines that must be connected between the processors and memory, it becomes practical to use because everything is integrated on one chip. The crossbar's flexibility translates into better efficiency, in terms of both execution speed and ease of programming.

The Transfer Controller

The transfer controller is a very intelligent DMA controller that can autonomously transfer packets of data between the MVP and external memory (see the figure "The Transfer Controller"). The TC can address memory as either a linear or a multidimensional array of data or even as a complex shape, such as a polygon. The TC is byte-addressable and will automatically handle byte misalignment between the source and the destination. Requests for packet transfers can be made by any of the processors under program control, as well as by the cache controllers and the video controller for display refresh. Transfers can also be initiated by external requests.

The TC processes the source and destination addressing with independent controllers. The burst FIFO (first-in/first-out) supports DRAM page and burst modes and buffers byte-misaligned accesses to more efficiently move data. A separate cache-access controller can break into the middle of program-controlled packet transfers to service cache misses. The request-prioritization/control logic prioritizes the many potentially active requests and starts transfers. The TC will automatically suspend and later resume lower-priority requests when a higher-priority request occurs.

The external memory interface provides support for ROMs, SRAMs, DRAMs, and VRAMs (video RAMs). The support for DRAMs, including timing control and address multiplexing, is relatively new in DSPs. The combination of fast on-chip SRAM and an external DRAM interface supports high performance while also reducing system costs.

The TC is capable of transferring data between sources and destinations that have different dimensions. In graphics and imaging, for example, it is common for the TC to fetch data from an image region as a 2-D array and bring it on-chip for processing as a linear array. After processing, the results stored in a linear array can then be stored off-chip as an x,y array. The ability of the TC to make these transformations autonomously greatly improves the efficiency of processing by the ADSPs and the MP.
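In software terms, the 2-D to linear transformation amounts to a strided gather and scatter. The sketch below models external memory as a byte array with a fixed row pitch; the function names and parameters are hypothetical and stand in for whatever packet-request parameters the TC actually takes.

```python
def gather_2d(memory, base, pitch, width, height):
    """Copy a width x height region of a pitch-wide image into a linear buffer."""
    out = bytearray()
    for row in range(height):
        start = base + row * pitch
        out += memory[start:start + width]
    return bytes(out)

def scatter_2d(memory, base, pitch, width, height, data):
    """Write a linear buffer back into a width x height region of the image."""
    for row in range(height):
        start = base + row * pitch
        memory[start:start + width] = data[row * width:(row + 1) * width]

# A 4 x 4 image of bytes 0..15; fetch the 2 x 2 tile whose corner is pixel (1, 1).
image = bytearray(range(16))
tile = gather_2d(image, base=5, pitch=4, width=2, height=2)
processed = bytes(255 - b for b in tile)  # stand-in for on-chip processing
scatter_2d(image, base=5, pitch=4, width=2, height=2, data=processed)
```

On the MVP the TC does this work autonomously, so the processors see only the compact linear buffer.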

The MVP chip has two sets of video-timing counters and registers. The video controller keeps track of horizontal and vertical synchronization and blanking timing, as well as supporting a 2-D border region. Each counter has its own asynchronous clock input and has a set of synchronization, blanking, and border signals. The synchronization signals can be individually set up as outputs (for display) or inputs (for video capture). An SRT (shift-register transfer controller) has comparators that cause shift-register transfer cycles for VRAMs or cause packet transfers for DRAM-based display memory.

Support Issues

Although perhaps not as exciting as the microarchitecture, testability and software debugging were important concerns in the MVP design, and roughly 10 percent of the chip's nonmemory transistors are dedicated to these functions. All storage nodes can be scanned in or out to support boundary-scanned testing. Other features in the scan path support emulation loading and the dumping of the internal state of the MVP. Address comparators were also added to support real breakpoints.

A complete suite of software support has been developed for the MVP chip. Assemblers and C compilers have been developed for the MP and the ADSPs. A C-like algebraic assembler for the ADSPs supports the many different operations that they can perform. Software-simulation and hardware-emulation tools that use the same graphical interface are available. An imaging and graphics software library is also currently being developed. An MP-resident executive supports multitasking and intertask synchronization and communications. Under the executive, tasks running on the MP issue commands that are carried out by the ADSPs.

Putting It All Together

Through the use of parallel processing, the MVP puts a new level of programmability and performance on a single IC. Not only does the MVP integrate five processors on a single chip, but each processor can execute many operations in parallel. The MVP is implemented on a 342-mm² die, using a 0.6-micron, three-metal-layer process. It uses a 3.3-V power supply and will initially run at 40 MHz, with 50-MHz parts due next year. The MVP is packaged in a ceramic pin-grid array, but it will eventually move to a composite metal-plastic package. It draws 7.5 W at 50 MHz.

The MVP is capable of performing the equivalent of over 2 billion RISC-like operations per second. In specific applications, a single MVP can do the job of over 10 of the most powerful DSPs or general-purpose processors previously available. The MVP can move 2.4 GB of data and 1.8 GB of instructions within the chip--plus shuffle 400 MB of data to off-chip memory--per second.
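These peak figures can be reproduced from the port widths given in the caption of the figure "The MVP", under the assumption that every port transfers once per cycle at 50 MHz. The clock choice and the variable names are assumptions of this sketch; the article does not show the calculation.

```python
CLOCK_HZ = 50_000_000  # assumption: the 50-MHz part, one transfer per port per cycle

# Data ports: two 32-bit ports per ADSP, one 64-bit MP port, one 64-bit TC port.
data_bytes_per_cycle = 4 * 2 * 4 + 8 + 8
# Instruction ports: one 64-bit port per ADSP, one 32-bit MP port.
instr_bytes_per_cycle = 4 * 8 + 4
# Off-chip: the TC's 64-bit external port.
external_bytes_per_cycle = 8

data_gb_per_s = data_bytes_per_cycle * CLOCK_HZ / 1e9
instr_gb_per_s = instr_bytes_per_cycle * CLOCK_HZ / 1e9
external_mb_per_s = external_bytes_per_cycle * CLOCK_HZ / 1e6

# Four ADSPs at roughly 10 RISC-like operations per cycle each.
adsp_ops_per_s = 4 * 10 * CLOCK_HZ
```

The ADSPs alone account for 2 billion operations per second at 10 operations per cycle; since each ADSP is described as exceeding that rate, the total comes out above 2 billion before the MP is counted.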

Some of the obvious uses for the MVP chip will be multimedia applications, such as videoconferencing; document-image processing, from digital copiers to real-time OCR; 2-D and 3-D graphics; audio enhancement and compression; telecommunications; and virtual reality. But the real virtue of the MVP is that its combination of programmability and performance will undoubtedly lead to applications that are as yet unimagined.

ACKNOWLEDGMENTS

I wish to thank all the people who made the MVP a reality, especially co-architects Bob Gove, Nick Ing-Simmons, Keith Balmer, and program manager Walt Bonneau. The development of the MVP was a worldwide TI project, involving TI employees in Houston; Dallas; Bedford, England; and Bangalore, India.


CPUs vs DSPs



  CPU                                     DSP
--Generalized functions                 --Signal processing
--Single data bus                       --Multiple data buses
--Hardware cache control                --Programmer-accessible caches
--Generalized addressing                --Loop-optimized addressing
--Programmer manages microarchitecture  --Hardware manages microarchitecture




Target Applications



Videoconferencing
Document image processing
X Window System terminals
Imaging
3-D graphics
Compression
Cellular base stations
Virtual reality
Video servers
Neural networks


Figure: The MVP. Each ADSP has two independent 32-bit data ports (G and L in the figure) and a 64-bit instruction cache input (I). The MP has a single 64-bit data port (C/D) and a 32-bit instruction port. The TC (Transfer Controller) has a 64-bit internal port and a 64-bit external port. The ports of the various processors and TC are connected to 25 2-KB RAMs via a crossbar-switch network. The crossbar supports approximately 2.4 GBps of data, plus 1.8 GBps of instructions.
Figure: The Advanced DSPs. Like other DSPs, those integrated with the MVP are built to support multiple data accesses in a single cycle and to optimize the performance of the multiply-accumulate operations that characterize signal-processing algorithms. In addition, the ADSPs also support bit-field and pixel operations, making them powerful imaging and graphics processors as well.
Figure: The Master Processor. The MP consists of both an integer unit and an FPU. Control of the instruction and data caches is integrated into the processor, although the cache memory resides in the SRAM array.
Figure: The Transfer Controller. In addition to controlling the movement of data and instructions on- and off-chip, the TC performs transformations where data is processed in an order different from that in which it is stored. The TC also contains a DRAM and VRAM controller.
Karl M. Guttag is a TI Fellow and chief architect of the MVP chip. You can contact him on the Internet at karl@video.sc.ti.com or on BIX c/o "editors."
