
Parallel Goes Populist


May 1997 / Features / Parallel Goes Populist

Parallel computing finally sheds its supercomputing shackles and helps ordinary PCs find the processing power they crave.

Dick Pountain

"When will parallel processing arrive in mainstream computing?" It's one of those infuriating questions that always seems to require the answer, "Next year."

We continually need more and more computing power to run applications such as 3-D graphics, MPEG video, and huge SQL queries. Using multiple processors seems an obvious way to supply that power. The problem is software.

Although operating systems such as Windows NT support multiple processors, desktop PC applications have yet to fully exploit this capability through internal multithreading. Even with more-sophisticated enterprise-level software, a portability problem exists: Until recently, parallel programming techniques have been so hardware-dependent that a program written for one parallel architecture had to be rewritten to run on a different architecture.

The result? Parallel processing remains within the realm of supercomputing, a technology for defense departments, aircraft designers, physicists, and weather forecasters. These groups write their own code and budget accordingly; their world is a long way from the world of shrink-wrapped software.

But that profile is changing. Today, commercial applications ranging from multimedia servers to data warehousing systems are demanding the power that parallel processing offers. At the same time, three main technical developments -- new hardware designs, clustering, and advances in program-code portability -- are allowing parallel processing to break through into wider markets.

Change is coming at a critical time. Parallel computing's supercomputing shackles, combined with a shrinking defense market, created a crisis in recent years that sent technology leaders like Thinking Machines and Kendall Square Research to the wall and even forced a famous name like Cray to merge with Silicon Graphics. At the same time the desktop PC market is groping its way toward parallelism.

Switching to Success

The first of the three important trends -- hardware innovations -- sees designers moving to high-speed switched interconnects. These interconnects can make distributed-memory massively parallel processing (MPP) machines appear to programmers like shared-memory symmetric multiprocessing (SMP) machines, which enormously simplifies programming them.

The key to success in designing a parallel computer is to get the right balance between the processing power of the CPUs and the communication bandwidth between them; any imbalance here will mean that some of the CPUs will be starved of data and the advantage of parallelism lost. The crucial metric is the bisectional bandwidth of the whole system, which you derive by conceptually dividing the network of CPUs in half and measuring the data rate across this divide. The result reflects the potential performance on real-world problems when the data is not optimally placed. Early designers of parallel computers experimented with exotic topologies like 3-D toruses and higher-order hypercubes to optimize this balance. Today, the emphasis has moved to architectures where the CPUs are connected via digital crossbar switches so that any node can be connected quickly to any other. In this way, each CPU can be just a hop or two away from any other, regardless of the physical topology.
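As a rough illustration of the metric, bisectional bandwidth can be estimated as the number of links cut when the network is split into two equal halves, multiplied by the bandwidth of one link. The sketch below uses the usual textbook link counts for three simple topologies. The function names and the 100-MBps link speed in the usage note are assumptions made for illustration, not figures from any machine discussed in this article.

```python
import math

def bisection_links(topology: str, nodes: int) -> int:
    """Links crossing a cut that splits the network into two equal halves."""
    if topology == "ring":
        return 2                      # a ring is cut in exactly two places
    if topology == "mesh2d":          # square 2-D mesh without wraparound links
        side = math.isqrt(nodes)
        if side * side != nodes:
            raise ValueError("mesh2d needs a square node count")
        return side
    if topology == "hypercube":       # every node has one link into the other half
        if nodes < 2 or nodes & (nodes - 1):
            raise ValueError("hypercube needs a power-of-two node count")
        return nodes // 2
    raise ValueError(f"unknown topology: {topology}")

def bisection_bandwidth(topology: str, nodes: int, link_mbps: float) -> float:
    """Bisectional bandwidth in the same units as link_mbps."""
    return bisection_links(topology, nodes) * link_mbps
```

With 64 nodes and hypothetical 100-MBps links, calling bisection_bandwidth for "ring", "mesh2d", and "hypercube" shows why topology matters: only the hypercube's figure grows in step with the node count.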

Such switched-interconnect fabrics make it possible to allocate a single large virtual address space to all the separate memories of a physically distributed system. Thus, the machine appears to programs as a shared-memory machine but without the bus-contention or arbitration bottleneck because when two nodes are actually connected, they alone have access to that piece of interconnect. By placing crossbar switches on every processing node, you can make the bisectional bandwidth scale linearly -- every time you add more processing power, you are also adding more communications bandwidth. The same technique applies equally to I/O, so disk drives may also be connected via crossbar switches, allowing you to momentarily attach any disk to any CPU node.
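The contention argument can be shown with a toy model. In the sketch below, a crossbar lets any number of disjoint port pairs talk at once, whereas a shared bus would carry one transfer at a time. The Crossbar class and its method names are hypothetical and model no vendor's switch.

```python
class Crossbar:
    """Toy model: each port can take part in at most one connection at a time."""

    def __init__(self, ports: int):
        self.ports = ports
        self.peer = {}                # port -> port it is currently connected to

    def connect(self, a: int, b: int) -> bool:
        if a == b or not (0 <= a < self.ports and 0 <= b < self.ports):
            raise ValueError("invalid port pair")
        if a in self.peer or b in self.peer:
            return False              # one endpoint is busy; the caller must retry
        self.peer[a] = b
        self.peer[b] = a
        return True

    def disconnect(self, a: int) -> None:
        b = self.peer.pop(a, None)
        if b is not None:
            self.peer.pop(b, None)

    def active_connections(self) -> int:
        return len(self.peer) // 2
```

On a six-port switch, the pairs (0, 1), (2, 3), and (4, 5) can all be connected simultaneously; a request only waits when one of its own endpoints is already in use.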

To see how switched interconnect works, consider Silicon Graphics' new S2MP multiprocessor server technology, employed in the Origin2000 multimedia server. SGI describes S2MP as a distributed shared-memory architecture. It scales up to 128 processors (MIPS R10000 RISC chips) and 256 GB of memory. This physically distributed memory -- up to 4 GB per node -- appears to the Irix operating system as a single shared memory, thanks to a pair of custom ASICs, the hub and the router containing six-way crossbar switches, and a superfast point-to-point wiring called CrayLink (which SGI inherited from Cray Research). A third ASIC, called Crossbow, provides switching to I/O devices. Every node board contains one or two CPUs and a hub, and they connect via CrayLink through router boards that can link any node to any other. Each hub also controls a separate directory memory to store information about the cache status of all the main memory within its node. The hub uses this information to provide scalable cache coherence and migrate data to a node that accesses it more frequently than the present node. As a result, the bisectional bandwidth scales linearly, at least up to 32 processors:

   8 processors      1.25 GB per second
  16 processors      2.5  GB per second
  32 processors      5.0  GB per second
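
The linear scaling in these figures is easy to check: if bandwidth grows in step with processor count, the share per processor stays constant. The short sketch below does that arithmetic on the three quoted data points. The variable and function names are illustrative only.

```python
# (processors, bisectional bandwidth in GB per second) as quoted above
figures = [(8, 1.25), (16, 2.5), (32, 5.0)]

def per_processor_share(rows):
    """Bandwidth available per processor for each configuration."""
    return [bandwidth / processors for processors, bandwidth in rows]

shares = per_processor_share(figures)
# Linear scaling means every configuration yields the same share.
is_linear = max(shares) - min(shares) < 1e-12
```

Each configuration works out to the same share, roughly 156 MB per second per processor.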

Clustering for Comfort

The trend toward clustering, where groups of workstations or PCs employ a middleware layer to make them behave like a single parallel computer, means that companies can leverage their existing hardware investment by using the LAN as a "supercomputer" during off-peak periods, and thus lower the entry barrier to parallel computing (in theory, anyway). Clustering treats a network of separate computers as if it were a single computer. This approach has been used for many years in the minicomputer sector by firms like DEC, Tandem, and Pyramid for high-availability, fault-tolerant servers.

You can implement clustering using software alone, a concept made popular by PVM (Parallel Virtual Machine), a message-passing environment. There are implementations of PVM for many flavors of Unix and now for Windows PCs (see "Parallel Computing Windows Style," May 1996 BYTE). This approach created the "supercomputer" -- actually a network of 117 Sun workstations -- used to render frames for the movie Toy Story.

The Message Passing Interface (MPI), with language bindings for C++ and Fortran, lets you build portable parallel applications to run on clusters of workstations. It consolidates the best features learned from PVM, the European PARMACS, and several proprietary systems from IBM, Intel, and nCube. The second version, MPI 2, has just been released and adds advanced features like dynamic process management, parallel I/O, and real-time extensions.
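To give a flavor of the message-passing style that PVM and MPI embody, here is a minimal sketch in which a master hands out chunks of work and collects partial results purely by sending and receiving messages. It deliberately uses Python threads and queues as a stand-in and does not call the PVM or MPI APIs. The function names and the sum-of-squares task are assumptions chosen for illustration.

```python
import queue
import threading

def worker(inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Receive (task_id, numbers) messages until a None stop message arrives."""
    while True:
        message = inbox.get()
        if message is None:
            break
        task_id, numbers = message
        outbox.put((task_id, sum(x * x for x in numbers)))

def sum_of_squares(data, workers: int = 4) -> int:
    inbox, outbox = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(inbox, outbox))
               for _ in range(workers)]
    for t in threads:
        t.start()
    chunks = [data[i::workers] for i in range(workers)]
    for task_id, chunk in enumerate(chunks):
        inbox.put((task_id, chunk))              # "send" a piece of work
    for _ in threads:
        inbox.put(None)                          # one stop message per worker
    partials = [outbox.get() for _ in chunks]    # "receive" the partial results
    for t in threads:
        t.join()
    return sum(value for _, value in partials)
```

Because the workers touch nothing but the messages they are sent, the same structure carries over to separate machines once the queues are replaced by a real transport.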

For problems like rendering, in which computation outweighs communication, a cluster will deliver acceptable performance even over an Ethernet. But other problems need a faster transport more closely matched to the power of modern CPUs. A typical example is the Alpha AXP cluster at Tampere University in Finland: 21 DEC Alpha workstations connected via an optical asynchronous transfer mode (ATM) switch operating at 10 gigabits per second and delivering supercomputer performance of 4.6 GFLOPS. Such message-passing clusters are suitable for scientific and engineering applications, where the use of PVM or MPI can result in a great deal of code portability for semicustom software. But in the commercial sector, where accelerating SQL database queries is the main task, there's a new emphasis on SMP clustering.
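The trade-off between computation and communication can be put into a back-of-the-envelope model. The sketch below assumes the compute work divides evenly across nodes and that all data crosses a single shared network one transfer after another. Both are simplifying assumptions, and the function names and parameters are hypothetical.

```python
def cluster_time(compute_s, data_bytes, nodes, bandwidth_bytes_per_s, latency_s=0.0):
    """Estimated run time: evenly divided compute plus serialized data transfer."""
    compute = compute_s / nodes
    if nodes == 1:
        return compute                # nothing to ship on a single machine
    communicate = latency_s * nodes + data_bytes / bandwidth_bytes_per_s
    return compute + communicate

def speedup(compute_s, data_bytes, nodes, bandwidth_bytes_per_s, latency_s=0.0):
    """Single-machine time divided by estimated cluster time."""
    return compute_s / cluster_time(
        compute_s, data_bytes, nodes, bandwidth_bytes_per_s, latency_s)
```

Under this model, a job with 1000 seconds of computation and 1 MB of data comes close to a tenfold speedup on 10 nodes over 10-Mbps Ethernet (1.25e6 bytes per second), while a job with 10 seconds of computation and 100 MB of data runs slower on the cluster than on one machine.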

Basically, each node in a cluster becomes an SMP computer in its own right, with a smart interconnect designed to make the whole cluster look to software like a single SMP machine. Thus, there's no need to change any application software when you add more nodes. This sort of architecture is often referred to as nonuniform memory access (NUMA) because the speed of a memory reference is different within an SMP node and between nodes. NUMA promises to combine the easy programming benefits of SMP with the scalability of MPP, and it's likely to be the future for parallel computing, especially once Microsoft supports it through its Wolfpack technology.

A good example of clustered SMP is Sequent's NUMA-Q architecture (see "The Network in the Server," July 1996 BYTE). NUMA-Q is built out of nodes called "quads" that are complete SMP computers, each containing four Pentium Pro processors on a 500-MBps shared bus and a proprietary 1-GBps interconnect called IQ-Link. You could use IQ-Link for message passing, but Sequent has developed middleware that makes the links memory-coherent so that the whole cluster appears to be one large shared memory. IQ-Link monitors the Pentium Pro processor bus and so knows when it must respond to requests for memory locations outside the range of memory addresses assigned to this quad. The link examines its own large cache and, if the requested data cannot be found there, forwards the request to the other quads quite transparently to the database and application software.
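The lookup sequence just described can be sketched as three checks in order: the quad's own address range, then the link's cache, then the other quads. This is a simplified model that ignores coherence traffic and invalidation entirely. The Quad class, its fields, and the return labels are hypothetical and are not taken from Sequent's design.

```python
class Quad:
    def __init__(self, quad_id: int, base: int, size: int):
        self.quad_id = quad_id
        self.base = base              # first address assigned to this quad
        self.size = size
        self.memory = {}              # local memory: address -> value
        self.link_cache = {}          # remote data cached by the link
        self.others = []              # other quads reachable over the interconnect

    def owns(self, address: int) -> bool:
        return self.base <= address < self.base + self.size

    def read(self, address: int):
        """Return (value, where_it_was_found)."""
        if self.owns(address):
            return self.memory.get(address, 0), "local"
        if address in self.link_cache:
            return self.link_cache[address], "link-cache"
        for quad in self.others:
            if quad.owns(address):
                value = quad.memory.get(address, 0)
                self.link_cache[address] = value    # keep a copy for next time
                return value, "remote"
        raise ValueError(f"address {address:#x} is not mapped to any quad")
```

The calling software sees only read(address); whether the value came from local memory, the link cache, or another quad affects latency but not the result.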

Tandem's ServerNet implements a somewhat similar NUMA architecture, using a packet-switched interconnect based on 800-Mbps six-way crossbar switches and a "worm-hole" routing algorithm (i.e., message headers may leave a node before the tail has arrived) to minimize latency. The great advantage of these clustered-SMP architectures for commercial database operations is that they will work with common software like Windows NT Server, SQL Server, or Oracle, and the Intel-based node boards should be relatively inexpensive.
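Why worm-hole routing cuts latency can be seen from a simple comparison with store-and-forward switching, where every switch waits for the whole packet before passing it on. The sketch below ignores switch delay and contention. The function names, packet size, and header size are assumptions for illustration; only the 800-Mbps link speed comes from the text.

```python
def store_and_forward_latency(packet_bits, hops, link_bps):
    """Each switch receives the whole packet before sending it on."""
    return hops * packet_bits / link_bps

def wormhole_latency(packet_bits, header_bits, hops, link_bps):
    """Only the header pays the per-hop cost; the body streams in behind it."""
    return hops * header_bits / link_bps + (packet_bits - header_bits) / link_bps
```

For a hypothetical 8000-bit packet with a 64-bit header crossing four 800-Mbps hops, the store-and-forward figure is roughly four times the worm-hole figure, and the gap widens with every extra hop. Over a single hop the two are the same.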

Portability

The lack of portability of program code between different parallel architectures remains a major stumbling block for new commercial customers, companies that typically place great importance on after-sales support. Parallel computing is caught in a vicious circle: The lack of commercial software hinders parallel hardware vendors from selling machines, while software vendors will not spend money porting their code to parallel machines because the market is too small.

However, newly invented software layers now disguise the underlying machine's topology and allow programs to be more easily ported between machines. For example, bulk synchronous parallelism (BSP), a new parallel programming model, can allow the same parallel application to run on an SMP machine, a cluster using PVM or MPI, or a distributed-memory MPP machine. Several of these trends may combine within one architecture, as in Sequent's NUMA-Q architecture, which employs clusters of SMP machines with a fast switched interconnect.
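A BSP program runs as a series of supersteps: each process computes on its own data, sends messages, and then waits at a barrier, after which the messages become visible. The sketch below simulates that structure sequentially for a simple sum and includes the model's standard per-superstep cost formula, w + h*g + l. The function names are hypothetical, and this is a simulation of the idea rather than a call into any BSP library.

```python
def bsp_sum(values, processes: int = 4):
    """Sum a list in two supersteps on simulated BSP processes."""
    local = [values[p::processes] for p in range(processes)]
    inbox = [[] for _ in range(processes)]

    # Superstep 1: each process computes on local data, then sends one message.
    for p in range(processes):
        partial = sum(local[p])
        inbox[0].append(partial)      # message addressed to process 0
    # Barrier: messages sent during superstep 1 become visible only now.

    # Superstep 2: process 0 combines what it received.
    return sum(inbox[0])

def bsp_cost(work, h, g, l):
    """Cost of one superstep: local work + h messages at gap g + barrier cost l."""
    return work + h * g + l
```

Because the program is written only in terms of supersteps, the machine-specific parameters g and l change from one architecture to the next while the code stays the same.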

Grand Strategy

Don't think that the traditional supercomputer market has gone away completely; supercomputers are a strategic resource for the defense industry, so no government would let that happen. There are still several manufacturers working on MPP machines to solve the Grand Challenge problems in particle physics, fluid dynamics, and atmospheric modeling. In the United States, MPP activity is concentrated around the current main source of funding, the Department of Energy's Accelerated Strategic Computing Initiative (ASCI) program, which was set up to develop simulation technologies that can check the safety of nuclear weapons without underground testing. The chief centers for ASCI are the Sandia, Los Alamos, and Lawrence Livermore laboratories.

Intel's huge MPP machine, called ASCI Red, made headlines last November when it performed more than 200 GFLOPS on the MP Linpack benchmark. When complete, the machine should exceed the elusive teraFLOP barrier. The 11-cabinet configuration (out of an intended 86) contained 688 compute nodes with 1376 200-MHz Pentium Pro processors and more than 80 GB of memory. It recorded 213 GFLOPS, against a peak performance of 400 MFLOPS from each two-processor node. IBM is also involved in ASCI with its RS/6000 SP system.
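The quoted figures allow a quick efficiency estimate for the 11-cabinet configuration. The sketch below uses only numbers from the paragraph above. The variable names are illustrative.

```python
nodes = 688
processors = 1376
peak_per_node_mflops = 400
measured_gflops = 213

processors_per_node = processors // nodes                 # two CPUs on each node
peak_gflops = nodes * peak_per_node_mflops / 1000         # aggregate peak of the partial machine
efficiency = measured_gflops / peak_gflops                # fraction of peak reached on MP Linpack
```

The aggregate peak comes to about 275 GFLOPS, so the measured 213 GFLOPS is a little over three-quarters of what the hardware could theoretically deliver.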

In Seattle, Tera Computer builds what some believed to be an extinct species: a parallel supercomputer based on proprietary computing nodes. The Tera is a shared-memory machine that uses a clever multithreaded CPU architecture and a packet-switched interconnect fabric; each processor switches context every 3-ns cycle among as many as 128 distinct instruction streams ("hardware threads"). Each stream may issue as many as eight memory references without waiting for earlier ones to finish, which hides much of the memory latency. At the 333-MHz clock speed, each processor has a peak memory bandwidth of 2.67 GBps, and the machine can support up to 256 processors (700 GBps in total), which Tera claims is 95 percent sustainable.
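The bandwidth figures follow from the clock rate if each processor issues one memory reference per cycle and each reference moves one 64-bit word. That 8-byte width is an assumption made here to reproduce the quoted numbers and is not stated in the text. The variable names are illustrative.

```python
clock_hz = 333e6
bytes_per_reference = 8               # assumption: one 64-bit word per cycle
max_processors = 256
sustainable_fraction = 0.95           # Tera's claim, as quoted above

cycle_ns = 1e9 / clock_hz                                     # about 3 ns
per_processor_gbps = clock_hz * bytes_per_reference / 1e9     # about 2.67 GBps
total_gbps = per_processor_gbps * max_processors              # quoted above as 700 GBps
sustained_gbps = total_gbps * sustainable_fraction
```

With an exact 333-MHz clock, the total works out to roughly 680 GBps, which the text rounds up to 700 GBps. At the claimed 95 percent, about 650 GBps would be sustained.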

A European Community initiative called Europort (part of the ESPRIT program) has successfully encouraged the porting of some of the most widely used industrial design applications (for automobiles, aerospace, pharmaceuticals, and cartoon animation, among others) to a variety of parallel computers. The projects have proved that the performance increases more than justified the cost. Europort's success stems from its uniform programming approach based on message passing with PVM. But also contributing was the initiative's organizational structure: Each porting consortium had to include not just the code vendor but several end users of the code and a parallel programming specialist.

Perhaps asking when parallel computing will hit the mainstream isn't the right question. Rather, we should ask if one declining and one growing industry sector -- supercomputing and PC-based client/server computing, respectively -- can combine to make parallel computing viable for general business. Thanks to the three important technical innovations we're seeing today, the answer appears to be "Yes, they can."


Where to Find


Intel
Santa Clara, CA
Phone:    408-765-8080
Internet: http://www.intel.com

Microsoft
Redmond, WA
Phone:    206-882-8080
Internet: http://www.microsoft.com

Sequent Computer Systems
Beaverton, OR
Phone:    503-626-5700
Internet: http://www.sequent.com/

Silicon Graphics
Mountain View, CA
Phone:    415-960-1980
Internet: http://www.sgi.com

Tandem Computers
Cupertino, CA
Phone:    408-285-6000
E-mail:   info@tandem.com




Pushing Parallel to Wider Markets

Silicon Graphics' new multiprocessor server technology, S2MP, eases some formidable programming problems.


Interconnected I/O

The same switched interconnection technique used to link processor nodes can also join processors to storage devices.


Dick Pountain is a BYTE contributing editor based in London. You can reach him at dickp@bix.com.
