percomputer companies, from old stalwart Cray Research to new kids like Silicon Graphics, Inc. (SGI), are alive and well.
But today you might well mistake a next-generation supercomputer for a filing cabinet tucked into an office corner or even -- shock hazard -- beside a desk. There are still some custom parts inside these systems, but they're mostly off-the-shelf workstation components. Nevertheless, such a box packs more processing might than its ancestors did, and it sports a double-take price tag that a mid- to large-size business can afford: around $100,000 and up. As a business and its computing demands grow, the supercomputer can grow along with it, by the addition of extra processors and hard drives.
A Fast History of Speed
In the sixties and seventies, supercomputer companies like Cray Research got their start building systems that could handle formidable scientific and engineering problems that used floating-point calculations extensively. The intense computational demands of this type of work -- much of it involving nuclear research or aerospace design, motivated by the arms race -- combined with the necessity of quick results, meant that cost was no object in obtaining the fastest hardware possible. For example, the first Cray-1, like the one shipped to the Los Alamos National Laboratory in 1976, had a peak speed of 167 MFLOPS and cost anywhere from $4 million to $11.2 million, depending on the hardware configuration.
These systems used a vector-processing design that reflected typical engineering problems. Vectors are data arrays representing specific quantities. For example, you might model a jet engine with one vector representing the engine's structural elements, another specifying thermal characteristics within the engine, and a third detailing the fluid flow through the engine. A program that simulates engine operation mathematically combines the vectors according to physical models, iterating from time-step to time-step. Properly done, the simulation allows you to study the behavior of engine parts as the engine runs, which enables you to spot flaws in a new design or evaluate the strength of a different engine material without ever building a real prototype.
To boost performance, supercomputer designers oriented the system architecture around these vector operations, adding extra hardware to the processor to manipulate data vectors efficiently. Pipelined logic units allowed operations such as multiple memory accesses and the rapid combination of data vectors to overlap. To run at a high clock rate, these special-purpose processors employed bipolar logic circuits -- fast, but so hot that they needed liquid cooling systems. High-performance -- and high-price-tag -- peripherals handled the supercomputer's I/O demands.
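As a rough illustration of what "combining vectors from time-step to time-step" means, here is a minimal sketch in Python. The quantities, the update rule, and every name in it (step, simulate, cooling_rate, and so on) are assumptions made up for the example, not a real engine model; the point is only that one operation is applied across an entire data array at each step, which is the pattern vector hardware is built to accelerate.

```python
# Illustrative sketch only: a toy time-stepped simulation over data vectors.
# The physical model and all names here are hypothetical.

def step(temperature, heat_input, cooling_rate, dt):
    """Advance every element of the temperature vector by one time-step."""
    return [
        t + dt * (q - cooling_rate * t)
        for t, q in zip(temperature, heat_input)
    ]

def simulate(temperature, heat_input, cooling_rate, dt, steps):
    """Apply the same whole-vector update repeatedly, one time-step at a time."""
    for _ in range(steps):
        temperature = step(temperature, heat_input, cooling_rate, dt)
    return temperature

# Made-up starting values for four cells of the model.
temperature = [300.0, 300.0, 300.0, 300.0]
heat_input = [0.0, 50.0, 100.0, 0.0]
final = simulate(temperature, heat_input, cooling_rate=0.1, dt=0.1, steps=100)
```

A vector processor would carry out the body of step as a handful of pipelined vector instructions rather than as an element-by-element loop.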
Compounding the system price even further was the cost of software. Some supercomputers used proprietary OSes, and researchers often had to write the modeling software from scratch. Nevertheless, because these research problems required tremendous processing power, and government funding helped foot the bill, supercomputers sold. Supercomputer companies may have sold only a few dozen systems a year, but they charged high margins to defray the expenses of research and building a limited run of parts.
Because producing faster custom processors pushed the limits of fabrication technology, new supercomputer designs became more difficult and more expensive. Cray Research begat Cray Computer in 1989 to develop custom processors out of gallium arsenide, a material with faster switching times than silicon. That company eventually foundered last year due to delays in fabricating such parts.
Ironically, Cray Research itself kept ahead of the competition by taking a different tack entirely. Rather than using faster processors, its 1983 Cray X-MP had up to four custom vector processors to divide and conquer computing jobs. This system, which could deliver a peak of 941 MFLOPS, cost anywhere from $2.5 million to $16 million.
Many Brains Make Light Work
Then came RISC. While perhaps not as powerful as a custom processor, a RISC processor had a significant price advantage. Manufactured by the hundreds of thousands rather than by the dozen, RISC processors benefited from enormous economies of scale that made them much less expensive by comparison. With multiple processors working together, a complex problem could fall by sheer numbers rather than by raw speed.
Another speed boost: The competitive nature of the workstation market had RISC vendors striving to one-up the competition by boosting chip performance. According to Jack J. Dongarra of the mathematical sciences section at Oak Ridge Laboratory, who keeps track of the supercomputing industry, "RISC processors got faster quicker than anyone expected. Although RISC first appeared in a commercial system about a decade after the first supercomputer, their floating-point performance lags [behind that of] traditional supercomputers by only an order of magnitude. We see the current generation of RISC processors matching the performance of last generation's traditional supercomputers on a per-processor basis."
Employing gangs of processors in parallel affects how you write programs. You divvy up the data array, assigning different portions to different processors. Every processor runs its portion of the program in parallel (i.e., simultaneously) with the other processors. As the program progresses, every processor exchanges data with its neighbors, as shown in the figure "Processing Schemes." This scheme is known as parallel processing. A parallel-processor architecture can be scalable; that is, to get more computing power, you can add more processors to the existing system.
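The divide-and-exchange pattern described above can be sketched as follows. This is a simplified illustration, not code from any vendor: the smoothing operation, the function names, and the sample data are all invented for the example, and the "processors" are visited one after another in an ordinary loop, whereas on a real parallel machine each chunk would run on its own processor at the same time.

```python
# Illustrative sketch only: split an array among hypothetical "processors"
# and exchange boundary values with neighbors before each computes its part.

def smooth_serial(data):
    """Reference version: average each interior value with its two neighbors."""
    out = data[:]
    for i in range(1, len(data) - 1):
        out[i] = (data[i - 1] + data[i] + data[i + 1]) / 3.0
    return out

def split(data, workers):
    """Divide the array into contiguous chunks, one per worker."""
    size, extra = divmod(len(data), workers)
    chunks, start = [], 0
    for w in range(workers):
        end = start + size + (1 if w < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks

def smooth_partitioned(data, workers):
    """Each worker smooths its own chunk after borrowing neighbor edge values."""
    chunks = split(data, workers)
    result = []
    for w, chunk in enumerate(chunks):
        # The "exchange" step: fetch one boundary value from each neighbor.
        left = [chunks[w - 1][-1]] if w > 0 else []
        right = [chunks[w + 1][0]] if w < workers - 1 else []
        smoothed = smooth_serial(left + chunk + right)
        # Discard the borrowed values; keep only this worker's own portion.
        result.extend(smoothed[len(left):len(smoothed) - len(right)])
    return result

data = [float(i * i) for i in range(10)]
assert smooth_partitioned(data, 3) == smooth_serial(data)
```

The final assertion checks that the partitioned computation gives the same answer as the serial one, which holds only because each chunk received its neighbors' edge values first.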
But don't abandon vector processing just yet. In certain situations, a vector-processing system delivers better performance than a parallel-processing system, especially when dealing with complex simulations involving huge data arrays. That's because the average memory-access times can be shorter with vector processing, even with a large memory space. In contrast, a parallel-processing system with lots of memory might have to wait quite a while for data to move from one part of the system to another (because, as on a network of PCs, a packet might rattle around through dozens of nodes before reaching its target).
Current vector-processing supercomputers are also scalable (up to a point), in that you add more processors to boost performance. The Cray X-MP, though a vector-processor machine, had a scalable architecture. Cray still sells scalable vector-processing supercomputers, such as the T90 and J90. Over the long haul, however, the price point of RISC processors will allow the construction of ever-larger, more powerful parallel-processing systems.
Supershakeout and Recovery
During this decade, recession and the Cold War's end have shrunk research funds drastically. The savvier supercomputer companies began exploring other market possibilities before the money dried up. Businesses were keen on using the processing power of a supercomputer, but they weren't in a hurry to buy: The huge price tag was hard to justify, especially if a business's work load outgrew the system's capabilities in just a few years. Also, a proprietary supercomputer OS would restrict any custom in-house applications to running on that type of system and no other. And you still couldn't buy supercomputer software at Egghead.
Supercomputer companies responded to these issues. To address price sensitivity, supercomputer designs started to feature merchant or "commodity" workstation parts, such as IBM's Power2, Hewlett-Packard's PA-RISC 7200, and SGI's Mips R8000. Common peripherals, such as SCSI-2 drives, 100-Mbps Ethernet, asynchronous transfer mode (ATM), and Fiber Distributed Data Interface (FDDI) network connections, also became part of the mix and kept costs low.
Commodity parts plus a scalable architecture also solved the growth issue: As a business expanded, you could add more (less expensive) processors and hard drives to meet computing demands. Prices for a basic scalable supercomputer today, which can start at around $100,000, reflect the new market reality.
Finally, and perhaps most important, supercomputer companies adopted widely accepted workstation OSes, such as IBM's AIX, Sun's Solaris, and SGI's Irix 6. This drove a stake into the heart of proprietary software concerns. Supercomputers can now tap into the existing base of workstation applications and customers. When an office's work load overwhelms its workstations, the company can reasonably migrate upstream to a supercomputer. It can then use the workstations as terminals to submit jobs to the supercomputer or to handle smaller jobs.
Superpower at Work
Different markets have accepted supercomputers because of their low price and scalable processing power. Oil companies can improve their accuracy in finding future oil and gas reserves by processing seismic data and simulating reservoir flow, reducing their typically astronomical drilling costs. Car manufacturers can simulate crashes of prototypes, speeding new -- yet reliable -- models into the showroom six months ahead of the competition.
You don't have to be in the Fortune 100 to profit from supercomputer smarts, either. Banks use supercomputers to handle thousands of on-line transactions and to call up credit histories within seconds. Retail businesses sift through gigabytes of point-of-sale receipts to data-mine important trends.
While the supercomputers of yesterday might have generated results of interest only to a handful of physicists, today's supercomputers offer something for everyone. Whether it's marketing a new product, speeding catalog-order turnaround, or manufacturing car parts, most jobs can now benefit -- economically -- from the power of supercomputers.
Alcoa Aluminum's adoption of a new-generation Convex supercomputer is a classic example. Simulating the casting of large aluminum parts -- possibly for use in cars -- is easier than actually producing them. Walt Wahnsiedler, a technical specialist for Alcoa's process design and smelting, uses a Convex Exemplar SPP1000/CD with eight processors to model aluminum-casting operations. These models help find ways to reduce the stress on the steel dies used in casting, so the dies last longer than the usual 20,000 castings, thus saving costs. They also can reduce defects in the aluminum parts themselves caused by shrinkage during cooling or by pores introduced due to gas.
Alcoa formerly used a Convex C1, and then HP workstations, to run commercial simulation software. Because the Convex Exemplar's SPP-UX OS is binary-compatible with HP-UX applications, Wahnsiedler moved the company's commercial software to the Exemplar with few problems. "It has improved productivity by letting me run more simulations or run simulations that we couldn't do otherwise," he says.
For example, car companies would prefer to make parts out of large one-piece assemblies that are more solid structurally than those made of several pieces welded together; plus, having fewer parts speeds vehicle assembly. "With the Convex," reports Wahnsiedler, "I can now simulate these larger parts, whose models can have up to 1.5 million cells, while reducing defects."
Multiprocessing Architectures
While parallel processing offers a definite cost advantage, its main benefit -- scalability -- can still be difficult to achieve. That's because as you add processors, the contention for shared system resources intensifies. Several different parallel-processing designs address this fundamental problem, each with its own advantages and disadvantages.
The first, symmetric multiprocessing (SMP), has a simple yet effective design, as shown in the figure "Parallel-Processing Architectures." (The SGI Power Challenge and the Cray CS6400 Enterprise Server are examples of SMP designs.) In SMP, multiple processors share RAM and the system bus. This design is also known as tightly coupled, or "shared everything."
Because SMP shares RAM globally, it has only one memory space, which simplifies both system and applications programming. This single memory space lets a threaded OS distribute its tasks among various processors or lets an application obtain the memory it needs for a complex simulation. The globally shared memory also makes data synchronization easy. SMP is one of the most mature parallel-processing designs. It appeared in the Cray X-MP and similar systems over a decade ago.
However, this global memory also contributes to SMP's biggest problem: As you add more processors, memory-bus traffic increases until you reach a point where the bus gets saturated. Adding local cache memory to every processor can reduce some bus traffic, but the bus generally becomes a bottleneck at about eight processors or more.
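To show why a single memory space simplifies programming, here is a minimal sketch of the shared-memory style. It uses Python threads as stand-ins for processors, which is an assumption for illustration only (Python threads do not model bus contention and do not necessarily run simultaneously); the function and variable names are hypothetical. Every worker reads the same input array directly and writes into the same results, so the only coordination needed is a lock around the shared total.

```python
# Illustrative sketch only: threads stand in for SMP processors that all
# see one shared memory space. Names are hypothetical.
import threading

def parallel_sum(values, workers):
    """Sum a list by giving each worker a strided share of the same array."""
    partial = [0] * workers        # shared list, visible to every worker
    total = {"value": 0}           # shared running total
    lock = threading.Lock()        # synchronizes updates to the total

    def work(w):
        share = sum(values[w::workers])   # read shared input directly
        partial[w] = share                # each worker owns one slot
        with lock:                        # one writer at a time
            total["value"] += share

    threads = [threading.Thread(target=work, args=(w,)) for w in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total["value"], partial

total, partial = parallel_sum(list(range(1, 101)), 4)
```

No data is copied or sent anywhere in this sketch; that convenience is what the shared bus pays for in traffic as the processor count rises.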
Massively parallel processing (MPP) is another parallel-processing design. To avoid memory-bus bottlenecks, MPP does not use shared memory. Instead, it distributes the RAM among the processors so that the hardware resembles a network. Because of the loose distribution of RAM resources, this architecture is also known as loosely coupled, or "shared nothing."
To access the memory outside its own RAM, a processor must use a message-passing scheme analogous to network packets. This system reduces bus traffic, because each section of memory sees only those accesses that are bound for it, rather than every memory access, as in an SMP system. This enables large-size MPP systems with hundreds or even thousands of processors.
IBM's RS/6000 Scalable Powerparallel System (SP2 for short) is an example of an MPP system.
The downside to MPP is that it makes programming difficult, because it breaks memory into small separate spaces. Without any globally shared memory space, running (and writing) an application that requires a large amount of RAM (in comparison to local memory) can be difficult. Data synchronization among widely distributed tasks also becomes difficult, particularly if a message must make many hops to the target processor's memory.
Writing an MPP application also requires that you be aware of a program's memory organization. Wherever it's necessary, you have to insert message-passing commands into the program code. Besides complicating the program design, such commands can create hardware dependencies in your applications. However, most supercomputer vendors have safeguarded applications portability by adopting either a public-domain message-passing mechanism, known as Parallel Virtual Machine (PVM), or a developing standard, called Message Passing Interface (MPI), to implement the message-passing mechanism.
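The flavor of message-passing code can be suggested with a small sketch. Note that this does not use the PVM or MPI interfaces; the Node class and the send and worker functions are hypothetical stand-ins built from Python's standard queue and threading modules, with threads playing the role of processors. Each node touches only its own local data, and the only way a value leaves a node is through an explicit message.

```python
# Illustrative sketch only: "shared nothing" nodes that cooperate purely by
# passing messages. Not PVM or MPI; all names here are hypothetical.
import queue
import threading

class Node:
    """One processor with private local memory and an inbox for messages."""
    def __init__(self, rank, local_data):
        self.rank = rank
        self.local_data = local_data
        self.inbox = queue.Queue()

def send(nodes, dest, message):
    """Deliver a message to the destination node's inbox."""
    nodes[dest].inbox.put(message)

def worker(node, nodes, results):
    local_sum = sum(node.local_data)      # uses only this node's memory
    if node.rank != 0:
        send(nodes, 0, (node.rank, local_sum))
    else:
        total = local_sum
        for _ in range(len(nodes) - 1):   # wait for every other node
            _sender, value = node.inbox.get()
            total += value
        results.put(total)

def run(partitions):
    """Start one worker per data partition and return node 0's grand total."""
    nodes = [Node(rank, part) for rank, part in enumerate(partitions)]
    results = queue.Queue()
    threads = [threading.Thread(target=worker, args=(n, nodes, results))
               for n in nodes]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results.get()

grand_total = run([[1, 2, 3], [4, 5], [6]])
```

Even in this toy, the programmer must decide where the data lives and write the send and receive steps by hand, which is the extra burden the article describes.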
How to overcome the difficulties of SMP and MPP? The last parallel architecture, scalable parallel processing (SPP), is a hybrid of both, using a two-tier memory hierarchy to achieve scalability. The first memory tier consists of a node that is essentially an SMP system, complete with multiple processors and their globally shared memory.
You build larger SPP systems by interconnecting two or more nodes via the second memory tier so that this tier appears logically as one global shared-memory space to the nodes. The two-tier memory reduces bus traffic, since only updates to keep memory coherent among the nodes occur. SPP thus offers the easy-to-program SMP programming model while providing scalability similar to that of an MPP design.
The Convex Exemplar is an example of an SPP machine.
Superapps
Because they use workstation OSes and common RISC processors, the new breed of supercomputers inherits a stable of ready-made workstation applications. The SMP architecture that's typical in most systems confers an advantage as well. The globally shared memory lets appropriately modified applications work with larger data arrays or distribute the work load in threads across multiple processors.
However, some software vendors don't want to modify their existing commercial applications for fear of introducing bugs with the rewrite. Also, fully supporting multiprocessor hardware using threads, or writing message-passing code for an MPP system, is a daunting task.
Fortunately, you can find ways around these issues, depending on the application type. Many supercomputer OSes can partition the machine so that some or all of the processors function as stand-alone systems. A company can then fold its existing traditional work flow into the computer.
Some partitions function in a mainframe batch-mode operation, letting users submit jobs from terminals or workstations, while other partitions use multiple processors to tackle compute-intensive jobs. Each stand-alone partition can run its own copy of a mission-critical application, so the system can handle multiple users. This arrangement also lets companies consolidate all their computing services into one box.
Lotus Notes is a case in point, since large companies use it as a collaboration mechanism. A typical enterprise-level installation might have thousands of users, managed by dozens of servers linked across the country or around the world. A setup such as this gives a system manager nightmares because of the software upgrades and maintenance problems involved in dealing with these widely scattered computers. Furthermore, data synchronization among the separate Notes databases is problematic.
However, a properly configured IBM SP2 supercomputer can come to the rescue. First, you configure the SP2 so that each processor operates independently. Next, you run native copies of the Notes server, one per processor, thus creating an array of virtual servers.
This arrangement consolidates all the servers into a single system and eliminates maintenance and upgrade hassles. While Notes currently limits database access to only one server at a time (because all the servers draw from the same database file), synchronization issues disappear. This setup also offers mission-critical redundancy: Some virtual servers in the system can operate as backup units, taking charge when a failure disables an active server.
Another existing job that supercomputers can tackle is handling large volumes of information, either for on-line transaction processing (OLTP) or for data mining (on-line analytical processing, or OLAP). Companies can accumulate gigabytes of sales data daily, and supercomputers warehouse the information for evaluation. How large can these data warehouses grow? Cray Research has shown a 48-processor CS6400 managing a 1.6-TB (a terabyte equals 1 million MB) Oracle7 database.
Fortunately, such a setup does not require modification of the front-end application. Instead, the database managing the back end routes the queries to individual processors. Complex queries that generate multiple data tables can go out to several processors, as long as the results of each table do not depend on another table's data. Key to this sort of operation is parallel-processing-savvy database software, such as Oracle's Oracle7 Parallel Server, Informix's PDQ Dynamic Server Architecture, and Sybase's System 10 Navigation Server.
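The idea of farming independent pieces of a query out to several processors can be sketched briefly. This is only a conceptual illustration and makes no claim about how any of the database products named above work internally: the function names and the sample sales rows are invented, and a thread pool from Python's standard library stands in for the processors. Each partition is summarized on its own, which is possible only because no partial result depends on another, and the partial results are merged at the end.

```python
# Illustrative sketch only: independent sub-queries run on separate workers,
# then merge. Names and sample data are hypothetical.
from concurrent.futures import ThreadPoolExecutor

def total_by_region(rows):
    """Sub-query: total sales per region within one data partition."""
    totals = {}
    for region, amount in rows:
        totals[region] = totals.get(region, 0) + amount
    return totals

def merge(partials):
    """Combine the independent partial results into one answer."""
    merged = {}
    for partial in partials:
        for region, amount in partial.items():
            merged[region] = merged.get(region, 0) + amount
    return merged

def parallel_query(partitions):
    """Run one sub-query per partition, each on its own worker."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(total_by_region, partitions))
    return merge(partials)

# Made-up sales rows, already split into three partitions.
partitions = [
    [("east", 10), ("west", 5)],
    [("east", 7)],
    [("west", 3), ("north", 1)],
]
answer = parallel_query(partitions)
```

The front-end code that calls parallel_query never sees the partitioning, which mirrors the point that the front-end application needs no modification.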
Rob Geller, director of marketing and sales systems at MCI, uses an IBM SP2 system to consolidate the company's demographic and billing data, which used to be stored in several mainframes located around the U.S. The SP2 has 104 nodes and operates on a 3-TB Informix relational database. Geller uses off-the-shelf desktop and workstation software, including Brio's BrioQuery, to mine the data. The results help MCI refine behavioral models that direct marketing and sales efforts. "The SP2 has boosted the rate of our analysis by an order of magnitude," Geller says. "Queries that once took over 2 hours now take only a minute. This also lets us ask questions that weren't possible before."
A supercomputer's capacity lets a company combine the data-processing needs of diverse departments, often with synergistic results. For example, Macklanburg-Duncan, a manufacturer of building-improvement products, uses a Cray CS6400 Enterprise Server to integrate its customer orders and manufacturing and financial information. The system keeps track of inventory for 4000 different products and helps coordinate shipments to 17,000 customers nationwide.
Previously, this data was on separate mainframes. According to Michael Mack, Macklanburg-Duncan's manager of technical services, consolidating all these operations on the Cray gives the company a fast response time on orders. "An order goes through a series of processes, and if this process releases the order, the shipping form gets printed immediately. The entire operation takes only 30 minutes if an order is transmitted electronically," he says.
Phone orders take a little longer: A customer-service consultant enters an order on-line and has the option of immediately processing it or allowing a periodic batch job to perform this task. The company warehouses orders so that geographic and purchase-activity information can help direct the company's decision-making. "Because of the Cray's scalability," Mack continues, "we'll be able to add other divisions or acquisitions without having to purchase additional machines. There will be incremental costs to add resources, but the CS6400 gives us the ability to expand as our business grows."
Tall Buildings to Leap
While supercomputer vendors such as Cray Research and SGI have used ingenious techniques to extend SMP's performance, this architecture will eventually reach a performance limit. In the long term, like it or not, future application designs will have to grapple with MPP systems and message-passing schemes.
Unfortunately, as the description of the MPP architecture suggests, writing such parallel code is not easy. A dearth of development tools -- not surprising, given the relative newness of MPP designs -- aggravates the situation. Debugging is especially complex, since it's difficult to isolate a problem connected to a thread that has crashed on a single processor.
In a scene that's similar to desktop computing, supercomputing hardware has outpaced the software. While writing such software will take time, the situation is not that grim: There's plenty of work for supercomputers to do with the software that's currently available.
Supercomputing's transition has required the parallel processes of design simplification and standardization, the use of multiple inexpensive components, and proper marketing -- all paths similar to those the computer industry as a whole has followed. Supercomputers have escaped their ivory-tower prisons to help businesses with such mundane-but-crucial tasks as rapid order turnaround and strategic decisions made with the help of data mining.
There seems little doubt that the lessons learned in the stratosphere of computing will eventually appear in desktop machines that routinely use similar multiple-processor architectures. Thus, the final triumph of supercomputing may be its assimilation into desktop computers everywhere.
WHERE TO FIND
Convex Computer Corp.
Richardson, TX
(214) 497-4000
fax: (214) 497-4848
http://www.convex.com
Cray Research
Business Systems Division
Beaverton, OR
(800) 289-2729
(503) 641-3151
fax: (503) 520-7724
CS6400@cray.com
http://www.cray.com
IBM Corp.
Somers, NY
(800) 426-3333
http://lscftp.kgn.ibm.com/pps/
Silicon Graphics, Inc.
Mountain View, CA
(800) 800-7441
(415) 960-1980
fax: (415) 961-0595
http://www.sgi.com