Current supercomputers can be roughly divided into two categories: vector machines and massively parallel machines. Almost all vector supercomputers can be purchased with multiple processors, so processor count alone does not separate the two. The key distinction is that parallel supercomputers depend on using many processors at once to deal with a single problem.
Parallel machines rarely provide enough performance to handle a grand-challenge application using only one processor at a time. Vector machines, on the other hand, are almost exclusively used as a group of independent processors that share resources. A very small percentage of the applications currently running on vector machines use more than one processor at once.
A Vector Supercomputer: The Cray C90
The C90 comprises a family of related machines, the most powerful of which, the C916, can have between eight and 16 processors. It has a clock cycle of 4.2 nanoseconds; the 15-ns memory is implemented in BiCMOS. A C916 system can have as much as 8 GB of memory.
During each clock cycle, two operands can be loaded from memory, and one can be stored for each pipeline. But due to the latency of the memory subsystem, memory operations must be scheduled properly to achieve maximum throughput. (For applications that require more memory, Cray offers an alternate line called the M90; these systems have lower floating-point performance but can support several times more memory.)
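As a rough back-of-the-envelope check on what these figures imply, the sketch below converts the per-cycle memory operations into a peak bandwidth for one pipeline. It assumes 64-bit (8-byte) operands and reuses the 4.2-ns clock cycle quoted earlier; the function name is invented for illustration, and the result is a theoretical ceiling, not a measured figure.

```python
# Rough peak memory traffic for a single vector pipeline.
# Assumption (not stated in the article): each operand is a 64-bit word.

CYCLE_NS = 4.2          # clock cycle quoted in the text
WORD_BYTES = 8          # assumed operand size in bytes
LOADS_PER_CYCLE = 2     # operands loaded per pipeline per cycle (from the text)
STORES_PER_CYCLE = 1    # operands stored per pipeline per cycle (from the text)


def pipeline_bandwidth_gbps(cycle_ns=CYCLE_NS, word_bytes=WORD_BYTES):
    """Peak bytes moved per second by one pipeline, in GB/s (10**9 bytes)."""
    words_per_cycle = LOADS_PER_CYCLE + STORES_PER_CYCLE
    bytes_per_cycle = words_per_cycle * word_bytes
    # Bytes per nanosecond is numerically the same as gigabytes per second.
    return bytes_per_cycle / cycle_ns


if __name__ == "__main__":
    print(f"{pipeline_bandwidth_gbps():.2f} GB/s per pipeline (theoretical peak)")
```

Memory latency and scheduling conflicts, as noted above, keep real code below this ceiling.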
The maximum I/O bandwidth of the C90 is 13.6 GBps; it's handled by a variety of networks. The system can be connected to a solid-state disk (i.e., a large RAM drive) that stores up to 32 GB and supports access at the full I/O bandwidth. Physically, the machine takes up 48 square feet, and the Freon cooling unit requires another 50 square feet. The system can require more than 300 kilowatts of electrical power to run.
The core of the C90's floating-point performance, which peaks at about 1 GFLOPS per processor, comes from the vector processors. It's up to the programmer and the compiler to see that those processors are used effectively. Over the past few decades, scientific programmers have become used to programming for vector supercomputers and have learned how to write efficient code for them. Although it is rare to have code achieve a sustained throughput of anything close to 1 GFLOPS, a lot of real-world applications achieve hundreds of MFLOPS.
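To put the peak figure in context, the short sketch below relates it to the 4.2-ns clock cycle quoted earlier. It derives the clock rate and the number of floating-point results per cycle that a 1-GFLOPS peak would imply, then expresses a sustained rate as a fraction of peak. The 300-MFLOPS sample value is a hypothetical stand-in for "hundreds of MFLOPS", not a measurement, and the function names are illustrative.

```python
# Relating the quoted peak rate to the quoted clock cycle.

CYCLE_NS = 4.2      # clock cycle from the text
PEAK_GFLOPS = 1.0   # "about 1 GFLOPS per processor" from the text


def clock_mhz(cycle_ns):
    """Clock frequency in MHz for a cycle time given in nanoseconds."""
    return 1000.0 / cycle_ns


def flops_per_cycle(peak_gflops, cycle_ns):
    """Floating-point results per clock cycle implied by a peak rate.

    GFLOPS times nanoseconds gives operations per cycle directly,
    because 10**9 operations/s times 10**-9 s cancels out.
    """
    return peak_gflops * cycle_ns


def fraction_of_peak(sustained_mflops, peak_gflops):
    """Sustained rate as a fraction of the peak rate."""
    return sustained_mflops / (peak_gflops * 1000.0)


if __name__ == "__main__":
    print(f"clock: {clock_mhz(CYCLE_NS):.0f} MHz")
    print(f"implied results per cycle: {flops_per_cycle(PEAK_GFLOPS, CYCLE_NS):.1f}")
    # 300 MFLOPS is a hypothetical sample value, chosen only for illustration.
    print(f"fraction of peak at 300 MFLOPS: {fraction_of_peak(300.0, PEAK_GFLOPS):.0%}")
```

In other words, a processor has to complete roughly four floating-point results on every cycle to reach the peak figure, which is why sustained rates fall well short of it.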
The C90's operating system is UNICOS, a Unix variant. The system comes with highly tuned compilers for various languages (including C and FORTRAN 77). Cray has also built a variety of tools for measuring the performance of an application and discovering inefficiencies or hot spots that need to be optimized.
A Parallel Supercomputer: Intel's Paragon
The Paragon is a descendant of earlier Intel machines. Intel began building parallel hypercube systems during the mid-1980s and then moved to a two-dimensional mesh with its Touchstone Delta.
The Paragon is similar in design to the Delta, but it uses faster, 50-MHz i860/XP processors with built-in support for network communications. Each processor can have up to 128 MB. Routing communications between processors through the mesh is handled by separate network chips; the bisection bandwidth ranges from less than 1 GBps all the way up to several GBps, depending on the machine's configuration.
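The dependence on configuration follows from the mesh geometry, and a small sketch can make it concrete. It assumes a rectangular two-dimensional mesh without wraparound links, cut in half across its longer dimension (which must have an even number of nodes for the halves to be equal). The per-link bandwidth and the mesh shapes are placeholder parameters, not Paragon specifications, and the function names are illustrative.

```python
# Bisection bandwidth of a simple 2-D mesh, under the assumptions stated above.


def bisection_links(rows, cols):
    """Links severed when a rows x cols mesh is cut in half.

    Cutting perpendicular to the longer dimension severs one link per
    node along the shorter dimension.
    """
    if rows < 1 or cols < 1:
        raise ValueError("mesh dimensions must be positive")
    return min(rows, cols)


def bisection_bandwidth_gbps(rows, cols, link_gbps):
    """Total bandwidth across the cut, given a per-link bandwidth in GB/s."""
    return bisection_links(rows, cols) * link_gbps


if __name__ == "__main__":
    LINK_GBPS = 0.2  # hypothetical per-link bandwidth, for illustration only
    for rows, cols in [(4, 8), (16, 32)]:  # hypothetical mesh shapes
        bw = bisection_bandwidth_gbps(rows, cols, LINK_GBPS)
        print(f"{rows} x {cols} mesh: {bw:.1f} GB/s across the bisection")
```

Under these assumptions, the bisection bandwidth grows with the shorter side of the mesh, not with the total processor count, so doubling the number of nodes does not necessarily double it.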
I/O is performed through a HiPPI (High-Performance Parallel Interface) that supports up to 100 MBps. For comparison, the Cray C90 supplies not only a HiPPI but also a variety of other interfaces that can support as much as 1.8 GBps per channel.
I/O performance is often an Achilles' heel for parallel machines. This is particularly true of a system like the Paragon, which can be configured to provide much higher theoretical CPU performance than even the biggest Cray vector machine.
For some applications, the Paragon attains extremely high performance. For instance, a 3680-processor Paragon achieves 143 GFLOPS on the LINPACK benchmark, as compared to just 13.7 GFLOPS for a 16-processor C90.
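Dividing these LINPACK results by the processor counts shows how differently the two machines reach their totals. The sketch below uses only the figures quoted above; the function and variable names are illustrative.

```python
# Per-processor LINPACK throughput from the figures quoted in the text.

SYSTEMS = {
    # name: (processors, LINPACK GFLOPS)
    "Paragon": (3680, 143.0),
    "C90": (16, 13.7),
}


def per_processor_mflops(processors, gflops):
    """Average delivered MFLOPS per processor."""
    return gflops * 1000.0 / processors


if __name__ == "__main__":
    for name, (procs, gflops) in SYSTEMS.items():
        rate = per_processor_mflops(procs, gflops)
        print(f"{name}: {rate:.1f} MFLOPS per processor")
```

The Paragon's total is roughly ten times higher, but each of its processors delivers only about 39 MFLOPS against roughly 856 MFLOPS for each C90 processor. The parallel machine wins only when thousands of processors can be kept busy on the same problem.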
But achieving such performance on a massively parallel machine is difficult. At present, virtually every application that executes efficiently on massively parallel systems is hand-coded; the programmer directly specifies the data that is to be communicated between nodes using message-passing primitives. Intel provides libraries of tuned communications routines and tools to aid in performance monitoring and debugging on the Paragon, but the process is far from painless.
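To give a feel for what hand-coded message passing involves, here is a minimal sketch of the style. It is not written against Intel's libraries: Python threads stand in for nodes, queues stand in for the network, and every name in it is hypothetical. The point is only that the programmer splits the data and spells out each send and receive.

```python
import queue
import threading


def node(rank, chunk, inboxes, results):
    """One simulated node: sum the local chunk, then exchange messages."""
    partial = sum(chunk)
    if rank == 0:
        total = partial
        # Node 0 explicitly receives one partial sum from every other node.
        for _ in range(len(inboxes) - 1):
            _sender, value = inboxes[0].get()
            total += value
        results["total"] = total
    else:
        # Every other node explicitly sends its partial sum to node 0.
        inboxes[0].put((rank, partial))


def parallel_sum(data, num_nodes):
    """Sum data across num_nodes simulated nodes using explicit messages."""
    inboxes = [queue.Queue() for _ in range(num_nodes)]
    results = {}
    # The programmer decides by hand which node owns which elements.
    chunks = [data[r::num_nodes] for r in range(num_nodes)]
    threads = [
        threading.Thread(target=node, args=(r, chunks[r], inboxes, results))
        for r in range(num_nodes)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results["total"]


if __name__ == "__main__":
    print(parallel_sum(list(range(1, 101)), 4))
```

Even in this toy case, the data decomposition and the communication pattern are the programmer's responsibility; on a real machine, so are the cost of each message and the order in which messages arrive.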
Photo: Cray C916