to supply video on demand for in-house training sessions, manage numerous Internet connections, and help mine information from companywide data warehouses.
These kinds of jobs can quickly overwhelm a server that's little more than a souped-up desktop computer. Next-generation servers must efficiently manage resources on a far larger scale--gigabytes of memory, multiple CPUs, dozens of I/O ports, stacks of disk arrays, and piles of peripherals.
To meet these challenges, Tandem has designed a novel system architecture that ties a server's processors, memory, and peripherals together with a mesh of high-speed connections and smart switches. Like any intelligent network, this architecture allows basic I/O operations, such as memory fetches and peripheral accesses, to explore alternate paths when the system is laboring under heavy loads.
Sequent has designed a new server architecture that's equally clever. It lets you assemble a scalable system from basic building blocks of processors and local memory, all tied together with a high-speed communications link. This architecture also resembles a network.
Tandem and Sequent aren't the only server vendors to see the merits of these architectures. NEC has announced support for Tandem's architecture in its RISC-based Windows NT servers. Compaq and Tandem have jointly announced a PCI board that lets you tie together servers via a special network so they can share resources -- a technique known as clustering. Using Windows NT 4.0, this clustering arrangement provides a "fail-over" capability: If one server crashes, the others automatically step in to handle the load.
Tandem's ServerNet
Most servers today are based on a parallel-processing model in which multiple CPUs divide and conquer the work load. There are two general approaches to this model: symmetric multiprocessing (SMP) and massively parallel processing (MPP), and each has its own trade-offs (see "The World's Fastest Computers," January BYTE). Tandem's new architecture, known as ServerNet, gives you the flexibility to build either type of system. This is the foundation for Tandem's Integrity S4000 servers.
ServerNet has three components (see the figure "The ServerNet Architecture"). The first is a low-cost, high-speed router. Tandem uses several arrays of these routers to construct a packet-switched, point-to-point, interconnected mesh inside the server. This is called the system area network (SAN), to distinguish it from the LAN outside the computer.
The second ServerNet component is a processor-interface chip, which is implemented as an ASIC. This custom chip provides the critical connections between the processors, their local memory, and the router network. With this arrangement, the majority of the processor-to-memory transfers remain local to the processor-interface ASIC -- they don't have to venture onto the SAN.
Also, depending on how the designers arrange the processors and memory, it's possible to build either an SMP or an MPP system with ServerNet. For example, the processor-interface ASIC might share a bank of memory among several processors, thus creating an SMP system. Finally, the processor-interface ASIC is dual-ported, so a server with duplicate SANs can form a fault-tolerant system.
Rounding out ServerNet is the third component, a peripheral-device-interface ASIC. This chip provides SAN connections to communications devices (e.g., external network interfaces) and to standard I/O buses (e.g., PCI and SCSI). All the ServerNet ASICs have built-in error-checking logic, so designers can use them in fault-tolerant servers.
Each SAN router has six bidirectional ports, so designers can arrange the server elements in a wide variety of internal topologies -- including meshes, trees, and hypercubes -- depending on the requirements of the server applications. The routers can rapidly switch data among the SAN's various I/O devices and compute nodes. (A compute node consists of one or more processors and local memory.) Playing the role of traffic cop, the routers can transfer data between individual compute nodes, between nodes and I/O devices, and between different I/O devices.
Because all the data moves directly among these elements and not through a central bus, ServerNet is much more efficient than a traditional design. Signal paths are shorter, and the routers can find alternate pathways; thus, there's less chance of a bottleneck during heavy system loads.
The Smart Switch
It's worth looking at the ServerNet routers in more detail to see how they make such a distributed system architecture possible. As mentioned earlier, each router has six bidirectional serial data links, or ports. Each port has two transmit channels and two receive channels; each channel consists of a 9-bit command/data bus and a clock signal. Every channel uses a 9-bit token that encodes 256 data symbols and 20 command symbols. The command symbols are for initialization, error detection, and low-level flow control.
The ports employ differential logic to drive signals through cables up to 30 meters long.
Data packets on the SAN can be up to 80 bytes long. Each packet has an 8-byte header, a 4-byte ServerNet address, a variable-length data payload that can range up to 64 bytes, and a 4-byte checksum (a cyclic redundancy check). The header specifies the type of operation to perform (read request, read response, write request, write response, and so on). It also contains routing information--a pair of 20-bit IDs that specify the packet's source and destination--and the length of the data payload. This small packet size reduces network transfer latencies. It also minimizes buffering requirements, resulting in a more economical router design.
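To make the packet format concrete, here is a minimal Python sketch that assembles and checks a packet of the sizes given above. The field sizes come from the text; everything else is an assumption made for illustration: the order of the fields inside the header, the operation codes, the names `build_packet` and `parse_packet`, and the use of `zlib.crc32` as a stand-in for whatever CRC ServerNet actually computes.

```python
import struct
import zlib

# Sizes taken from the article.
HEADER_LEN, ADDRESS_LEN, MAX_PAYLOAD, CRC_LEN = 8, 4, 64, 4
MAX_PACKET = HEADER_LEN + ADDRESS_LEN + MAX_PAYLOAD + CRC_LEN  # 80 bytes

# Hypothetical operation codes; the article names the operations, not their values.
READ_REQUEST, READ_RESPONSE, WRITE_REQUEST, WRITE_RESPONSE = range(4)


def build_packet(op, source_id, dest_id, address, payload=b""):
    """Assemble header + address + payload + CRC under the assumed layout."""
    if len(payload) > MAX_PAYLOAD:
        raise ValueError("payload exceeds 64 bytes")
    if not (0 <= source_id < 1 << 20 and 0 <= dest_id < 1 << 20):
        raise ValueError("IDs must fit in 20 bits")
    # Assumed packing of the 64-bit header:
    # op (8 bits) | source ID (20) | destination ID (20) | payload length (8) | 8 spare
    header_word = (op << 56) | (source_id << 36) | (dest_id << 16) | (len(payload) << 8)
    body = struct.pack(">QI", header_word, address) + payload
    crc = zlib.crc32(body)  # stand-in CRC, not necessarily ServerNet's polynomial
    return body + struct.pack(">I", crc)


def parse_packet(packet):
    """Verify the checksum and split the packet back into its fields."""
    body = packet[:-CRC_LEN]
    (crc,) = struct.unpack(">I", packet[-CRC_LEN:])
    if zlib.crc32(body) != crc:
        raise ValueError("checksum mismatch")
    fixed = HEADER_LEN + ADDRESS_LEN
    header_word, address = struct.unpack(">QI", body[:fixed])
    length = (header_word >> 8) & 0xFF
    return {
        "op": header_word >> 56,
        "source_id": (header_word >> 36) & 0xFFFFF,
        "dest_id": (header_word >> 16) & 0xFFFFF,
        "address": address,
        "payload": body[fixed:fixed + length],
    }
```

A write request carrying a full 64-byte payload comes out at exactly 80 bytes, which matches the maximum packet size quoted above.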
The router itself has first-in/first-out (FIFO) buffers for the input data, some arbitration and control logic, a RAM-based routing table, and a 6 by 6 cross-bar switch that links all the data channels. ServerNet uses wormhole routing instead of a store-and-forward mechanism to further reduce network transfer latencies.
Wormhole routing is a technique in which the router begins forwarding an incoming packet to its destination before the entire packet is received. As the packet's first few header bytes arrive, the router's control logic extracts the destination ID and uses this value as an index into the routing table. The routing table returns the number of the output port that points to the packet's destination router. If that port is busy, the incoming bytes drop into the router's FIFOs. The control logic issues a flow-control command to throttle the sending router until the port becomes available.
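The following Python sketch models a single router's forwarding decision as described above. It is a toy, not Tandem's logic: the class `WormholeRouter`, the `extract_dest` header decoder, and the choice to wait for all 8 header bytes before the table lookup are assumptions made for illustration. It also keeps accepting bytes into the FIFO while throttled, whereas real hardware has a finite buffer and relies on flow control to pause the sender.

```python
from collections import deque


class WormholeRouter:
    """Toy model of one router: forwards a packet once its header is decoded."""

    HEADER_BYTES = 8  # header length given in the article

    def __init__(self, routing_table, extract_dest):
        self.routing_table = routing_table  # destination ID -> output port number
        self.extract_dest = extract_dest    # hypothetical header decoder
        self.busy_ports = set()
        self.fifo = deque()                 # holds bytes while the port is busy
        self.forwarded = []                 # (port, byte) pairs in send order
        self.throttled = False              # True while the sender is told to pause

    def receive(self, packet):
        """Consume a packet byte by byte; return the chosen output port."""
        header = bytearray()
        port = None
        for byte in packet:
            if port is None:
                header.append(byte)
                if len(header) < self.HEADER_BYTES:
                    continue
                # Header complete: pick the output port before the payload arrives.
                port = self.routing_table[self.extract_dest(bytes(header))]
                pending = header
            else:
                pending = [byte]
            if port in self.busy_ports:
                self.fifo.extend(pending)   # park the bytes and throttle the sender
                self.throttled = True
            else:
                self.forwarded.extend((port, b) for b in pending)
        return port

    def port_freed(self, port):
        """Drain the buffered bytes once the output port becomes available."""
        self.busy_ports.discard(port)
        while self.fifo:
            self.forwarded.append((port, self.fifo.popleft()))
        self.throttled = False
```

The point of the model is the ordering: the payload bytes start leaving on the output port while later bytes of the same packet are still arriving, instead of waiting for the whole packet as a store-and-forward switch would.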
In addition, both the processor-interface and peripheral-device ASICs can "pull" (read) as well as "push" (write) data. This accommodates I/O devices with different speeds and buffer sizes by allowing them to pull data from memory as they need it. Pulling data also enables ServerNet to support many active I/O devices simultaneously without resorting to a multithreaded DMA engine.
Current ServerNet implementations use Tandem's NonStop-UX OS, based on Unix System V release 4.2 MP. In a ServerNet system that implements an SMP architecture, applications programs should run without modification. The OS code and drivers need some work to support the direct availability of all I/O devices to all compute nodes. For ServerNet systems based on an MPP architecture, developers must modify their applications programs to support message passing--a requirement of MPP, not ServerNet.
Thanks in large part to wormhole routing, the latency of a ServerNet router is only 300 nanoseconds per hop. The latency of a zero-length message (e.g., a read response) can be as low as 1 microsecond for a single-level router path. With a full 80-byte packet on a large ServerNet, the latency is 3 µs. The effective data transfer rate between routers is 40 MBps.
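A rough back-of-the-envelope model shows why wormhole routing matters for these numbers. The Python sketch below uses the 300-ns hop latency and 40-MBps link rate quoted above. The two formulas are a simplified textbook model rather than Tandem's own analysis, and the function names are made up for the example.

```python
HOP_LATENCY_S = 300e-9   # per-router latency quoted in the article
LINK_RATE_BPS = 40e6     # bytes per second between routers (40 MBps)


def wormhole_latency(packet_bytes, hops):
    """The packet is serialized once; each router adds only its switching delay."""
    return packet_bytes / LINK_RATE_BPS + hops * HOP_LATENCY_S


def store_and_forward_latency(packet_bytes, hops):
    """Each router receives the whole packet before sending it on the next link."""
    links = hops + 1  # source -> router ... router -> destination
    return links * (packet_bytes / LINK_RATE_BPS) + hops * HOP_LATENCY_S


if __name__ == "__main__":
    for hops in (1, 3, 5):
        w = wormhole_latency(80, hops) * 1e6
        s = store_and_forward_latency(80, hops) * 1e6
        print(f"{hops} hops: wormhole {w:.1f} us, store-and-forward {s:.1f} us")
```

Under these assumptions, a full 80-byte packet crossing three routers takes roughly 2.9 µs with wormhole routing, in the same range as the 3-µs figure above, versus roughly 8.9 µs if every router had to store the packet first.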
But perhaps more significant is the aggregate transfer rate. Because the routers have multiple interconnections to processor nodes and devices, a ServerNet SAN can deliver spectacular performance. Tandem claims that a system with 4096 compute nodes and peripherals--and with 4680 to 7680 routers arranged in a fractahedral topology--can achieve an aggregate switching bandwidth of 410 GBps. By comparison, some supercomputers have a maximum switching bandwidth of only 1.2 to 2 GBps. (To be fair, these supercomputers achieve this rate with only 32 processors, not 4096.)
ServerNet's distributed architecture, though complex, offers a number of advantages. By providing alternate data paths, it dramatically reduces system bottlenecks and achieves a high switching bandwidth. Since the routers manage data transfers, they relieve the system processors of this job, which is especially significant for I/O. Perhaps most important, the ServerNet architecture offers scalability for different types of server architectures (see the figure "ServerNet's Flexible Design").
MPP systems can use ServerNet as a high-speed connection mesh between various compute nodes and peripherals. SMP servers can use ServerNet routers and peripheral-interface ASICs to provide scalable I/O (e.g., increasing the number of Ethernet connections or boosting disk storage) without degrading the performance of the OS or applications software. Because of its point-to-point network capabilities, ServerNet also allows you to expand server capacity by clustering.
Finally, you can use ServerNet to build a fault-tolerant server by connecting two duplicate systems through the dual ports of the processor-interface ASICs. The error-checking logic in the ServerNet ASICs monitors all transfers. If either the processor-interface ASICs or the routers detect an error, they can trigger a recovery protocol that disables the offending component.
Building Blocks
While Tandem takes a distributed interconnection approach among server components to boost server bandwidth, Sequent favors a modular architecture in which the server is built out of high-performance building blocks mortared together with high-speed communications links. Sequent refers to each block in this structure as a Quad, because each one consists of four Pentium Pro CPUs on a multiprocessor system bus. The Quad thus takes advantage of the Pentium Pro's built-in four-way multiprocessor bus, which handles bus arbitration and resource control (see "How to Make Pentium Pros Cooperate," April BYTE).
Each Quad has 512 MB to 4 GB of RAM, seven PCI slots, special communications logic, and 32 MB of cache RAM. The communications logic is the intelligent interconnection between the memory of two or more Quads.
By assembling multiple Quads, you can build a low-latency, scalable SMP server. Of course, if Sequent had stopped there, you'd still have the problem of bus saturation when you added more processors. Sequent's ingenious solution to this is to operate each Quad's multiprocessor bus independently, like the separate buses in an MPP system.
Unlike in an MPP system, however, the communications logic interconnects each Quad's RAM so that the distributed memory behaves like a global block of shared memory. Sequent refers to this communications logic as "IQ-Link" because it intelligently manages memory I/O. Only when a Quad accesses memory in another Quad does the transaction cross the interconnection bus. Because this bus handles only the occasional global memory access, it can manage a large-scale SMP system with more than eight CPUs.
Note that accesses to a Quad's local memory are fast (250 ns), while off-Quad accesses are slower (3 µs). Because different parts of the system have different memory latencies, this mechanism is known as nonuniform memory access, or NUMA. The IQ-Link transparently maintains data coherency among all the separate blocks of memory. It's similar to the way a processor's on-chip cache operates, which is why the IQ-Link's mechanism is sometimes described as a cache-coherent NUMA (CC NUMA). Sequent calls this NUMA with Quads, or NUMA-Q.
Home on the RAM
Programs tend to access memory in closely related groups of addresses due to tight code loops or sequential searches through data arrays. This behavior is known as locality of reference, and it's an important factor in the high performance of a Sequent NUMA-Q. It means that most references won't stray beyond a Quad's local memory and that those references that access global memory will return more quickly, thanks to the high-speed IQ-Link.
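To see how much locality matters, consider a simple weighted average of the two memory latencies Sequent quotes: 250 ns for local memory and 3 µs for off-Quad memory. The Python sketch below is only an illustration. It ignores the CPU caches and the effect of each Quad's own remote-data cache, and the function name `average_access_ns` is made up for the example.

```python
LOCAL_NS = 250     # local-memory access time from the article
REMOTE_NS = 3000   # off-Quad access time from the article (3 microseconds)


def average_access_ns(local_fraction):
    """Weighted average latency when `local_fraction` of accesses stay on-Quad."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * LOCAL_NS + (1.0 - local_fraction) * REMOTE_NS


if __name__ == "__main__":
    for fraction in (0.50, 0.90, 0.99):
        print(f"{fraction:.0%} local: about {average_access_ns(fraction):.0f} ns")
```

With half of all accesses going off-Quad, the average is about 1.6 µs. At 90 percent local it falls to roughly 525 ns, and at 99 percent local it is within about 10 percent of pure local-memory speed.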
Part of the time, in fact, a Quad's Pentium Pro processor will find the data it needs in its primary (L1) or secondary (L2) caches. (Remember, the Pentium Pro has a 256- or 512-KB secondary cache closely coupled to the CPU in a multichip package.) If the CPU can't find what it needs in the L1 or L2 caches, the cache miss will most likely fall within the Quad's local memory. If not, the IQ-Link first searches the Quad's own 32-MB cache, which Sequent describes as an L3 cache. This is a directory-based cache that holds copies of data from other Quads, and its latency is the same as that of local memory.
Only when the CPU misses the L3 cache does the IQ-Link issue a request on the interconnect bus to access distant memory, as shown in the figure "Keeping CPUs Cache-Coherent". This bus is based on the IEEE 1596-1992 Scalable Coherent Interface (SCI) standard. It's a one-way, point-to-point loop that connects the Quads together in a daisy chain. It uses a packet-based protocol that supports cache-coherent distributed/shared memory. In this sense, the SCI bus also resembles a network.
When a Quad's IQ-Link logic receives a request packet over the SCI bus, it fetches the requested data from local memory. The data circles the SCI loop to the Quad that made the request and then arrives in that Quad's L3 caches; they don't explicitly copy the data into local memory. To maintain memory coherency, the IQ-Link relays any modifications, such as setting a semaphore, to the appropriate L3 caches of other Quads.
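The lookup order and the coherency relay can be sketched as a toy Python model. Everything about it is a simplification made for illustration: the class names `Quad` and `NumaSystem` are hypothetical, the CPU's L1 and L2 caches are left out, and modifications are pushed as updated values to every L3 cache holding a copy, which is an assumption about behavior rather than a description of the actual SCI coherence protocol.

```python
class Quad:
    """Toy model of one Quad: home memory plus an L3 cache of remote data."""

    def __init__(self, quad_id, base, size):
        self.quad_id = quad_id
        self.base, self.size = base, size  # this Quad's non-overlapping address range
        self.local_memory = {}             # address -> value for home addresses
        self.l3_cache = {}                 # copies of data whose home is another Quad

    def owns(self, address):
        return self.base <= address < self.base + self.size


class NumaSystem:
    def __init__(self, quad_count, bytes_per_quad):
        self.quads = [Quad(i, i * bytes_per_quad, bytes_per_quad)
                      for i in range(quad_count)]
        self.sci_requests = 0              # reads that had to cross the interconnect

    def home_of(self, address):
        return next(q for q in self.quads if q.owns(address))

    def read(self, quad_id, address):
        """Return (value, where it was found) for a read issued by one Quad."""
        quad = self.quads[quad_id]
        if quad.owns(address):
            return quad.local_memory.get(address, 0), "local memory"
        if address in quad.l3_cache:
            return quad.l3_cache[address], "L3 cache"
        # L3 miss: fetch from the home Quad and keep the copy in L3 only.
        self.sci_requests += 1
        value = self.home_of(address).local_memory.get(address, 0)
        quad.l3_cache[address] = value
        return value, "remote memory"

    def write(self, quad_id, address, value):
        """Update home memory, then relay the change to L3 caches holding a copy."""
        self.home_of(address).local_memory[address] = value
        writer = self.quads[quad_id]
        if not writer.owns(address):
            writer.l3_cache[address] = value
        for quad in self.quads:
            if address in quad.l3_cache:
                quad.l3_cache[address] = value
```

In this model, the first read of a remote address crosses the interconnect, a repeat read is satisfied from the requesting Quad's L3 cache, and a later write by the home Quad is visible in that L3 copy without another remote fetch.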
IQ-Link and the SCI bus thus segment a NUMA-Q system's Quads in a manner similar to the way in which a LAN segments a computer network. That is, programs get quick access to the most often-used data (because it's in a Quad's local memory or in the L3 cache), while accesses to less frequently used data need only go through the SCI "backbone." Because off-Quad references are infrequent, the SCI bus remains uncongested; thus, larger SMP systems can be built around a NUMA-Q design.
The data-pump ASICs for the SCI bus are made of gallium arsenide--a more exotic semiconductor than conventional silicon--and can transfer 1 GB of data per second. A Quad's multiprocessor bus can achieve a bandwidth of 500 MBps. Because the NUMA-Q architecture can manage up to 63 Quads (252 processors) using only one instance of the OS, the aggregate system bandwidth of a server can reach nearly 32 GBps.
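The arithmetic behind these totals is straightforward, as the short Python sketch below shows. It assumes that the aggregate figure is simply the sum of the Quads' independent bus bandwidths and that 1 GB is counted as 1000 MB. The function name `numa_q_totals` is made up for the example.

```python
CPUS_PER_QUAD = 4      # four Pentium Pro CPUs per Quad
MAX_QUADS = 63         # maximum under a single OS instance, per the article
QUAD_BUS_MBPS = 500    # bandwidth of one Quad's multiprocessor bus


def numa_q_totals(quads=MAX_QUADS):
    """Return (processor count, summed bus bandwidth in GBps) for a configuration."""
    total_cpus = quads * CPUS_PER_QUAD
    aggregate_gbps = quads * QUAD_BUS_MBPS / 1000  # assumes 1 GB = 1000 MB
    return total_cpus, aggregate_gbps


if __name__ == "__main__":
    cpus, gbps = numa_q_totals()
    print(cpus, gbps)  # prints: 252 31.5
```

A fully populated system therefore has 252 processors and 31.5 GBps of summed bus bandwidth, which is the "nearly 32 GBps" quoted above.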
Not enough power? You can, of course, cluster several NUMA-Q nodes together in an external Ethernet, asynchronous transfer mode (ATM), or Fibre Channel network. In this arrangement, each SMP node runs its own copy of the OS and applications while sharing disks and communications peripherals with other nodes. Since each Quad has a PCI interface, you can distribute peripherals throughout the system to reduce competition for I/O devices. You have a choice of two OSes: DYNIX/ptx (Sequent's SMP-enhanced Unix) or Windows NT.
In a sense, NUMA-Q offers the best features of SMP and MPP architectures. To software, it looks like an SMP system, so existing applications can take advantage of the extra processing power and larger memory space without modification. Threaded OSes can readily distribute tasks among the various processors for load balancing. Yet you get the scalable I/O of an MPP system, because each Quad has its own I/O ports, and the IQ-Link minimizes systemwide bus traffic while maintaining memory coherency.
The Future Is Networks
If ServerNet and NUMA-Q are any indication of things to come, tomorrow's servers will differ greatly from conventional desktop computers. Their architectures will certainly be much more complex. Instead of using just a couple of buses for memory and device I/O, they will resemble a mesh of interconnected components. However, this complexity will pay off in performance high enough to handle the demands of tomorrow's server applications.
In addition, the servers of tomorrow will be more scalable and custom-tailored for the particular job at hand. With technologies such as ServerNet and NUMA-Q, you can assemble as many elements as you need for a particular job's processing requirements. As your needs grow, you can add more processing power or storage capacity.
Where to Find
Sequent Computer Systems
Beaverton, OR
Phone: (800) 257-9044 or (503) 626-5700
Internet: http://www.sequent.com/