
The Network in the Server


July 1996 / Features / The Network in the Server

Leading-edge server designs are adopting network-like system architectures to boost performance and improve scalability.

Tom Thompson

Multiple processors linked with high-speed data pathways--sounds like standard networking fare, right? Wrong. There's nothing standard about the types of networks that are appearing from a growing number of server vendors: These networks are inside the computer.

Two new system architectures, from Tandem Computers and Sequent Computer Systems, typify this hot trend in server design. Just in time, too. Today's servers must often store gigabytes of complex data sets and juggle thousands of on-line transactions or queries each hour. Many businesses also expect their servers to supply video on demand for in-house training sessions, manage numerous Internet connections, and help mine information from companywide data warehouses.

These kinds of jobs can quickly overwhelm a server that's little more than a souped-up desktop computer. Next-generation servers must efficiently manage resources on a far larger scale--gigabytes of memory, multiple CPUs, dozens of I/O ports, stacks of disk arrays, and piles of peripherals.

To meet these challenges, Tandem has designed a novel system architecture that ties a server's processors, memory, and peripherals together with a mesh of high-speed connections and smart switches. Like any intelligent network, this architecture allows basic I/O operations, such as memory fetches and peripheral accesses, to explore alternate paths when the system is laboring under heavy loads.

Sequent has designed a new server architecture that's equally clever. It lets you assemble a scalable system from basic building blocks of processors and local memory, all tied together with a high-speed communications link. This architecture also resembles a network.

Tandem and Sequent aren't the only server vendors to see the merits of these architectures. NEC has announced support for Tandem's architecture in its RISC-based Windows NT servers. Compaq and Tandem have jointly announced a PCI board that lets you tie together servers via a special network so they can share resources -- a technique known as clustering. Using Windows NT 4.0, this clustering arrangement provides a "fail-over" capability: If one server crashes, the others automatically step in to handle the load.

Tandem's ServerNet

Most servers today are based on a parallel-processing model in which multiple CPUs divide and conquer the work load. There are two general approaches to this model: symmetric multiprocessing (SMP) and massively parallel processing (MPP), and each has its own trade-offs (see "The World's Fastest Computers," January BYTE). Tandem's new architecture, known as ServerNet, gives you the flexibility to build either type of system. ServerNet is the foundation for Tandem's Integrity S4000 servers.

ServerNet has three components (see the figure "The ServerNet Architecture"). The first is a low-cost, high-speed router. Tandem uses several arrays of these routers to construct a packet-switched, point-to-point, interconnected mesh inside the server. This is called the system area network (SAN), to distinguish it from the LAN outside the computer.

The second ServerNet component is a processor-interface chip, which is implemented as an ASIC. This custom chip provides the critical connections between the processors, their local memory, and the router network. With this arrangement, the majority of the processor-to-memory transfers remain local to the processor-interface ASIC -- they don't have to venture onto the SAN.

Also, depending on how the designers arrange the processors and memory, it's possible to build either an SMP or an MPP system with ServerNet. For example, the processor-interface ASIC might share a bank of memory among several processors, thus creating an SMP system. Finally, the processor-interface ASIC is dual-ported, so a server with duplicate SANs can form a fault-tolerant system.

Rounding out ServerNet is the third component, a peripheral-device-interface ASIC. This chip provides SAN connections to communications devices (e.g., external network interfaces) and to standard I/O buses (e.g., PCI and SCSI). All the ServerNet ASICs have built-in error-checking logic, so designers can use them in fault-tolerant servers.

Each SAN router has six bidirectional ports, so designers can arrange the server elements in a wide variety of internal topologies -- including meshes, trees, and hypercubes -- depending on the requirements of the server applications. The routers can rapidly switch data among the SAN's various I/O devices and compute nodes. (A compute node consists of one or more processors and local memory.) Playing the role of traffic cop, the routers can transfer data between individual compute nodes, between nodes and I/O devices, and between different I/O devices.

Because all the data moves directly among these elements and not through a central bus, ServerNet is much more efficient than a traditional design. Signal paths are shorter, and the routers can find alternate pathways; thus, there's less chance of a bottleneck during heavy system loads.

The Smart Switch

It's worth looking at the ServerNet routers in more detail to see how they make such a distributed system architecture possible. As mentioned earlier, each router has six bidirectional serial data links, or ports. Each port has two transmit channels and two receive channels; each channel consists of a 9-bit command/data bus and a clock signal. Every channel uses a 9-bit token that encodes 256 data symbols and 20 command symbols. The command symbols are for initialization, error detection, and low-level flow control. The ports employ differential logic to drive signals through cables up to 30 meters long.

Data packets on the SAN can be up to 80 bytes long. Each packet has an 8-byte header, a 4-byte ServerNet address, a variable-length data payload that can range up to 64 bytes, and a 4-byte checksum (a cyclic redundancy check). The header specifies the type of operation to perform (read request, read response, write request, write response, and so on). It also contains routing information--a pair of 20-bit IDs that specify the packet's source and destination--and the length of the data payload. This small packet size reduces network transfer latencies. It also minimizes buffering requirements, resulting in a more economical router design.
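
The packet format just described can be sketched in a few lines of Python. This is only an illustration: the article gives the field sizes but not the bit layout of the header, so the packing order, the operation codes, and the name build_packet are assumptions, and zlib.crc32 merely stands in for whatever 32-bit CRC ServerNet actually uses.

```python
import struct
import zlib

HEADER_LEN, ADDRESS_LEN, MAX_PAYLOAD, CRC_LEN = 8, 4, 64, 4
MAX_PACKET = HEADER_LEN + ADDRESS_LEN + MAX_PAYLOAD + CRC_LEN  # 80 bytes

# Hypothetical operation codes; the article names the operations, not their encodings.
READ_REQUEST, READ_RESPONSE, WRITE_REQUEST, WRITE_RESPONSE = range(4)


def build_packet(op, source_id, dest_id, address, payload=b""):
    """Assemble header + address + payload + checksum under an assumed field layout."""
    if not 0 <= op < 256:
        raise ValueError("operation code must fit in 8 bits")
    if len(payload) > MAX_PAYLOAD:
        raise ValueError("payload exceeds 64 bytes")
    if not (0 <= source_id < 1 << 20 and 0 <= dest_id < 1 << 20):
        raise ValueError("node IDs must fit in 20 bits")
    # Assumed packing of the 64-bit header: op (8 bits), payload length (8 bits),
    # source ID (20 bits), destination ID (20 bits), low 8 bits unused.
    header_word = (op << 56) | (len(payload) << 48) | (source_id << 28) | (dest_id << 8)
    body = struct.pack(">QI", header_word, address) + payload
    # zlib.crc32 is a stand-in 32-bit CRC, not necessarily ServerNet's polynomial.
    return body + struct.pack(">I", zlib.crc32(body))
```

A full 64-byte write request built this way comes to exactly 80 bytes, and a packet with no payload to 16 bytes, which matches the sizes quoted above.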

The router itself has first-in/first-out (FIFO) buffers for the input data, some arbitration and control logic, a RAM-based routing table, and a 6-by-6 crossbar switch that links all the data channels. ServerNet uses wormhole routing instead of a store-and-forward mechanism to further reduce network transfer latencies.

Wormhole routing is a technique in which the router begins forwarding an incoming packet to its destination before the entire packet is received. As the packet's first few header bytes arrive, the router's control logic extracts the destination ID and uses this value as an index into the routing table. The routing table returns the number of the output port that points to the packet's destination router. If that port is busy, the incoming bytes drop into the router's FIFOs. The control logic issues a flow-control command to throttle the sending router until the port becomes available.
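
A toy model may make the forwarding decision easier to follow. The sketch below is not Tandem's router logic; the Router class, its accept method, and the dictionary-based routing table are hypothetical names chosen for illustration. It shows only the decision made once the destination ID has arrived: stream the packet out of a free port, or buffer the bytes and throttle the sender.

```python
from collections import deque


class Router:
    """Toy model of one router's forwarding decision (not Tandem's actual logic)."""

    def __init__(self, routing_table, num_ports=6):
        self.routing_table = routing_table  # destination ID -> output port number
        self.busy = [False] * num_ports     # whether each output port is in use
        self.fifo = deque()                 # input buffer for stalled bytes
        self.throttled = False              # True once flow control is asserted

    def accept(self, dest_id, incoming_bytes):
        """Decide what to do as soon as the destination ID is known.

        Returns the output port if the bytes can stream straight through,
        or None if they were buffered and the sender was throttled.
        """
        port = self.routing_table[dest_id]
        if self.busy[port]:
            self.fifo.extend(incoming_bytes)  # park the bytes already received
            self.throttled = True             # ask the sending router to pause
            return None
        self.busy[port] = True                # claim the port for this packet
        return port
```

With two destinations mapped to the same output port, the first packet claims the port and streams through, while the second lands in the FIFO until the port is released.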

In addition, both the processor-interface and peripheral-device ASICs can "pull" (read) as well as "push" (write) data. This accommodates I/O devices with different speeds and buffer sizes by allowing them to pull data from memory as they need it. Pulling data also enables ServerNet to support many active I/O devices simultaneously without resorting to a multithreaded DMA engine.

Current ServerNet implementations use Tandem's NonStop-UX OS, based on Unix System V release 4.2 MP. In a ServerNet system that implements an SMP architecture, applications programs should run without modification. The OS code and drivers need some work to support the direct availability of all I/O devices to all compute nodes. For ServerNet systems based on an MPP architecture, developers must modify their applications programs to support message passing--a requirement of MPP, not ServerNet.

Thanks in large part to wormhole routing, the latency of a ServerNet router is only 300 nanoseconds per hop. The latency of a zero-length message (e.g., a read response) can be as low as 1 microsecond for a single-level router path. With a full 80-byte packet on a large ServerNet, the latency is 3 µs. The effective data transfer rate between routers is 40 MBps.
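
To see why wormhole routing matters, it helps to compare it with store-and-forward under a deliberately simple model. The sketch assumes that 40 MBps means 40 million bytes per second (25 ns per byte), that each router adds the quoted 300 ns, and that endpoint overhead is zero; the function names are illustrative and none of this is Tandem's own performance model.

```python
HOP_LATENCY_NS = 300  # per-router latency quoted in the article
NS_PER_BYTE = 25      # 40 MBps link, taking 1 MB as 10**6 bytes


def wormhole_ns(packet_bytes, hops):
    # The packet is serialized once; each router adds only its switching delay.
    return packet_bytes * NS_PER_BYTE + hops * HOP_LATENCY_NS


def store_and_forward_ns(packet_bytes, hops):
    # Each router waits for the whole packet before sending it on, so the
    # packet is serialized on the source link and again after every router.
    return (hops + 1) * packet_bytes * NS_PER_BYTE + hops * HOP_LATENCY_NS
```

For an 80-byte packet crossing three routers, this model gives 2,900 ns with wormhole routing and 8,900 ns with store-and-forward. The first figure is in the same range as the 3 µs quoted above, although that agreement depends on the assumptions stated here.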

But perhaps more significant is the aggregate transfer rate. Because the routers have multiple interconnections to processor nodes and devices, a ServerNet SAN can deliver spectacular performance. Tandem claims that a system with 4096 compute nodes and peripherals--and with 4680 to 7680 routers arranged in a fractahedral topology--can achieve an aggregate switching bandwidth of 410 GBps. By comparison, some supercomputers have a maximum switching bandwidth of only 1.2 to 2 GBps. (To be fair, these supercomputers achieve this rate with only 32 processors, not 4096.)

ServerNet's distributed architecture, though complex, offers a number of advantages. By providing alternate data paths, it dramatically reduces system bottlenecks and achieves a high switching bandwidth. Since the routers manage data transfers, they relieve the system processors of this job, which is especially significant for I/O. Perhaps most important, the ServerNet architecture offers scalability for different types of server architectures (see the figure "ServerNet's Flexible Design").

MPP systems can use ServerNet as a high-speed connection mesh between various compute nodes and peripherals. SMP servers can use ServerNet routers and peripheral-interface ASICs to provide scalable I/O (e.g., increasing the number of Ethernet connections or boosting disk storage) without degrading the performance of the OS or applications software. Because of its point-to-point network capabilities, ServerNet also allows you to expand server capacity by clustering.

Finally, you can use ServerNet to build a fault-tolerant server by connecting two duplicate systems through the dual ports of the processor-interface ASICs. The error-checking logic in the ServerNet ASICs monitors all transfers. If either the processor-interface ASICs or the routers detect an error, they can trigger a recovery protocol that disables the offending component.

Building Blocks

While Tandem takes a distributed interconnection approach among server components to boost server bandwidth, Sequent favors a modular architecture in which the server is built out of high-performance building blocks mortared together with high-speed communications links. Sequent refers to each block in this structure as a Quad, because each one consists of four Pentium Pro CPUs on a multiprocessor system bus. The Quad thus takes advantage of the Pentium Pro's built-in four-way multiprocessor bus, which handles bus arbitration and resource control (see "How to Make Pentium Pros Cooperate," April BYTE).

Each Quad has 512 MB to 4 GB of RAM, seven PCI slots, special communications logic, and 32 MB of cache RAM. The communications logic is the intelligent interconnection between the memory of two or more Quads.

By assembling multiple Quads, you can build a low-latency, scalable SMP server. Of course, if Sequent had stopped there, you'd still have the problem of bus saturation when you added more processors. Sequent's ingenious solution to this is to operate each Quad's multiprocessor bus independently, like the separate buses in an MPP system.

Unlike in an MPP system, however, the communications logic interconnects each Quad's RAM so that the distributed memory behaves like a global block of shared memory. Sequent refers to this communications logic as "IQ-Link" because it intelligently manages memory I/O. Only when a Quad accesses memory in another Quad does the transaction cross the interconnection bus. Because this bus handles only the occasional global memory access, it can manage a large-scale SMP system with more than eight CPUs.

Note that accesses to a Quad's local memory are fast (250 ns), while off-Quad accesses are slower (3 µs). Because different parts of the system have different memory latencies, this mechanism is known as nonuniform memory access, or NUMA. The IQ-Link transparently maintains data coherency among all the separate blocks of memory. It's similar to the way a processor's on-chip cache operates, which is why the IQ-Link's mechanism is sometimes described as a cache-coherent NUMA (CC NUMA). Sequent calls this NUMA with Quads, or NUMA-Q.
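
The effect of these two latencies on a running program can be estimated with a weighted average. The sketch below uses the 250-ns and 3-µs figures from the text; the function name and the idea of summarizing a workload by a single local-access fraction are simplifying assumptions, and the model ignores the processor caches entirely.

```python
LOCAL_NS = 250    # access to the Quad's own memory, from the article
REMOTE_NS = 3000  # off-Quad access (3 microseconds), from the article


def average_access_ns(local_fraction):
    """Mean memory latency when local_fraction of accesses stay on the Quad."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * LOCAL_NS + (1.0 - local_fraction) * REMOTE_NS
```

If 95 percent of accesses stay local, the average works out to roughly 387.5 ns, about one and a half times the purely local figure. At 50 percent it climbs to 1,625 ns, which is why keeping references on the Quad matters so much to this design.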

Home on the RAM

Programs tend to access memory in closely related groups of addresses due to tight code loops or sequential searches through data arrays. This behavior is known as locality of reference, and it's an important factor in the high performance of a Sequent NUMA-Q. It means that most references won't stray beyond a Quad's local memory and that those references that access global memory will return more quickly, thanks to the high-speed IQ-Link.

Part of the time, in fact, a Quad's Pentium Pro processor will find the data it needs in its primary (L1) or secondary (L2) caches. (Remember, the Pentium Pro has a 256- or 512-KB secondary cache closely coupled to the CPU in a multichip package.) If the CPU can't find what it needs in the L1 or L2 caches, the cache miss will most likely fall within the Quad's local memory. If not, the IQ-Link first searches the Quad's own 32-MB cache, which Sequent describes as an L3 cache. This is a directory-based cache that holds copies of data from other Quads, and its latency is the same as that of local memory.

Only when the CPU misses the L3 cache does the IQ-Link issue a request on the interconnect bus to access distant memory, as shown in the figure "Keeping CPUs Cache-Coherent". This bus is based on the IEEE 1596-1992 Scalable Coherent Interface (SCI) standard. It's a one-way, point-to-point loop that connects the Quads together in a daisy chain. It uses a packet-based protocol that supports cache-coherent distributed/shared memory. In this sense, the SCI bus also resembles a network.
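
The order in which a read is resolved can be summarized in a short sketch. This is a toy model under stated assumptions: the Quad class and its fields are hypothetical, caches are modeled as unbounded sets with no eviction, and coherency traffic between Quads is left out. It captures only the lookup sequence described above.

```python
class Quad:
    """Toy model of where a read is satisfied on one Quad (illustrative only)."""

    def __init__(self, quad_id, base, size):
        self.quad_id = quad_id
        self.base, self.size = base, size  # this Quad's slice of the global address space
        self.cpu_cache = set()             # addresses held in L1/L2
        self.l3 = set()                    # remote addresses cached by the IQ-Link

    def owns(self, addr):
        return self.base <= addr < self.base + self.size

    def read(self, addr):
        """Return a label for the level that satisfied the read."""
        if addr in self.cpu_cache:
            return "L1/L2"
        self.cpu_cache.add(addr)           # the CPU caches whatever it fetches
        if self.owns(addr):
            return "local memory"
        if addr in self.l3:
            return "L3"
        self.l3.add(addr)                  # the reply over SCI fills the L3 cache
        return "remote via SCI"
```

A first read of a remote address reports "remote via SCI". If the CPU caches later lose that line, the next read of the same address is satisfied from "L3" without another trip around the loop.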

When a Quad's IQ-Link logic receives a request packet over the SCI bus, it fetches the requested data from local memory. The data circles the SCI loop to the Quad that made the request and then arrives in that Quad's L3 cache. Because the address spaces of each Quad's local memory don't overlap, these transfers merely update the Quad's L3 caches; they don't explicitly copy the data into local memory. To maintain memory coherency, the IQ-Link relays any modifications, such as setting a semaphore, to the appropriate L3 caches of other Quads.

IQ-Link and the SCI bus thus segment a NUMA-Q system's Quads in a manner similar to the way in which a LAN segments a computer network. That is, programs get quick access to the most often-used data (because it's in a Quad's local memory or in the L3 cache), while accesses to less frequently used data need only go through the SCI "backbone." Because off-Quad references are infrequent, the SCI bus remains uncongested; thus, larger SMP systems can be built around a NUMA-Q design.

The data-pump ASICs for the SCI bus are made of gallium arsenide--a more exotic semiconductor than conventional silicon--and can transfer 1 GB of data per second. A Quad's multiprocessor bus can achieve a bandwidth of 500 MBps. Because the NUMA-Q architecture can manage up to 63 Quads (252 processors) using only one instance of the OS, the aggregate system bandwidth of a server can reach nearly 32 GBps.
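
The processor count and the aggregate figure can be checked with simple arithmetic. The sketch assumes that the aggregate is the sum of the per-Quad bus bandwidths and that 1 GB equals 1,000 MB; the article does not state how the figure was derived, and the constant names are illustrative.

```python
CPUS_PER_QUAD = 4     # four Pentium Pro CPUs per Quad
QUAD_BUS_MBPS = 500   # bandwidth of one Quad's multiprocessor bus
MAX_QUADS = 63        # largest configuration under one OS instance

total_cpus = MAX_QUADS * CPUS_PER_QUAD
aggregate_gbps = MAX_QUADS * QUAD_BUS_MBPS / 1000  # assumes 1 GB = 1,000 MB

print(total_cpus, aggregate_gbps)  # prints: 252 31.5
```

Under those assumptions the total is 31.5 GBps, consistent with the "nearly 32 GBps" quoted above.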

Not enough power? You can, of course, cluster several NUMA-Q nodes together in an external Ethernet, asynchronous transfer mode (ATM), or Fibre Channel network. In this arrangement, each SMP node runs its own copy of the OS and applications while sharing disks and communications peripherals with other nodes. Since each Quad has a PCI interface, you can distribute peripherals throughout the system to reduce competition for I/O devices. You have a choice of two OSes: DYNIX/ptx (Sequent's SMP-enhanced Unix) or Windows NT.

In a sense, NUMA-Q offers the best features of SMP and MPP architectures. To software, it looks like an SMP system, so existing applications can take advantage of the extra processing power and larger memory space without modification. Threaded OSes can readily distribute tasks among the various processors for load balancing. Yet you get the scalable I/O of an MPP system, because each Quad has its own I/O ports, and the IQ-Link minimizes systemwide bus traffic while maintaining memory coherency.

The Future Is Networks

If ServerNet and NUMA-Q are any indication of things to come, tomorrow's servers will differ greatly from conventional desktop computers. Their architectures will certainly be much more complex. Instead of using just a couple of buses for memory and device I/O, they will resemble a mesh of interconnected components. However, this complexity will pay off in performance high enough to tackle the demands of tomorrow's server applications.

In addition, the servers of tomorrow will be more scalable and custom-tailored for the particular job at hand. With technologies such as ServerNet and NUMA-Q, you can assemble as many elements as you need for a particular job's processing requirements. As your needs grow, you can add more processing power or storage capacity.


Where to Find


Sequent Computer Systems

Beaverton, OR
Phone:    (800) 257-9044 or (503) 626-5700
Internet: http://www.sequent.com/


Tandem Computers

Cupertino, CA
Phone:    (408) 285-6000
E-Mail:   info@tandem.com
Internet: http://www.tandem.com/




Figure: The ServerNet Architecture

Figure: ServerNet's Flexible Design

Figure: Keeping CPUs Cache-Coherent


Tom Thompson is a BYTE senior technical editor at large. He has a B.S.E.E. degree from the University of Memphis and is author of the book PowerPC Programming Kit (Hayden Books, 1996). You can reach him by sending e-mail to tom_thompson@bix.com.

Copyright © 2005 CMP Media LLC