confront what will be the new market reality. Phase one of Wolfpack's release is scheduled for this month. It will support two-server clusters. The second phase will follow in 1998 and enable clustering more than two servers.
This report is based on tests by both BYTE and NSTL of the second beta release of Wolfpack. In addition, we look at some important issues surrounding clustering technology, many of which involve limitations that have been ignored or glossed over by vendors. Finally, we take a quick survey of the existing products in the market, with a table summarizing their features and a sidebar describing their plans and positions vis-à-vis Wolfpack. (Early on, we planned to conduct a comparative look at cluster solutions, but because no common hardware configuration has been feasible, we couldn't conduct BYTE's usual apples-to-apples performance comparisons.) To help you better understand Wolfpack's capabilities and limitations, we'll quickly review the basics of clustering.
Why Cluster?
The whole point of clustering is to maintain "high availability" of computing resources to end users. To do this involves three essential functions: fault tolerance (called failover), load balancing, and centralized administration and monitoring. Fault tolerance ensures a backup to replace a failed resource (e.g., server, router, or network). Load balancing detects when processing overloads one resource to the point that it's virtually unavailable and distributes the load among less-burdened resources. Central management of clustered servers lets administrators monitor and control the cluster from a single console, both to troubleshoot failures and shift resources for routine maintenance.
Unfortunately, most clustering products, including Wolfpack, provide only automatic failover and management. Load balancing is a manual operation, though some third-party systems may provide additional software components or add-on products to help with this.
The heart of any clustering implementation is redundancy. Have two or more of everything, so that if any single resource on the network fails--whether it be a server, server network adapter, disk drive, application, router, or segment--the system will automatically detect this and swap in a standby component. Wolfpack knows about the following NT resource types: Fault-Tolerant Disk Set, File Share, Generic Application, Generic Service, Internet Information Server (IIS) Virtual Root, IP Address, Network Name, Physical Disk, Print Spooler, and Time Service.
While it's clearly possible to set up a cluster with an extra server standing by, connected to the network but idle, waiting to take over if it's needed, this configuration (called active/passive or asymmetric) is hardly cost-efficient and rarely justifiable. Instead, the usual practice is to have each server active, doing useful work but ready to take over the other's processing if it should fail. In addition to the servers' LAN connections, a second private connection, called the interconnect, is usually established so the two servers can monitor each other.
Achieving fault tolerance in a client/server information technology (IT) environment means addressing a number of hardware and software issues: continuing electrical power, multiple servers, redundant data storage, backup network links, and failover management software.
Power to the Process.
All hardware required for continual services must be connected to an uninterruptible power supply (UPS) that allows time to switch to a backup generator or, if necessary, to conduct a fast but orderly shutdown.
Many Machines.
You can reduce the possibility of downtime simply by dividing tasks up. A Web server on one machine and an e-mail server on another means that one server going down won't cause both applications to fail.
Share the Storage.
Disk mirroring or replication techniques between servers ensure that data--and possibly applications--will be available should a disk drive or server fail. Right now, SCSI is the gold standard for shared-disk technologies, but it has limits (see the sidebar). One of them is that the distance between clustered servers is limited to only 25 meters. Also, non-SCSI failover systems can make the server cluster vulnerable to network partitioning. In the future, technologies such as Fibre Channel, Serial Storage Architecture (SSA), or I2O may provide dedicated disk sharing over longer distances.
The Dept. of Redundancy Dept.
Adding an additional connection between servers helps reduce the possibility of communications failure over the network.
Manage the Monster.
Failover management software offers a way to detect hardware and software failures and invoke backup, standby, or takeover technologies. Failure-detection parameters require some fine-tuning by the administrator. A too-sensitive failure test will cause needless switch-overs, but a test that's not sensitive enough risks the loss of services. A redundant dedicated interconnect between servers makes for more reliable failure detection. NSTL technicians had difficulties with NT's deadly "Blue Screen" after trying to uninstall some clustering packages. Thus, it's prudent to make an emergency repair disk prior to installation.
Simple stateless Web services are fairly straightforward to migrate, but stateful applications (e.g., database applications) are more difficult and may require special add-on kits. For greatest flexibility, failover software should offer an API to let in-house programmers add failover code to custom and homegrown applications.
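The sensitivity trade-off in failure detection can be made concrete with a small sketch. It is not taken from Wolfpack or any product tested here: the function name, the "consecutive missed probes" rule, and the sample probe history are all assumptions chosen for illustration.

```python
def count_switchovers(probe_ok, misses_needed):
    """Count how many failovers a probe history would trigger.

    probe_ok      -- list of booleans, one per probe (True = node answered)
    misses_needed -- consecutive failed probes required before the node is
                     declared failed (a hypothetical tuning knob, not a
                     documented Wolfpack parameter)
    """
    switchovers = 0
    consecutive_misses = 0
    for ok in probe_ok:
        if ok:
            consecutive_misses = 0
            continue
        consecutive_misses += 1
        if consecutive_misses == misses_needed:
            switchovers += 1  # fire once per outage, not once per probe
    return switchovers


# Made-up history: one transient blip, then a real outage lasting four probes.
history = [True, False, True, True, False, False, False, False, True]

sensitive = count_switchovers(history, misses_needed=1)  # reacts to the blip too
tolerant = count_switchovers(history, misses_needed=3)   # reacts to the outage only
too_lax = count_switchovers(history, misses_needed=5)    # misses the outage
```

With this history, the most sensitive setting switches over twice (once needlessly), the middle setting switches over once, and the least sensitive setting never reacts to the real outage.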
What Wolfpack Does
To create a Wolfpack-based cluster, you need two (no more, no less) NT 4.0 servers (with Service Pack 3 installed) that share a SCSI bus supporting an external disk-storage subsystem (see the figure). Both servers must be members of the same NT domain, and each must have its own system disk on a local, unshared bus.
Wolfpack enables the two servers to exchange their status, resources being run, and activity with each other. Two components of the clustering software are the Cluster Service and the Resource Monitor. The Cluster Service, which runs on every clustered server, controls cluster activity, communication between servers, and failure operations. The Resource Monitor checks the assigned states of targeted resources (i.e., off-line, off-line pending, on-line, on-line pending, or failed) and reports any state changes to the Cluster Service. Each server can run one or more Resource Monitors.
The primary monitoring communication between Wolfpack nodes is called heartbeat synchronization. Basically, each node is always checking whether the other is still there and ticking. If a node's Resource Monitor determines that the other node has disappeared, the Cluster Service executes the predefined failover instructions. Because there is a separate Cluster Service and one or more Resource Monitors on each node, this cluster communication takes the form of interprocess communications (IPC) and requires little network overhead. This traffic is small enough that it can be run over a private Ethernet LAN (usually called an Interconnect), a public LAN, a serial connection, or even the SCSI bus, though the last one isn't recommended.
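The idea behind heartbeat monitoring can be sketched in a few lines. This is an illustration of the general technique, not Wolfpack's implementation: the class name, the time-window rule, and the five-second window are assumptions, and timestamps are passed in explicitly so the example is deterministic.

```python
class HeartbeatWatcher:
    """Tracks the last heartbeat received from a peer node (illustrative only)."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s  # hypothetical window, in seconds
        self.last_seen = None       # time of the most recent heartbeat

    def record_heartbeat(self, now):
        """Call whenever a heartbeat arrives from the peer."""
        self.last_seen = now

    def peer_alive(self, now):
        """The peer counts as alive if a heartbeat arrived within the window."""
        if self.last_seen is None:
            return False
        return (now - self.last_seen) <= self.timeout_s


watcher = HeartbeatWatcher(timeout_s=5.0)
watcher.record_heartbeat(now=100.0)

alive_soon_after = watcher.peer_alive(now=103.0)  # 3 s since last beat
alive_much_later = watcher.peer_alive(now=110.0)  # 10 s since last beat
```

In a real cluster, a False result would be the cue to run the predefined failover instructions. Here it is simply a value the caller can act on.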
The administrator can specify two polling intervals and a time-out value for resources. The polling intervals affect how often the Resource Monitor does its checks. There are two levels of polling, known in Wolfpack jargon as Looks Alive and Is Alive. In Looks Alive polling, Wolfpack performs a cursory check to determine if the resource is available and running. Is Alive polling is more thorough, with Wolfpack determining if the resource is fully operational. The time-out value specifies how long the Resource Monitor should wait for a response before it considers the resource failed.
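How two polling intervals and a time-out interact can be sketched as follows. The interval values, the function names, and the rule that a thorough check replaces the cursory one when both fall due are assumptions made for this example, not documented Wolfpack behavior.

```python
def poll_schedule(looks_alive_every_s, is_alive_every_s, duration_s):
    """Return (second, check_name) pairs for one resource over duration_s seconds."""
    schedule = []
    for t in range(1, duration_s + 1):
        if t % is_alive_every_s == 0:
            # Assumption: the thorough check supersedes the cursory one.
            schedule.append((t, "is_alive"))
        elif t % looks_alive_every_s == 0:
            schedule.append((t, "looks_alive"))
    return schedule


def resource_failed(response_time_s, timeout_s):
    """A resource counts as failed if it never answered (None) or answered too late."""
    return response_time_s is None or response_time_s > timeout_s


# Hypothetical settings: cursory check every 5 s, thorough check every 15 s.
schedule = poll_schedule(looks_alive_every_s=5, is_alive_every_s=15, duration_s=30)
```

Over 30 seconds, these settings yield four cursory checks (at 5, 10, 20, and 25 seconds) and two thorough ones (at 15 and 30 seconds). Shorter intervals catch failures sooner at the cost of more checking.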
Planning to Fail
The most significant advantage Wolfpack offers over current clustering solutions is its tight integration with NT. For example, Wolfpack lets you group NT resources with applications into failover groups. When a single resource fails, Wolfpack fails over the entire group to which the failing resource belongs. This provides a handy means of creating failover dependencies and ensures that a failed service will have the appropriate resources it needs to restart. Some systems require involved scripts to accomplish what Wolfpack allows via prompted dialog boxes and mouse-clicks.
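The group-level behavior described above can be modeled in a short sketch. The resource type names come from Wolfpack's list; the group names, node names, and the code itself are hypothetical and stand in for what Wolfpack configures through dialog boxes.

```python
class FailoverGroup:
    """A named set of resources that always moves between nodes together."""

    def __init__(self, name, resources, owner):
        self.name = name
        self.resources = list(resources)
        self.owner = owner  # node currently running the group


def handle_resource_failure(groups, failed_resource, nodes):
    """Move every group containing failed_resource to the other node.

    Assumes a two-node cluster, as in Wolfpack's first phase.
    """
    moved = []
    for group in groups:
        if failed_resource in group.resources:
            other_node = [n for n in nodes if n != group.owner][0]
            group.owner = other_node
            moved.append(group.name)
    return moved


nodes = ["NODE_A", "NODE_B"]  # hypothetical node names
web = FailoverGroup("web", ["IP Address", "Network Name", "IIS Virtual Root"], owner="NODE_A")
files = FailoverGroup("files", ["Physical Disk", "File Share"], owner="NODE_A")

moved = handle_resource_failure([web, files], "IIS Virtual Root", nodes)
```

After the IIS Virtual Root fails, the whole "web" group, including its IP address and network name, lands on NODE_B, while the unrelated "files" group stays put.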
Automatic failover isn't always possible, unfortunately. Some applications can run on only one node on the cluster and in case of failover would have to be manually started on the other node. Some applications (e.g., IIS, FTP) can be managed and configured to automatically start on the other node in the event of a failover.
When one server fails, Wolfpack migrates its functions and resources to the alternate server, which gives the IT staff time to troubleshoot and fix the problem. But how do you restore resources to the original, failed-but-fixed server (a process called failback)? Can you, and should you, automate it? It might seem that automatic failback is the best solution, but only if the problem is really fixed and unlikely to recur. If not, automatic failback can cause subsequently failed resources to bounce back and forth between servers, causing problems for users. Restricting failback to a deliberate manual action by IT personnel can eliminate this ping-pong effect.
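The failback decision can be written down as a small policy function. This is a sketch of the reasoning only: the policy labels, the "recent failures" counter, and its default limit are assumptions, not Wolfpack settings.

```python
def should_fail_back(policy, repair_confirmed, recent_failures, max_recent_failures=1):
    """Decide whether software should return resources to the original server.

    policy              -- "manual" or "automatic" (labels used in this sketch)
    repair_confirmed    -- True once staff believe the fault is really fixed
    recent_failures     -- failures seen on the original server lately
    max_recent_failures -- hypothetical limit that guards against ping-ponging
    """
    if policy not in ("manual", "automatic"):
        raise ValueError("unknown failback policy: %r" % (policy,))
    if policy == "manual":
        return False  # never automatic; an administrator moves resources back
    if not repair_confirmed:
        return False  # a server that isn't fixed would just fail again
    return recent_failures <= max_recent_failures


manual_case = should_fail_back("manual", repair_confirmed=True, recent_failures=0)
clean_case = should_fail_back("automatic", repair_confirmed=True, recent_failures=0)
flapping_case = should_fail_back("automatic", repair_confirmed=True, recent_failures=3)
unfixed_case = should_fail_back("automatic", repair_confirmed=False, recent_failures=0)
```

Only the clean automatic case returns True. A server that has failed repeatedly, or has not been confirmed as repaired, keeps its resources on the surviving node.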
Cluster Management
In an ordinary server environment, users employ a number of administrative tools to identify the servers and monitor their contents and activities. Wolfpack uses a single program, the Cluster Administrator, to centralize control over applications and services. You can run it as a client from any NT workstation attached to the cluster. All cluster resources appear as hierarchically organized objects that you can assign and configure with relative ease.
Cluster Administrator manages services, file shares, and directory replication. It allows reviewing the activities and failures of the computers in each cluster to determine which nodes are currently running applications and services. Color denotes resource ownership--that is, the colors change when a failover occurs, an instant notification that also tells you which server owns what resources. Cluster Administrator lets you specify the applications and related components that run on the servers and establish policies that monitor availability and recovery failure detection. Manually taking individual nodes off-line for maintenance involves only a right mouse-click to fail services and resources over to the other server.
While failover and failback are handled well, load balancing is still a problem under Wolfpack. It's neither automatic nor dynamic; in fact, it's completely a manual process. Therefore, you need to carefully monitor cluster loads, because it's possible for one node on the cluster to be serving 200 users and the other node handling only a few clients. And, unfortunately, there may be nothing you can do to fix it.
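Because load monitoring is left to the administrator, a simple imbalance measure can help flag a lopsided cluster. The sketch below is not a Wolfpack feature: the metric, the 0.5 alert threshold, the node names, and the figure of 5 users standing in for "only a few clients" are all assumptions.

```python
def load_imbalance(users_per_node):
    """Return the gap between busiest and quietest node as a share of all users.

    0.0 means perfectly even; values near 1.0 mean one node carries almost everything.
    """
    total = sum(users_per_node.values())
    if total == 0:
        return 0.0
    busiest = max(users_per_node.values())
    quietest = min(users_per_node.values())
    return (busiest - quietest) / total


ALERT_THRESHOLD = 0.5  # hypothetical cut-off for "worth a look"

imbalance = load_imbalance({"NODE_A": 200, "NODE_B": 5})
needs_attention = imbalance > ALERT_THRESHOLD
```

A check like this only reports the skew. Under Wolfpack, correcting it still means moving resources by hand.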
At BYTE, we installed Wolfpack on two Digital Equipment servers (200- and 166-MHz Pentium systems) sharing a single external SCSI cabinet with two 2-GB hard drives. Setup was quick and easy. The first node creates the cluster--cluster name, IP address, alias information, groups, etc. Once the second node had joined this existing cluster, we could assign resources and define failover procedures.
We tested manual failover (of IIS server, SQL server, and disk resources) by moving resources back and forth using Cluster Administrator. We shut down one node to test automatic failover. In all cases, recovery seemed nearly instantaneous. Cluster Administrator was also smart enough to prevent us from assigning new resources to the now-missing node.
Pick the Pack?
The reality of clustering for NT, right now, is that neither Wolfpack nor any of the available clustering products for NT fully implements all the functions and concepts that BYTE believes constitute true clustering. Available products provide add-on kits to support a short list of programs, mostly databases. Wolfpack adds much of the required functionality directly into the OS and provides common APIs for custom solutions. But if you need to cluster more than two servers, you probably can't wait until Wolfpack grows up some more. Thus, one of the other products, including some non-NT clustering solutions, may be a better choice. Still, there seems little doubt that Microsoft will soon be the leader of the pack.
Where to Find
Digital Clusters for Windows NT: $995
Digital Equipment Corp.
Maynard, MA
Phone: 800-344-4825
Internet: http://www.digital.com/
Enter 1013 on Inquiry Card.