Proper scheduling makes a world of difference
David Yavin
Lotus promises that Notes will change the way companies do business, and judging by the product's explosive market growth, that promise is being taken very seriously indeed. But while Notes users enjoy increasingly pleasant user interfaces and sophisticated applications, network managers are discovering--the hard way--that running Notes efficiently is not always a simple task. In this discussion, I'll focus on an often-underrated cause of suboptimal performance: ineffective replication scheduling.
As a pioneer in distributed database technology, Notes may be the first and most visible product to suffer from large-scale scheduling woes. But with the advent of Novell's NetWare Directory Services and replication services for database servers from Oracle and Sybase, the issue of replication scheduling is moving to the forefront. Buyers and vendors alike should raise their level of awareness about this potentially critical issue.
From Concept to Reality
Today's global organizations need to provide their geographically distributed workgroups with the ability to access and manipulate shared information. But because of technical limitations and telecommunications costs, reliance on remote access to centrally located data is often not feasible.
Lotus Notes provides a conceptually simple and elegant solution to this problem: Identical copies of the shared Notes databases are distributed so that users do the bulk of their work on their local copies. What makes this solution work is the concept of replication, a process in which a pair of servers (or a server and a workstation) communicate to synchronize their respective copies of shared databases.
The concept is a simple one, but the road to effective implementation can be rocky. Notes is not immune to the usual WAN (wide-area network) ailments: international incompatibility of modems, unanticipated network congestion, and insufficient bandwidth. Fortunately, network managers know how to deal with these problems. But they've yet to acquire much experience with another problem that's critical to the health of a Notes network--replication scheduling.
Data propagation between servers is governed primarily by a fixed, network-wide replication schedule. That schedule, and the logical topology it implies, dictates how updates flow through the system. Organizations often underestimate the impact that the topology and replication schedule can have on the performance of Notes as an enterprise-wide application.
A well-planned flow, tailored to your specific infrastructure and usage patterns, avoids uneven server workloads and excessive network congestion. Topology and a replication schedule can be decisive factors in determining whether your Notes network can propagate shared information quickly enough to support the business processes that you want to automate. What follows is a scenario involving two organizations and their respective approaches to their replication problems.
Tales from the Trenches
Organization 1 is a round-the-clock operation. Its 75 offices, which are scattered around the world, are interconnected by the company's private network. Each office runs a Notes server. Rapid distribution of new information has always been crucial to many of the organization's business processes.
Organization 2 is a conglomerate of 15 loosely affiliated franchises located throughout Europe and the U.S. The franchises communicate over modems and rely on large volumes of shared (and frequently updated) information. Over time, several hundred Notes databases have evolved in this organization. Many of these databases are distributed to most or all of the servers, and some contain very large, frequently updated documents.
Organization 1 wanted a replication scheme that would propagate any update as soon as possible. Since replication between two servers transfers only those changes that have occurred since the two servers last replicated, and since communications costs on the private network were not an issue, organization officials figured they could replicate as frequently as they wished. They decided to use a hub-and-spoke topology. A central hub would initiate replications with all other servers every 45 minutes.
Organization 2 set up continental hubs, each one replicating with all the servers in its continent; the two continental hubs also replicated with each other. Replication scheduling was done more or less on an ad hoc basis, with no real master plan and no central control. The organization ended up with one daily replication scheduled between the hubs and some spokes, and two or three daily replications between the hubs and some other spokes. Within each pair of replicating servers, the one that initiated the replication was the one located in the office responsible for the phone bills.
Problems and Solutions
There was one basic problem with Organization 1's reasoning: the overhead involved in Notes replication. Replications--even empty replications that pull no data--are far from free. Both servers must expend a significant amount of time and resources just to identify whether any updates need to be pulled from the other server. It can easily take 30 seconds (and sometimes much longer) just to determine that nothing needs to be done, a burden when this process must occur about 100 times per hour.
As a result, the hub was frequently failing to complete a calling cycle before it had to start a new round all over again, and some unfortunate spokes were consistently dropping off the end of the list and not getting replications. Moreover, all that replication activity put a significant strain on the hub and caused serious congestion.
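A back-of-the-envelope check makes the squeeze concrete. The sketch below is only an illustration under stated assumptions: the hub calls its spokes one after another, the hub is one of the 75 servers (leaving 74 spokes), and every replication is empty and costs the 30 seconds cited above. The function name is hypothetical.

```python
def empty_cycle_minutes(spokes: int, seconds_per_replication: float) -> float:
    """Minutes the hub needs to call every spoke once, back to back."""
    return spokes * seconds_per_replication / 60


SPOKES = 74            # assumption: 75 servers, one of which is the hub
INTERVAL_MINUTES = 45  # Organization 1's replication interval

cycle = empty_cycle_minutes(SPOKES, 30)
print(f"empty cycle: {cycle:.0f} min of a {INTERVAL_MINUTES}-min interval")
print(f"slack left for real data: {INTERVAL_MINUTES - cycle:.0f} min")
```

Under these assumptions, empty replications alone consume 37 of every 45 minutes, so a handful of replications that carry data or run long is enough to push the calling cycle past its interval.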
Organization 2's replications were far from empty. Because of the large volume of data, the slow lines, and the replication overhead involved, a typical replication would last anywhere from 5 minutes to an hour, and sometimes longer. Insufficient awareness of the duration of replications throughout the system had caused many replications to be scheduled too close together; consequently, they were often lost or significantly delayed. Moreover, with replications being initiated by servers in five different time zones according to calling schedules that were not centrally coordinated--partly because Notes provides no tools for managing cross-time-zone scheduling--there were many scheduling conflicts; again, this resulted in lost or delayed replications.
Clearly, Organization 1 needed to cut back on the number of replications being made and take some of the load off the main hub. A comprehensive per-server and per-database analysis of replication showed that the highly active databases were not replicating to all the servers. The frequency of replications was therefore cut back wherever possible, and another hub was brought in to share the effort with the original hub.
Organization 2 needed a new, centrally planned schedule. A comprehensive analysis of the duration of replications with the various spokes helped indicate how large a window needed to be reserved for each replication to avoid conflicts. It also showed that the European hub could not efficiently handle replication with all its spokes, so regional hubs were introduced to help distribute updates more efficiently. The system was carefully mapped out, and, taking into consideration such issues as time-zone differences and usage patterns, a new, more efficient, and more reliable schedule was designed.
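The idea of reserving a window per replication can be sketched in a few lines. This is a minimal illustration, not a description of any Notes tool: the spoke names, the measured durations, and the 25 percent safety margin are all made up, and expressing every time in UTC is simply one way to sidestep time-zone confusion.

```python
from datetime import datetime, timedelta, timezone


def plan_windows(worst_minutes, start_utc, margin=0.25):
    """Lay out back-to-back, non-overlapping windows for one hub.

    worst_minutes maps each spoke to its longest observed replication,
    in minutes; each window is that duration plus a safety margin.
    """
    plan = []
    cursor = start_utc
    for spoke, worst in worst_minutes.items():
        length = timedelta(minutes=worst * (1 + margin))
        plan.append((spoke, cursor, cursor + length))
        cursor += length
    return plan


# Hypothetical measurements for three spokes of one hub.
observed = {"spoke-1": 60, "spoke-2": 20, "spoke-3": 8}
start = datetime(2000, 1, 1, 0, 0, tzinfo=timezone.utc)

for spoke, begin, end in plan_windows(observed, start):
    print(spoke, begin.strftime("%H:%M"), "to", end.strftime("%H:%M"), "UTC")
```

If the total of the windows exceeds the time available between rounds, that is a sign the hub is overloaded and that an additional hub, as in the European case, may be needed.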
From Reliable Replication to Optimized Propagation
Successful tuning of a broken replication system is worth a pat on the back. But ensuring that updates get where they're needed is not enough. Updates must also arrive when they're needed. Again, the right topology and replication schedule are the key factors.
It's important to realize that design and maintenance of your replication scheme is neither a side issue nor a one-time thing. It is no less important than server administration, network management, and applications development.
While you may need to seek outside help, especially for major overhauls, you should aspire to breed (or acquire) in-house expertise so that you can stay on top of your evolving Notes network. In addition, users of Notes and other distributed-data products have a right to expect much more powerful tools for analyzing and modeling data flows than are now available.
If your Notes network seems to be dragging its feet, it may be wearing the wrong topology/scheduling shoes. Custom-fit it with as good a pair as you can, and your organization and its Notes network just might start dancing to a smoother tune.
Illustration: The Impact of Topology and Scheduling -- A Simple Example
Given hourly replications that never last more than 1 hour, how long does it take to propagate an update from one server to the others? With a hub-and-spoke topology (left), the best case is between 2 and 3 hours, and the worst case is between 5 and 6 hours, yielding an average propagation time of 4 hours. But a square topology (right) is the optimal solution for four servers: The best case is between 1 and 2 hours, and the worst case is between 2 and 3 hours, for an average time of just 2 hours.
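The effect of topology can be explored with a small simulation. The sketch below is a deliberately simplified model and does not attempt to reproduce the figure's exact numbers: it counts whole one-hour slots, assumes an update enters just before a slot begins, and assumes particular calling orders (a hub that calls one spoke per slot, and a square in which opposite sides replicate at the same time). All names are hypothetical.

```python
def slots_to_reach_all(schedule, origin, start):
    """Count the slots until every server holds an update made at origin.

    schedule is a repeating list of slots; each slot is a list of server
    pairs that replicate with each other during that slot.
    """
    servers = {s for slot in schedule for pair in slot for s in pair}
    have = {origin}
    elapsed = 0
    limit = len(schedule) * len(servers)
    while have != servers:
        if elapsed >= limit:
            raise ValueError("schedule never reaches every server")
        slot = schedule[(start + elapsed) % len(schedule)]
        # A pair shares the update if either member held it when the slot began.
        reached = {s for pair in slot if have & set(pair) for s in pair}
        have |= reached
        elapsed += 1
    return elapsed


def best_and_worst(schedule):
    """Best and worst case over every origin server and starting slot."""
    servers = sorted({s for slot in schedule for pair in slot for s in pair})
    times = [slots_to_reach_all(schedule, origin, start)
             for origin in servers for start in range(len(schedule))]
    return min(times), max(times)


# Assumed schedules for four servers.
HUB_AND_SPOKE = [[("H", "A")], [("H", "B")], [("H", "C")]]
SQUARE = [[("A", "B"), ("C", "D")], [("B", "C"), ("D", "A")]]

print("hub-and-spoke:", best_and_worst(HUB_AND_SPOKE))
print("square:", best_and_worst(SQUARE))
```

In this model the hub-and-spoke schedule needs between 3 and 5 slots to reach every server, while the square always needs 2, which points in the same direction as the illustration: with the same hourly rhythm, the square delivers updates sooner and more predictably.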
David Yavin holds a Ph.D. in mathematics from MIT in the fields of topology and combinatorics. Formerly a research fellow at the Max-Planck Institute of Mathematics in Bonn, Germany, he is president of DYS Analytics (Newton, MA) and has worked as a consultant to some of the world's largest Notes users on issues of topology and replication scheduling. He can be reached on the Internet as david@math.mit.edu or on BIX c/o ``editors.''