sts, running such applications without any fault-tolerance support can be risky -- even dangerous.
It is possible, however, to use software to implement a certain level of fault tolerance for desktop PCs. A special library that we've written, called WinFT, provides fault-tolerance support for Win32 applications. WinFT performs automatic detection and restarting of failed processes; diagnosis and rebooting of a malfunctioning or hung OS; checkpointing and recovery of critical volatile data; and preventive actions, such as software rejuvenation (i.e., restarting an application or the OS to get a clean internal state).
WinFT Parts
WinFT is a library of functions that provides fault-tolerance support for Windows applications that must run nonstop or for long periods of time, such as database servers and control systems. WinFT was implemented as a set of objects developed with version 5.0 of the Borland C++ compiler. It uses the Win 95 subset of the Win32 API and is available as both a static library and a dynamic link library (DLL). The modules and related functions that make up the WinFT library are shown in the table "WinFT Functions".
The checkpointing modules set up and keep track of the critical data structures that the programmer declares, and they manage the task of saving this data to disk. In addition, the modules recover the data when the program restarts after a crash.
The watchdog modules set up processes that monitor the activity of other processes to check for execution problems. A mission-critical application uses WinFT's message-passing functions to indirectly signal the watchdog process if it's caught in a loop or stalled waiting on an OS call. If you don't want to bother with the task of writing exception-handling code within your applications, the WinFT exception-handling module provides a simple API for your use.
WinFT Setup
A TChkp object, which manages data checkpointing, is used throughout an application. You create this object with the TChkp() function. The function's first parameter lets you specify a directory path and the name of the checkpoint file that stores the critical data. If you want to save a copy of the checkpoint file on a remote system, you specify this path in TChkp()'s second parameter. This lets you restart the application from a backup PC in case of a serious crash.
The three checkpointing methods Critical(), CheckPoint(), and Recover() follow a few rules that are intuitive but worth stating clearly. The first step is to declare the critical data structures using Critical(). You invoke this method as many times as necessary, each time providing a pointer to a critical data structure and its size in bytes.
You call CheckPoint() to save all critical data structures on disk and use Recover() to refill these data structures using data from the file. After an application starts and all the critical structures are declared, you must determine whether the application should recover data from the disk, which is the case when it has been restarted because of an error. A GetStartMode() method, described later, fetches this information from a TWDClient object that's managed by a watchdog process.
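The declare, save, and refill cycle described above can be sketched in portable C++. This is a minimal illustration, not WinFT's code: the class name SketchCheckpoint, its constructor signature, and the flat binary file layout are all assumptions, and only the method names Critical(), CheckPoint(), and Recover() are taken from the text.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a TChkp-style object. The real WinFT
// signatures and file format are not given in the article.
class SketchCheckpoint {
public:
    // path: checkpoint file; backupPath: optional second copy (assumed
    // to be reachable as an ordinary file path, e.g. a network share).
    explicit SketchCheckpoint(std::string path, std::string backupPath = "")
        : path_(std::move(path)), backupPath_(std::move(backupPath)) {}

    // Declare one critical region: a pointer to the data and its size in bytes.
    void Critical(void* data, std::size_t size) {
        assert(data != nullptr && size > 0);
        regions_.push_back({data, size});
    }

    // Save every declared region, in declaration order, to the checkpoint
    // file and to the backup file if one was given.
    bool CheckPoint() const {
        if (!writeTo(path_)) return false;
        return backupPath_.empty() || writeTo(backupPath_);
    }

    // Refill every declared region from the checkpoint file.
    // Regions must be declared in the same order as when the file was saved.
    bool Recover() const {
        std::FILE* f = std::fopen(path_.c_str(), "rb");
        if (!f) return false;
        bool ok = true;
        for (const Region& r : regions_) {
            if (std::fread(r.data, 1, r.size, f) != r.size) { ok = false; break; }
        }
        std::fclose(f);
        return ok;
    }

private:
    struct Region { void* data; std::size_t size; };

    bool writeTo(const std::string& file) const {
        std::FILE* f = std::fopen(file.c_str(), "wb");
        if (!f) return false;
        bool ok = true;
        for (const Region& r : regions_) {
            if (std::fwrite(r.data, 1, r.size, f) != r.size) { ok = false; break; }
        }
        return std::fclose(f) == 0 && ok;  // fclose always runs; both must succeed
    }

    std::string path_;
    std::string backupPath_;
    std::vector<Region> regions_;
};
```

A caller would declare each structure once at startup, call CheckPoint() at safe points, and call Recover() only when the start mode indicates a restart. The sketch overwrites the file in place, so a crash during CheckPoint() could leave a partial file; a production version would likely write to a temporary file and rename it.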
Let Loose the Watchdog
The watchdog can be a separate process, with a graphical interface that lets you set up all the parameters interactively, or a hidden process launched through InitWatchD(). InitWatchD() creates the TWDClient object and a daemon-style process that monitors the health of a specific application. When the application is first launched, InitWatchD() checks to see if it was launched directly by the user. If so, InitWatchD() never returns, but instead launches another instance of the application.
The parent process then turns itself into a watchdog process that monitors the newly launched application. When the second instance of the application calls InitWatchD(), it detects that the application was launched by the watchdog. In this case, InitWatchD() returns, and the real application code executes.
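The article does not say how InitWatchD() tells a user launch from a watchdog launch. One common way to get this behavior is a command-line marker, sketched below. The flag name, the LaunchRole type, and both helper functions are hypothetical and are not part of WinFT.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Which role this process instance should play.
enum class LaunchRole { Watchdog, Application };

// Assumed marker that the watchdog appends when it relaunches the program.
const char* const kChildFlag = "--launched-by-watchdog";

// A process started without the marker was launched by the user, so it
// becomes the watchdog; one started with the marker runs the application.
LaunchRole DetectRole(int argc, const char* const argv[]) {
    assert(argc >= 0);
    for (int i = 1; i < argc; ++i) {
        if (std::strcmp(argv[i], kChildFlag) == 0) return LaunchRole::Application;
    }
    return LaunchRole::Watchdog;
}

// Command line the watchdog would hand to the OS to start the second
// instance (quoted so that paths containing spaces survive).
std::string ChildCommandLine(const std::string& exePath) {
    return "\"" + exePath + "\" " + kChildFlag;
}
```

In a real Win32 program the watchdog branch would pass the string from ChildCommandLine() to a process-creation call and then enter its monitoring loop instead of returning to the caller.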
Within the application, you use WinFT's watchdog methods ImAlive(), Error(), and Idle() to send messages to the watchdog process. These messages can be "I'm Alives" (also called heartbeats), error messages, or idle notifications, and the watchdog daemon handles them in different ways. Periodic heartbeat messages tell the watchdog that the application is active. Error notifications cover those situations where the application is still active but is detecting errors and having problems getting a job done.
Finally, if the application is idle -- due to user inactivity or the absence of client requests (if it's a server application) -- it should send idle messages to the watchdog so that the watchdog can initiate maintenance or preventive actions, such as software rejuvenation. The figure "Watchdog and Application Messages" depicts the interaction between a Windows application and the watchdog process and between the watchdog and the OS. The application uses GetStartMode() to see why the watchdog started it and to decide whether recovery actions, such as reading the checkpoint files into memory, should be performed.
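On the watchdog side, the three message types could be tracked roughly as follows. This is a sketch under stated assumptions rather than WinFT's implementation: the MessageTracker class and Msg enum are hypothetical, any message (not only a heartbeat) is assumed to reset the time-out, and a non-error message is assumed to break a run of successive errors.

```cpp
#include <cassert>

// The three kinds of message the application can send.
enum class Msg { Alive, Error, Idle };

// Hypothetical watchdog-side bookkeeping. Times are plain millisecond
// counts supplied by the caller, which keeps the sketch clock-independent.
class MessageTracker {
public:
    MessageTracker(long timeoutMs, long startMs)
        : timeoutMs_(timeoutMs), lastSeenMs_(startMs) {
        assert(timeoutMs > 0);
    }

    void OnMessage(Msg m, long nowMs) {
        lastSeenMs_ = nowMs;           // assumption: any message proves liveness
        idle_ = (m == Msg::Idle);      // idle state lasts until the next other message
        if (m == Msg::Error) {
            ++consecutiveErrors_;
        } else {
            consecutiveErrors_ = 0;    // assumption: a non-error message ends the run
        }
    }

    // True when no message has arrived within the time-out window.
    bool HeartbeatExpired(long nowMs) const { return nowMs - lastSeenMs_ > timeoutMs_; }

    int ConsecutiveErrors() const { return consecutiveErrors_; }

    // True while the application reports itself idle, a natural moment
    // for preventive actions such as rejuvenation.
    bool IsIdle() const { return idle_; }

private:
    long timeoutMs_;
    long lastSeenMs_;
    int consecutiveErrors_ = 0;
    bool idle_ = false;
};
```

The watchdog loop would call OnMessage() for each message it receives and poll HeartbeatExpired() on a timer.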
Recovery Routes
The watchdog can take one of two actions, based on data received from the user application. The first is a simple relaunch of the application, triggered when a specified threshold of successive error messages is reached or when the "I'm Alive" time-out expires (i.e., the application is hung up).
The other, more drastic, action is to reboot the machine when successive application restarts fail to clear the problem. This typically occurs when an application keeps reporting OS errors and rejuvenation of the OS is a likely solution. To guarantee that the application launches normally after a system reboot, you should place the application's executable file in the Windows Startup Folder.
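The two-level escalation just described, relaunch first and reboot if relaunching does not help, amounts to a small decision function. The sketch below is illustrative only: the Policy fields, the Action enum, and the Decide() function are hypothetical names, and the article gives no actual threshold values.

```cpp
#include <cassert>

// What the watchdog should do next.
enum class Action { None, Relaunch, Reboot };

// Hypothetical tuning parameters; WinFT's real settings are not listed
// in the article.
struct Policy {
    int errorThreshold;  // successive error messages that trigger a relaunch
    int maxRelaunches;   // relaunches tolerated before escalating to a reboot
};

// consecutiveErrors: current run of error messages from the application.
// heartbeatExpired:  true if the "I'm Alive" time-out has elapsed.
// relaunchesSoFar:   relaunches already attempted for this problem.
Action Decide(const Policy& p, int consecutiveErrors,
              bool heartbeatExpired, int relaunchesSoFar) {
    assert(p.errorThreshold > 0 && p.maxRelaunches >= 0);
    const bool failing =
        heartbeatExpired || consecutiveErrors >= p.errorThreshold;
    if (!failing) return Action::None;
    // Relaunching has already been tried often enough: escalate.
    return relaunchesSoFar >= p.maxRelaunches ? Action::Reboot
                                              : Action::Relaunch;
}
```

With errorThreshold set to 3 and maxRelaunches set to 2, for example, the third successive error causes a relaunch, and the same failure after two relaunches causes a reboot. The relaunch counter would need to be reset once the application has run cleanly for some period, a detail this sketch leaves to the caller.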
WinFT was used successfully in the field as a support library for an industrial-control application running under Win 95. The application's availability was increased, and it was able to provide nonstop real-time service.
WinFT seems to be a promising solution for the increasing number of applications that need to run perpetually, such as control systems and servers in client/server applications. It is publicly available from the BYTE Web site (http://www.byte.com/art/download/download.htm), and the latest updates are available from the Dependable Systems Group Web page (http://dsg.dei.uc.pt).