
Fault Tolerance for Windows Applications


February 1997 / Core Technologies / Fault Tolerance for Windows Applications

A library offers reliable operation for Windows 95 applications.

João Carreira, Diamantino Costa, and João Gabriel Silva

An increasing number of business-critical applications are being targeted for the low-end PC market and general-purpose OSes, such as Windows 95 and NT. This class of programs includes on-line transaction processing (OLTP) and data-warehousing applications, control systems, and other business-critical solutions for the finance, telecommunications, retail, and health-care markets.

But moving to such systems poses a problem. How can the PC guarantee the continuous availability and data consistency required for these mission-critical applications? Despite the desktop PC's lower costs, running such applications without any fault-tolerance support can be risky -- even dangerous.

It is possible, however, to use software to implement a certain level of fault tolerance for desktop PCs. A special library that we've written, called WinFT, provides fault-tolerance support for Win32 applications. WinFT performs automatic detection and restarting of failed processes; diagnosis and rebooting of a malfunctioning or strangled OS; checkpointing and recovery of critical volatile data; and preventive actions, such as software rejuvenation (i.e., restarting an application or the OS to get a clean internal state).

WinFT Parts

WinFT is a library of functions that provides fault-tolerance support for Windows applications that must run nonstop or for long periods of time, such as database servers and control systems. WinFT was implemented as a set of objects developed with version 5.0 of the Borland C++ compiler. It uses the Win 95 subset of the Win32 API and is available as both a static library and a dynamic link library (DLL). The modules and related functions that make up the WinFT library are shown in the table "WinFT Functions."

The checkpointing modules keep track of the critical data structures that the programmer declares and manage the task of saving this data to disk. In addition, the modules recover the data when the program restarts after a crash.

The watchdog modules set up processes that monitor the activity of other processes to check for execution problems. A mission-critical application uses WinFT's message-passing functions to indirectly signal the watchdog process if it's caught in a loop or stalled waiting on an OS call. If you don't want to bother with the task of writing exception-handling code within your applications, the WinFT exception-handling module provides a simple API for your use.

WinFT Setup

A TChkp object, which manages data checkpointing, is used throughout an application. You create this object with the TChkp() constructor. The constructor's first parameter lets you specify a directory path and the name of the checkpoint file that stores the critical data. If you want to save a copy of the checkpoint file on a remote system, you specify this path in the second parameter of TChkp(). This lets you restart the application from a backup PC in case of a serious crash.

The three checkpointing methods Critical(), CheckPoint(), and Recover() follow a few rules that are intuitive but worth stating clearly. The first step is to declare the critical data structures using Critical(). You invoke this method as many times as necessary, each time passing a pointer to a critical data structure and its size in bytes.

You call CheckPoint() to save all critical data structures on disk and use Recover() to refill these data structures using data from the file. After an application starts and all the critical structures are declared, you must determine whether the application should recover data from the disk, which is the case when it was restarted because of an error. A GetStartMode() method, described later, fetches this information from a TWDClient object that's managed by a watchdog process.
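The declare, save, and recover sequence can be illustrated with a small stand-in written in portable C++. This is only a sketch under stated assumptions: MiniChkp, Counters, and demo_checkpoint_roundtrip() are hypothetical names invented for illustration, the file layout (raw bytes in registration order) is an assumption, and none of it is the WinFT implementation or its TChkp class.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a checkpointing object; NOT the WinFT TChkp class.
class MiniChkp {
public:
    explicit MiniChkp(std::string path) : path_(std::move(path)) {}

    // Declare one critical region: a pointer to the data and its size in bytes.
    bool Critical(void* data, std::size_t size) {
        if (data == nullptr || size == 0) return false;
        regions_.push_back({static_cast<char*>(data), size});
        return true;
    }

    // Save every declared region to the checkpoint file, in declaration order.
    bool CheckPoint() const {
        std::ofstream out(path_, std::ios::binary | std::ios::trunc);
        if (!out) return false;
        for (const Region& r : regions_)
            out.write(r.ptr, static_cast<std::streamsize>(r.size));
        out.flush();
        return static_cast<bool>(out);
    }

    // Refill every declared region from the checkpoint file.
    bool Recover() const {
        std::ifstream in(path_, std::ios::binary);
        if (!in) return false;
        for (const Region& r : regions_) {
            in.read(r.ptr, static_cast<std::streamsize>(r.size));
            if (in.gcount() != static_cast<std::streamsize>(r.size)) return false;
        }
        return true;
    }

private:
    struct Region { char* ptr; std::size_t size; };
    std::string path_;
    std::vector<Region> regions_;
};

struct Counters { int processed; int failed; };

// Checkpoints two structures, wipes them to simulate a crash, then recovers.
bool demo_checkpoint_roundtrip(const std::string& path) {
    Counters c{41, 2};
    double setpoint = 3.5;

    MiniChkp chk(path);
    chk.Critical(&c, sizeof c);                // declare first ...
    chk.Critical(&setpoint, sizeof setpoint);
    if (!chk.CheckPoint()) return false;       // ... then save

    c = Counters{0, 0};                        // simulated loss of volatile state
    setpoint = 0.0;

    if (!chk.Recover()) return false;          // refill from the file
    return c.processed == 41 && c.failed == 2 && setpoint == 3.5;
}
```

The sketch also shows why the declarations must come before any call to Recover(): the file carries no description of its own contents, so the restarted program has to declare the same structures in the same order before the saved bytes can be put back in place.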

Let Loose the Watchdog

The watchdog can be a separate process, with a graphical interface that lets you set up all the parameters interactively, or a hidden process launched through InitWatchD(). InitWatchD() creates the TWDClient object and a daemon-style process that monitors the health of a specific application. When the application is first launched, InitWatchD() checks to see if it was launched directly by the user. If so, InitWatchD() never returns, but instead launches another instance of the application.

The parent process then turns itself into a watchdog process that monitors the newly launched application. When the second instance of the application calls InitWatchD(), it detects that the application was launched by the watchdog. In this case, InitWatchD() returns, and real application code executes.

Within the application, you use WinFT's watchdog methods ImAlive(), Error(), and Idle() to send messages to the watchdog process. These messages can be "I'm Alives" (also called heartbeats), error messages, or idle notifications, and the watchdog daemon handles them in different ways. Periodic heartbeat messages tell the watchdog that the application is active. Error notifications cover those situations where the application is still active but is detecting errors and having problems getting a job done.

Finally, if the application is idle -- due to user inactivity or the absence of client requests (if it's a server application) -- it should send idle messages to the watchdog so that the watchdog can initiate maintenance or preventive actions, such as software rejuvenation. The figure "Watchdog and Application Messages" depicts the interaction between a Windows application and the watchdog process and between the watchdog and the OS. The application uses GetStartMode() to see why the watchdog started it and decides if recovery actions, such as reading the checkpoint files into memory, should be performed.
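Seen from the application side, the protocol amounts to checking the start mode once and then reporting one of three conditions from the main loop. The following sketch illustrates that shape only. FakeWatchdogClient, the StartMode values, and run_application() are hypothetical: the fake client merely records messages in memory instead of talking to a watchdog process, and the article does not list the values that WinFT's GetStartMode() actually returns.

```cpp
#include <string>
#include <vector>

// Hypothetical start modes; the real WinFT values are not given in the article.
enum class StartMode { UserLaunch, RestartAfterFailure };

// Hypothetical stand-in for a watchdog client. It records each message in a
// log instead of sending it to a separate watchdog process.
class FakeWatchdogClient {
public:
    explicit FakeWatchdogClient(StartMode mode) : mode_(mode) {}

    StartMode GetStartMode() const { return mode_; }
    bool ImAlive() { log_.push_back("alive"); return true; }
    bool Error(int err_code) {
        log_.push_back("error:" + std::to_string(err_code));
        return true;
    }
    bool Idle() { log_.push_back("idle"); return true; }

    const std::vector<std::string>& log() const { return log_; }

private:
    StartMode mode_;
    std::vector<std::string> log_;
};

// Processes a list of simulated work items and reports to the watchdog:
//   item > 0  : a request was handled normally  -> heartbeat
//   item == 0 : nothing to do                    -> idle notification
//   item < 0  : the request failed with -item    -> error notification
// Returns true if the start mode called for recovery.
bool run_application(FakeWatchdogClient& wd, const std::vector<int>& items) {
    bool recovered = false;
    if (wd.GetStartMode() == StartMode::RestartAfterFailure) {
        // A real application would reload its checkpointed data here.
        recovered = true;
    }
    for (int item : items) {
        if (item == 0) {
            wd.Idle();
        } else if (item < 0) {
            wd.Error(-item);
        } else {
            wd.ImAlive();
        }
    }
    return recovered;
}
```

Sending exactly one message per loop iteration is a simplification made for the sketch; a real application would more likely send heartbeats on a timer so that a long-running request is not mistaken for a hang.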

Recovery Routes

The watchdog can take one of two recovery actions, based on data received from the user application. The first is a simple relaunch of the application when a specified threshold of successive error messages is reached or when the "I'm Alive" time-out expires (i.e., the application is hung up).

The other, more drastic, action is to reboot the machine when successive application restarts fail to clear the problem. This typically occurs when an application keeps reporting OS errors and rejuvenation of the OS is a likely solution. To guarantee that the application launches normally after a system reboot, you should place the application's executable file in the Windows Startup Folder.
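The escalation rules described above (successive errors or a missed heartbeat lead to a restart, successive restarts lead to a reboot, and repeated idle messages lead to rejuvenation) can be sketched as a small state machine. This is an illustrative model, not WinFT's watchdog: WatchdogPolicy, Action, and the on_*() methods are hypothetical, time is an abstract tick counter supplied by the caller and assumed never to go backward, and the choice to clear the error and restart counters on any heartbeat or idle message is an assumption. The threshold names are borrowed from the InitWatchD() parameters in the table "WinFT Functions," but their exact semantics in WinFT are not documented in this article.

```cpp
// Hypothetical model of the watchdog's decisions; not the WinFT implementation.
enum class Action { None, RestartApp, RebootMachine, Rejuvenate };

class WatchdogPolicy {
public:
    WatchdogPolicy(unsigned idle_cnt2_rejuv, unsigned errors_2_restart,
                   unsigned alive_timeout, unsigned restarts_2_reboot)
        : idle_cnt2_rejuv_(idle_cnt2_rejuv),
          errors_2_restart_(errors_2_restart),
          alive_timeout_(alive_timeout),
          restarts_2_reboot_(restarts_2_reboot) {}

    // Heartbeat: the application is healthy, so all failure counters clear.
    Action on_alive(unsigned now) {
        healthy(now);
        idles_ = 0;
        return Action::None;
    }

    // Error message: the application is alive but struggling.
    Action on_error(unsigned now) {
        last_seen_ = now;
        idles_ = 0;
        if (++errors_ >= errors_2_restart_) return restart(now);
        return Action::None;
    }

    // Idle message: enough of them in a row trigger preventive rejuvenation.
    Action on_idle(unsigned now) {
        healthy(now);
        if (++idles_ >= idle_cnt2_rejuv_) {
            idles_ = 0;
            return Action::Rejuvenate;
        }
        return Action::None;
    }

    // Periodic check: no message for longer than the time-out means a hang.
    Action on_tick(unsigned now) {
        if (now - last_seen_ > alive_timeout_) return restart(now);
        return Action::None;
    }

private:
    void healthy(unsigned now) {
        last_seen_ = now;
        errors_ = 0;
        restarts_ = 0;
    }

    // Restart the application, or escalate to a reboot once the number of
    // successive restart requests reaches restarts_2_reboot.
    Action restart(unsigned now) {
        last_seen_ = now;
        errors_ = 0;
        if (++restarts_ >= restarts_2_reboot_) {
            restarts_ = 0;
            return Action::RebootMachine;
        }
        return Action::RestartApp;
    }

    unsigned idle_cnt2_rejuv_, errors_2_restart_, alive_timeout_, restarts_2_reboot_;
    unsigned last_seen_ = 0, errors_ = 0, idles_ = 0, restarts_ = 0;
};
```

Keeping the policy separate from process handling makes the thresholds easy to test in isolation. A real watchdog would map RestartApp, RebootMachine, and Rejuvenate onto the corresponding OS calls.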

WinFT was used successfully in the field as a support library for an industrial-control application running under Win 95. The application's availability was increased, and it was able to provide nonstop real-time service.

WinFT seems to be a promising solution for the increasing number of applications that need to run perpetually, such as control systems and servers in client/server applications. It is publicly available from the BYTE Web site (http://www.byte.com/art/download/download.htm), and the latest updates are available from the Dependable Systems Group Web page (http://dsg.dei.uc.pt).


WinFT Functions

Checkpointing    TChkp::TChkp(char *name, char *apath,    Create checkpointing
                 int mode, int amode, BOOL *error)        object, assign backup
                                                          data files

                 BOOL TChkp::Critical                     Declare critical data
                 (void *data, int size)                   structures

                 BOOL TChkp::CheckPoint                   Save data to disk

                 BOOL TChkp::Recover()                    Restore data after a
                                                          system crash

Watchdog         TWDClient * InitWatchD(uint              Create watchdog
functions        idle_cnt2_rejuv, uint errors_2_restart,  process
                 uint time_ref, uint restarts_2_reboot)

                 void CloseWatchD()                       Terminate watchdog
                                                          monitoring

                 BOOL TWDClient::ImAlive(void)            Send periodic "I'm
                                                          Alive" messages to
                                                          watchdog process

                 BOOL TWDClient::Error(int err_code)      Send error messages
                                                          to watchdog process

                 BOOL TWDClient::Idle(void)               Send idle messages to
                                                          watchdog process

                 BOOL TWDClient::SetImAliveTimeout        Assign time-out
                 (uint timeout)                           interval to detect
                                                          hung processes

                 int TWDClient::GetStartMode(void)        Determine cause of
                                                          application start and
                                                          perform recovery if
                                                          necessary

Exception        BOOL InitXceptionHandling()              Exception-handling
handling                                                  function and macros




Watchdog and Application Messages

A message-passing scheme tells the watchdog if the process is hung up or is experiencing problems.


João Carreira, Diamantino Costa, and João Gabriel Silva work at the Dependable Systems Group in the Department of Informatics Engineering at the University of Coimbra in Portugal. João Carreira can be contacted at jcar@eden.dei.uc.pt.
