
Why Mainframes Rarely Crash


April 1998 / Cover Story / Crash-Proof Computing / Why Mainframes Rarely Crash

Mainframes can achieve "four nines" or "five nines" availability: 99.99 or 99.999 percent uptime. That translates into only 5 to 53 minutes of downtime per year. In fact, IBM's Server Group claims that the mean time between critical failures (MTBCF) for its System/390 mainframes -- that is, the average time between failures that force a reboot and an initial program load -- is 20 to 30 years.
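The downtime figures follow directly from the availability percentages. As a minimal sketch of the arithmetic (the function name and the 365-day year are illustrative assumptions, not anything IBM publishes):

```python
# Assumption: a 365-day year (525,600 minutes); leap years are ignored.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime, in minutes, implied by an availability percentage."""
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


if __name__ == "__main__":
    for pct in (99.99, 99.999):
        minutes = downtime_minutes_per_year(pct)
        print(f"{pct}% availability -> about {minutes:.1f} minutes of downtime per year")
```

"Four nines" works out to roughly 53 minutes a year and "five nines" to roughly 5 minutes, which matches the range quoted above.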

Millions of PC users would be overjoyed with an MTBCF of just one day. Yet mainframes are big, complex systems that often have clusters of CPUs, gigabytes of main memory, and thousands of users. What makes them so reliable?

Mainframe experts say that it's a matter of priorities. When a PC crashes, even the system administrator might not hear about it, much less the vendors who made the system, the OS, and the application software. The user shrugs, reboots, and keeps right on working. When a mainframe crashes, however, it's a major catastrophe. It's General Motors calling up IBM to demand answers. And even if GM doesn't make the call, the mainframe does. Periodically, the massive machines dial up IBM's lab in Poughkeepsie, New York, to upload error logs and download updates. "Even if it doesn't crash, we know about it," says Lisa Spainhower, System/390 senior technical staff member.

In the early 1980s, Big Blue set a goal of increasing availability by a factor of 100, as measured by yearly uptime. IBM achieved that goal, says Spainhower. "Frankly, we didn't do it because it was a fun engineering project," she explains. "We did it because our customers demanded it."

Because everyone keeps detailed logs, problems rarely get ignored for long. There's too much at stake. Of course, it helps that mainframes have full-time technicians available to keep them up and running. They also have redundant hardware, extremely protective OSes, and stable applications.

"The design of a crash-proof system must be pervasive," explains Guru Rao, System/390 chief engineer. "It starts with your choice of technology and components, and it extends all the way to the design of the OS, the hardware and software, and the customer's applications."

System/390 maintains separate memory partitions for the OS (OS/390), the software-subsystem components (e.g., DB2 database drivers), the transactional middleware (e.g., the Customer Information Control System, or CICS), and the applications. IBM introduced this so-called Enterprise Systems Architecture (ESA) in the late 1980s, basing it on the earlier partitioning of MVS (Multiple Virtual Storage). Compared to MVS, ESA has more partitions and faster interprocess communications (IPC).

As a result, it's exceedingly rare for a crashed application to bring down the entire system. Even if a critical middleware component, such as CICS, fails, System/390's automatic restart manager can restore the task.
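The general idea of containing a failure and restarting the failed task can be sketched in a few lines. This is not IBM's automatic restart manager, whose internals the article does not describe; `supervise` and `make_flaky_task` are hypothetical names, and a Python exception stands in for a crashed middleware component:

```python
from typing import Callable, Tuple


def supervise(task: Callable[[], str], max_restarts: int = 3) -> Tuple[str, int]:
    """Run task; if it fails, restart it, up to max_restarts times.

    Returns the task's result and the number of restarts that were needed.
    """
    restarts = 0
    while True:
        try:
            return task(), restarts
        except Exception as err:
            # The failure is contained here instead of taking down the caller.
            if restarts >= max_restarts:
                raise RuntimeError("task still failing after restarts") from err
            restarts += 1


def make_flaky_task(failures_before_success: int) -> Callable[[], str]:
    """Build a hypothetical task that fails a fixed number of times, then succeeds."""
    remaining = failures_before_success

    def task() -> str:
        nonlocal remaining
        if remaining > 0:
            remaining -= 1
            raise RuntimeError("simulated middleware failure")
        return "transaction committed"

    return task


if __name__ == "__main__":
    result, restarts = supervise(make_flaky_task(2))
    print(f"{result} after {restarts} restarts")
```

A real system would also isolate each task in its own address space, as the partitions described above do; the sketch shows only the detect-and-restart loop.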

"These systems, like PCs, do fail," notes Spainhower. "It's just that when they fail, they detect the errors and recover from them with greater reliability."

Interestingly, mainframe OSes aren't any bigger than OSes for PCs. They contain a lot less code to support GUIs, and a lot more code for error detection, error isolation, and recovery. They're not growing as fast as OSes for PCs are, and their code tends to remain more stable.

"It would almost take an act of God to change the dispatch er in IBM's mainframe OS," says Dr. Barry Feigenbaum, senior software engineer for IBM network-computing software solutions. "It's not quite the same on PC OSes."

As ambitious PC vendors try to encroach on the territory of enterprise servers, they will have to address the same concerns that mainframe vendors did in the 1980s. The contest isn't about megahertz and megabytes; it's about high availability. And that will require PC vendors to radically change their priorities.


[Illustration: IBM OS/390 System Architecture]


Copyright © 2005 CMP Media LLC