Measure for Measure


October 1994 / Features / Measure for Measure

The SPEC CPU benchmarks provide a standard yardstick for comparing performance across platforms

Oliver Sharp and David F. Bacon

Everyone agrees that the best way to find the machine that meets your needs is to run your own mix of applications on it and measure the results. Since this is often impractical, if not downright impossible, the next-best approach is to run a typical mix of programs and average the results together to come up with a measure of performance. This is what the SPEC benchmark suite tries to do.

Produced by the System Performance Evaluation Corp. (hence the name), the SPEC benchmarks are widely used to compare the performance of platforms that use different processors. By knowing enough about what the SPEC suite and other benchmarks measure, you can discover how best to use these tools to evaluate systems that match your needs.

Before SPEC

Prior to the advent of the SPEC suite, there were two common ways of reporting performance. The first was the easiest: using system parameters, such as the clock rate of the processor or the number of instructions processed per time unit. This latter value, usually expressed in MIPS (millions of instructions per second), was popular for a while. However, it was never terribly accurate in comparing different architectures, and it became even more troublesome when RISC processors became popular.

RISC processors use simple instructions, so they need to process more instructions to do the same amount of work as a CISC machine. The Intel x86 architecture also causes trouble during performance measurements because it has wildly varying execution timings, depending on which instructions are being measured. Manufacturers tried to patch up the problem by using VAX MIPS--millions of VAX-equivalent instructions per second. As you can imagine, there were a lot of complaints about the way these numbers were computed.

The other common strategy for measuring performance relied on ``synthetic'' benchmarks, such as Whetstone and Dhrystone (see the text box ``A World of Benchmarks'' on page 68). These short programs were developed in an attempt to mimic the behavior of existing applications; a programmer typically studied a set of applications and developed code that performed a representative mixture of arithmetic computations, loops, function calls, and so forth.

Aside from the problem of making such codes truly representative of real applications, synthetic benchmarks began to fall afoul of the improvements that were made in compiler optimization. These improved compilers could determine that many computations were not actually being used and optimized them out of the code, making a mockery of the benchmark. Peculiarities in architecture design also skewed results. A system might be particularly efficient in one feature that some benchmark used heavily (e.g., function-call overhead) and would thus look better than it probably should have.

Enter SPEC

Realizing that a realistic and widely used benchmark would be a major step forward, a group of companies, including DEC, Hewlett-Packard, IBM, Intel, and Sun, joined together to form SPEC. This nonprofit company is charged with developing and supporting standardized benchmarks. SPEC is best known for its CPU performance suite, but it has developed, and continues to investigate, benchmarks in other areas, such as graphics and networks.

SPEC has identified a set of programs in widespread use, frozen the source code, established a way to measure performance, and defined a formula for averaging the individual results. The programs are divided into two sets: one that relies on integer computations and one that relies on floating-point operations. The original SPEC benchmark suite was released in 1989 (and is thus called SPEC89). SPEC92 is a more recent follow-up, extending the total number of programs in the two sets from 10 to 20.

Both suites measure the performance of each program and combine the values into summary statistics. The strategy for measuring a program is to time its execution and compute what's known as its SPECratio by dividing a reference value by the execution time. If the reference is 10,000 seconds, for example, a 1000-second run yields a SPECratio of 10. The reference value is the execution time on a VAX-11/780, a popular VAX model.
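
The arithmetic behind a SPECratio is simple enough to sketch in a few lines of Python. The function name and argument names below are illustrative and do not come from the SPEC tools; the figures are the ones used in the example above.

```python
def spec_ratio(reference_seconds: float, measured_seconds: float) -> float:
    """Return the SPECratio: reference (VAX-11/780) time divided by measured time."""
    if measured_seconds <= 0:
        raise ValueError("measured time must be positive")
    return reference_seconds / measured_seconds


# A 10,000-second reference time and a 1000-second run, as in the text.
print(spec_ratio(10_000, 1_000))  # prints 10.0
```

A machine that finishes the same program in half the time gets twice the SPECratio, so bigger is better.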

SPEC89 defined three summary metrics: SPECint89, SPECfp89, and SPECmark89. To compute SPECint89, the benchmark finds the geometric average of the SPECratios for each integer-based program. SPECfp89 is the analogous result for the floating-point programs. SPECmark89 is computed by taking the geometric average of the other two values in an attempt to describe a system's overall performance with a single number.
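
As a rough sketch of how these summary numbers fall out of the individual SPECratios, the following Python computes a geometric average and applies it the way the paragraph above describes. The SPECratio values are invented for illustration and are not measurements of any real machine; the function name is likewise hypothetical.

```python
import math


def geometric_mean(values):
    """Geometric average: the n-th root of the product of n positive values."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("need at least one value, all positive")
    # Summing logarithms avoids overflow when many ratios are multiplied.
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Invented SPECratios, for illustration only.
int_ratios = [20.0, 30.0, 45.0]
fp_ratios = [10.0, 40.0]

spec_int = geometric_mean(int_ratios)            # integer summary
spec_fp = geometric_mean(fp_ratios)              # floating-point summary
spec_mark = geometric_mean([spec_int, spec_fp])  # single overall figure

print(f"SPECint {spec_int:.1f}  SPECfp {spec_fp:.1f}  SPECmark {spec_mark:.1f}")
```

A geometric average is less easily dominated by one unusually high ratio than an arithmetic average would be, which matters when a single program happens to suit a machine especially well.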

SPEC made a few changes to the suite when it released a second, and considerably expanded, version in 1992. It decided that the integer and floating-point measurements were too different to combine into one value, so SPECmarks were eliminated in SPEC92. In addition, there are two new ratings, which are called SPECrate_fp92 and SPECrate_int92. These are designed to measure how well the system handles multitasking and are computed by running multiple copies of a benchmark simultaneously.

The SPECrate formula takes the ratio of reference time to measured time and scales it by a constant value and by the number of instances of the benchmark that are executing. This value cannot be compared to the SPECfp or SPECint rating, but it lets you compare how well one architecture holds up under multitasking relative to another.
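
The shape of that calculation can be sketched as follows. This is only an outline under stated assumptions: the value of the scaling constant is not given here, so it appears as a required `scale` argument with a placeholder value, and the `scaling_efficiency` helper is a hypothetical addition for comparing runs, not part of any SPEC definition. The timings are invented.

```python
def spec_rate(reference_seconds, elapsed_seconds, copies, scale):
    """Throughput-style rating: scale * copies * (reference time / elapsed time).

    `scale` stands in for the constant in the SPECrate formula; its real
    value is not stated in this article, so the caller must supply it.
    """
    if elapsed_seconds <= 0 or copies < 1:
        raise ValueError("elapsed time must be positive and copies >= 1")
    return scale * copies * reference_seconds / elapsed_seconds


def scaling_efficiency(rate_one_copy, rate_n_copies, copies):
    """Hypothetical helper: 1.0 means n copies gave n times the throughput."""
    return rate_n_copies / (copies * rate_one_copy)


# Invented timings: one copy finishes in 1000 s; four simultaneous
# copies all finish within 1250 s.
SCALE = 1.0  # placeholder, not the real SPEC constant
one = spec_rate(10_000, 1_000, copies=1, scale=SCALE)
four = spec_rate(10_000, 1_250, copies=4, scale=SCALE)
print(scaling_efficiency(one, four, copies=4))
```

Because the placeholder constant cancels out of the efficiency figure, the comparison between the one-copy and four-copy runs does not depend on its value.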

The suite has changed substantially, so SPEC89 and SPEC92 values cannot be compared. SPEC recommends that the SPEC89 suite no longer be used, so this article will focus on the newer version.

The SPEC92 floating-point suite contains 14 programs; the integer suite has six. The table ``The SPEC92 Benchmark Suite, Release 1.1'' on page 66 summarizes the programs, noting the language they are written in, their size, numeric precision (if relevant), and whether they vectorize, and gives a brief description of each.

The table ``SPEC Results'' on page 70 shows the SPEC ratings of a number of different machines. The values are a subset of those published in the SPEC newsletter. If you have access to the World Wide Web, you can find the latest newsletters and other information on the University of Tennessee server at http://netlib2.cs.utk.edu/performance/html/PDStop.html.

Interpreting SPEC

Although the SPEC rating shouldn't be used blindly, the existence of the suite and its standardization have constituted a great step forward in benchmarking. It is quite a useful measure for the general-purpose computer user and represents a major improvement over its predecessors. The participation of many different companies keeps the playing field relatively level, although there has been no shortage of intense politicking and internal struggle.

The most important point to keep in mind when reading and comparing SPEC numbers is that they are narrowly focused on measuring the performance of the CPU (or, more accurately, the ability of the CPU, memory system, and compiler to cooperate). While the speed of the CPU is certainly an important part of a machine's performance, other issues can be much more important, depending on the way the machine is being used. For instance, many of the huge mainframes used by banks to handle check transactions offer relatively modest CPU performance because they are optimized for I/O operations. But trying to replace them with a workstation that has an equal or higher SPEC rating would be a total disaster.

Know What You're Measuring

Although the CPU is one of the easier parts of a system to measure, it isn't always the most important one, as the above example illustrates. Performance is always limited by the weakest link in the chain, so the system with the highest SPEC rating isn't necessarily the best one for your particular needs.

A common source of delay, for example, is I/O. While the CPU is waiting for the disk or the network, the number of MFLOPS it could otherwise perform may be impressive but won't help much. I/O is often triggered when the operating system runs out of RAM in the machine and is forced to swap data out. At a critical point, swapping turns into thrashing, and performance drops through the floor. For many systems, doubling the amount of RAM would do much more for performance than doubling the speed of the processor.

Applications where a system handles a series of updates to a database, known as transaction-processing applications, are often more dependent on I/O behavior than on the CPU's performance. To address the needs of this market niche, there are specialized benchmarks that are much more accurate in measuring transaction performance than the SPEC CPU suite.

Another important thing to consider is whether you are running programs that have been compiled to run on your CPU. Because there is so much software available for the Intel x86 chips, many of the fast RISC CPUs use emulation to give users access to more programs. Unfortunately, emulation takes a terrible toll on performance, slowing down a chip by a factor of 3 or more (sometimes much more). For example, although a DEC Alpha 21064 is much faster than a 486/33 when running native code, it's much slower than the 486/33 when emulating the x86 instruction set.

Finally, there is a wide variety of specialized hardware that may be the limiting factor in determining performance. For example, some machines sport a DSP (digital signal processing) chip to manage sound waves; these chips encode a small set of operations into hardware so that they can execute them quickly. If your machine is largely used for sound mixing and has sufficient I/O capacity, the performance of the CPU may not be particularly important. Some machines have special engines that handle the math needed for 3-D graphics. On a more modest scale, the display adapter in a personal computer may be the part of the system that has the most effect on the user's perception of its performance.

SPEC and You

The SPEC applications are designed to reflect the needs of a typical computer user, so you may be able to engage in some selective interpretation to make the statistics more useful to you. The integer programs range from system-administration to programming and business applications, while the floating-point codes include a wide variety of scientific programs.

The simplest way to refine your understanding of the ratings is to pay attention to only one or the other number. If your needs do not include molecular modeling and computational fluid dynamics, for example, you may find that the SPECfp rating of your processor is largely irrelevant. The Intel x86 architecture has never been a very fast floating-point engine, but that fact has had little impact on most of the people who use it.

Some users, however, have unusual requirements and may need more information than they can get from the summary ratings. If you rely on a small group of specialized codes and are concerned about their performance, one solution is to cobble together your own summary statistics by choosing the most closely related members of the SPEC suite and ignoring the others.

In addition to the obvious distinction between integer and floating-point, the table ``The SPEC92 Benchmark Suite, Release 1.1'' shows whether each SPEC program is single- or double-precision and whether it vectorizes well. Some architectures are particularly good or bad at handling one level of precision versus another.

Vectorization can also skew results dramatically; a vector architecture or a superscalar one with a good compiler will execute vectorizable code quickly. If you can vectorize your program, the compiler's efficiency at finding opportunities and the architecture's ability to exploit them are of crucial concern. If, on the other hand, your code vectorizes poorly or not at all, the SPECfp ratings of vectorizing architectures could be misleading.

When you have the SPECratios of all the programs in the suite, you can compensate for these various factors simply by defining your own summary statistic, which you could call MySPECfp. The simplest way to do this is to pick the programs that are similar to your own applications and ignore the rest.

A more sophisticated approach is to assign each application a weight based on how relevant it seems to be to your own needs. Then you compute MySPECfp for each architecture of interest and use that rather than the standard value to make your comparison. You might also compute the ratio of MySPECfp to SPECfp on each architecture to see how much difference your customization makes to the results.
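
One way to carry out such a weighting is sketched below in Python. It assumes a weighted geometric average, chosen here only to stay consistent with the way SPEC combines SPECratios; nothing above prescribes a particular formula. The program names come from the SPEC92 suite, but the SPECratios and weights are invented for illustration.

```python
import math


def weighted_geometric_mean(ratios, weights):
    """Weighted geometric average of SPECratios.

    `ratios` maps program name -> SPECratio; `weights` maps program
    name -> relevance weight. A weight of 0 ignores the program.
    """
    log_sum = 0.0
    weight_sum = 0.0
    for name, weight in weights.items():
        if weight < 0:
            raise ValueError("weights must be non-negative")
        if weight == 0:
            continue
        log_sum += weight * math.log(ratios[name])
        weight_sum += weight
    if weight_sum == 0:
        raise ValueError("at least one weight must be positive")
    return math.exp(log_sum / weight_sum)


# Invented SPECratios for one machine, for illustration only.
ratios = {"047.tomcatv": 100.0, "013.spice2g6": 25.0, "056.ear": 80.0}

# Equal weights over every program, standing in for the standard summary.
standard = weighted_geometric_mean(ratios, {name: 1 for name in ratios})

# A user who cares about tomcatv and spice2g6 but not ear.
my_weights = {"047.tomcatv": 1, "013.spice2g6": 1, "056.ear": 0}
my_spec_fp = weighted_geometric_mean(ratios, my_weights)

print(f"MySPECfp {my_spec_fp:.1f}, ratio to standard {my_spec_fp / standard:.2f}")
```

Setting every weight to 0 or 1 reduces this to the simpler pick-and-ignore approach, so the same function covers both strategies.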

Even though most users can probably match their CPU requirements fairly well by picking and choosing among SPEC programs, there are a few operations that are not well represented. For instance, none of the programs is primarily dependent on the performance of pointer operations. If you have a program that spends most of its time in tight loops walking over complex data structures, the SPEC rating may not reveal which architecture is particularly well tuned for you.

Additionally, none of the SPEC programs is a heavy user of integer division. Although this operation isn't all that common, it is very important in manipulating images compressed using the MPEG format. In general, if you are very much dependent on a specific algorithm that may be unusual in its computational behavior, it is useful to run some of your own tests to supplement the SPEC ratings before you make a final decision about which system is best for your needs.

Reliability and Benchmarks

The most difficult task in benchmarking is achieving consistency in the face of intelligent and motivated adversaries and a broad variety of architectures, compilers, and environments. The war between the benchmark developers and those who try to outwit them has a long and colorful history, attesting to the ingenuity and persistence of both sides.

You can infer some of the tactics of the past by reading the document that defines how SPEC ratings can be computed. It forbids, for example, the insertion of special code into the executable based on the name of the function being compiled. This was a classic gambit of compiler writers, who could substitute highly optimized, hand-tuned code for the key routines in a benchmark.

The measurement programs provided with the benchmark suite also check the output of each program to make sure that the architecture not only runs quickly but also produces the correct answer. Many optimizing compilers offer switches that allow them to assume the program is well behaved so they can use optimizations that would otherwise be unsafe. Not every benchmark user has been completely scrupulous in making sure that these assumptions were correct.

Another old trick is to have a special library that tunes the standard system routines for a particular benchmark. If the benchmark allocates memory only in, say, 200-byte chunks, the allocation routine can be rewritten so that it runs extremely fast. The SPEC suite can be compiled with specialized libraries as long as they are not specific to any individual program. Since the suite as a whole contains such a broad variety of programs, there is relatively little opportunity to affect the overall rating with such dubious tactics.

The guideline document does allow certain favorite tricks, as long as they are documented. For instance, the Unix operating system can usually be put into what is called single-user mode, where a number of the features of the operating-system kernel are disabled. Since there is less system overhead, performance can improve significantly. Source code changes are allowed when they are necessary for portability, but the fact that they were made must be noted when the test results are reported publicly.

Even without any covert gamesmanship, determining the best performance of a program on a given machine is difficult. Modern compilers often provide a lengthy list of switches that allow the programmer to fine-tune the optimization strategy. Subtle interactions can yield substantial differences in final performance that are difficult to predict. The manufacturer has a tremendous incentive to do everything possible to improve its products' SPEC ratings, so it will devote care and attention to that end and will thus achieve better performance than the average programmer would.

However, SPEC is in the process of changing its policy so that the reported results are more in line with the performance that users will actually see. SPEC is introducing a new rating, called the SPECbase, that places a set of restrictions on the flags that can be specified during compilation. So, in addition to SPECfp92, there is now a SPECbase_fp92, and so forth.

The new rating requires, among other things, that the same flags be used for all benchmarks and that the options be safe. When reporting the results for a machine, manufacturers must report either just the SPECbase values or both the SPECbase results and the fully tuned results. The new policy should be in effect by the time you read this.

By providing a large suite of applications and restricting the tricks that manufacturers can use, SPEC has helped to make the numbers game more respectable. Although manufacturers examine each SPEC program carefully and tune a machine to improve its rating, the size and diversity of the SPEC applications make it difficult to run them well without also speeding up everyone else's code. Wherever there are benchmarks, there will be efforts to outwit them, but the existence of the SPEC suite has done much to improve the honesty of reported results.

A Realistic Picture

The SPEC benchmarks are a major improvement over their predecessors. By relying on real applications, they provide a realistic picture of performance. However, they are not perfect. Before you accept the SPEC values as holy writ, you must decide whether the mix of applications in the benchmark suite is similar to your own. You must also consider how important CPU speed is to you and whether some other aspect of the system is the real performance bottleneck.

Seemingly authoritative measurements such as the SPEC values are seductively tempting because they make comparisons so easy. It's up to the savvy customer to look past the numbers to understand how they can be used in making informed decisions.

Contact Information

SPEC, c/o National Computer Graphics Association
2722 Merrilee Dr., Suite 200
Fairfax, VA 22031
(703) 698-9600 ext. 325
fax: (703) 560-2752
E-mail: spec-ncga@cup.portal.com


The SPEC92 Benchmark Suite, Release 1.1



                                    INT/    LANG-           VECTOR- PRECI-
NAME                                FLOAT   UAGE    LINES   IZABLE? SION


008.espresso                        Int     C       11,000  No      N/A
Minimizes Boolean functions.


022.li                              Int     C       5000    No      N/A
Runs Lisp interpreter on nine-queens problem.


023.eqntott                         Int     C       2600    No      N/A
Translates Boolean equations into truth table.


026.compress                        Int     C       1000    No      N/A
Compresses a file using adaptive Lempel-Ziv coding.


072.sc                              Int     C       7100    No      N/A
Calculates values within a spreadsheet based on the curses
Unix cursor-control package.


085.gcc                             Int     C       58,800  No      N/A
Part of the GNU C compiler, translating source files into optimized
Sun-3 assembly language output.


013.spice2g6                        Float   Fortran 15,000  No      Double
Simulates analog circuits.


015.doduc                           Float   Fortran 5300    No      Double
A Monte Carlo simulation based on a thermo-hydraulic model for a nuclear-reactor component.


034.mdljdp2                         Float   Fortran 3600    No      Double
Solves motion equations for a 500-atom model.


039.wave5                           Float   Fortran 6400    No      Single
Solves particle-in-cell simulation of equations of motion on a Cartesian mesh.


047.tomcatv                         Float   Fortran 100     Yes     Double
Generates 2-D boundary-fitted meshes for general geometric domains.


048.ora                             Float   Fortran 300     No      Double
Traces rays through an optical surface containing spherical and planar surfaces.


052.alvinn                          Float   C       200     No      Single
Trains neural networks through a back-propagation algorithm.


056.ear                             Float   C       3300    No      Single
An inner-ear simulation relying on fast Fourier transforms and
other math-library functions.


077.mdljsp2                         Float   Fortran 3100    No      Single
Like 034.mdljdp2, solves motion equations for a model of 500 atoms.


078.swm256                          Float   Fortran 300     No      Single
Solves a system of shallow-water equations using finite-difference
approximations.


089.su2cor                          Float   Fortran 1700    Yes     Double
Calculates the mass of elementary particles in the framework of the Quark Gluon theory.


090.hydro2d                         Float   Fortran 1700    Yes     Double
Uses hydrodynamic Navier Stokes equations to calculate galactical jets.


093.nasa7                           Float   Fortran 800     No      Double
Executes seven program kernels of operations used frequently in NASA
applications, such as fast Fourier transforms and matrix manipulations.


094.fpppp                           Float   Fortran 2100    No      Double
Calculates a 2-electron integral derivative in quantum chemistry applications.


N/A = not applicable.




SPEC Results



One surprising aspect of perusing different SPEC ratings is the overlap between x86 systems and RISC-based workstations. A PC based on a 100-MHz Pentium outperforms a SparcStation 20 in integer calculations and costs less. The fastest RISC processors do, however, provide much better floating-point performance.

                                  CLOCK                             SPEC  SPEC
SYSTEM            PROCESSOR       (INTERNAL/BUS)  EXTERNAL CACHE    INT92 FP92
Compaq Deskpro    486DX2          66 MHz/33 MHz   256 KB (level 2)  32.2  16.0
Digital prototype DECchip 21164   300 MHz         None              300   510
HP 9000 735/125   PA-RISC 7150    125 MHz         256 KB-I/256 KB-D
                                                  (both level 1)    136   201
IBM RS/6000
Model 41          PowerPC 601     80 MHz          512 KB (level 2)  88    99
Intel prototype   Pentium         100 MHz/66 MHz  512 KB (level 2)  100   80.6
SGI Indy          R4600PC         100 MHz         1 MB (level 2)    62.8  49.9
Sun SparcStation
20 Model 6        SuperSparc      60 MHz          1 MB (level 2)    89    103


Oliver Sharp works for Colusa Software in Berkeley, California. David F. Bacon is a researcher at the IBM T. J. Watson Research Center (Hawthorne, NY). Both are doctoral candidates at the University of California-Berkeley. You can contact them on the Internet at oliver@cs.berkeley.edu and dfb@cs.berkeley.edu, respectively, or on BIX c/o ``editors.''
