soul-searching is a new benchmark, known as SPEC95. Thanks to new test programs and a new baseline machine, SPEC95 creates a more level playing field for comparing different systems and microprocessors.
Of course, you should view any benchmarks with a skeptical eye. Nobody has yet invented a canned benchmark test that precisely measures how your system will perform when running your software. For some users, the only worthwhile benchmarks are the ones that they create and run themselves using real applications. But standardized benchmarks such as SPEC95 are nevertheless useful for obtaining ballpark estimates of how different systems will perform under actual conditions. At the very least, you can use them as a broad, initial screen.
Rocketing Ratios
SPEC95 includes two test suites. One, written in C, measures integer performance; another, written in FORTRAN, measures floating-point performance. These programs deliver their results indexed to a standardized baseline system (a 40-MHz Sun SparcStation 10) scoring 1.0. In other words, if the SPECint95 result is 5.0, then the tested system is five times faster at integer tasks than the baseline system.
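To make the indexing concrete, here is a minimal Python sketch of how a ratio of this kind can be computed. The run times are invented for illustration, the names `spec_ratio` and `composite` are hypothetical, and combining per-program ratios with a geometric mean is an assumption that this article doesn't spell out.

```python
import math

def spec_ratio(baseline_seconds, measured_seconds):
    """How many times faster the tested system is than the baseline."""
    if baseline_seconds <= 0 or measured_seconds <= 0:
        raise ValueError("run times must be positive")
    return baseline_seconds / measured_seconds

def composite(ratios):
    """Combine per-program ratios with a geometric mean (an assumption)."""
    if not ratios:
        raise ValueError("need at least one ratio")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical run times in seconds: (baseline machine, tested machine).
# These are NOT published SPEC figures.
timings = {
    "program_a": (1800.0, 360.0),
    "program_b": (2400.0, 600.0),
}

ratios = {name: spec_ratio(b, t) for name, (b, t) in timings.items()}
for name, r in ratios.items():
    print(f"{name}: {r:.1f}")
print(f"composite: {composite(list(ratios.values())):.2f}")
```

By construction, the baseline machine itself scores 1.0 on every program, and a system that finishes a program in one-fifth of the baseline's time scores 5.0 on it.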
SPEC92's baseline system was a Digital VAX-11/780. This system was also the baseline for another, more dubious, benchmark: Dhrystone MIPS. This benchmark fell into disrepute over the years, and MIPS is sometimes said to stand for "meaningless indicator of performance" or "marketing's idea of performance."
Actually, MIPS and SPEC92 were tolerable benchmarks until some CPUs began scoring ratios in the hundreds. For example, Digital Equipment estimates that its Alpha 21164 processor scores greater than 500 SPECint92 and 750 SPECfp92. Digital predicts that a next-generation Alpha processor, which is scheduled for introduction in 1997, will attain 1000 SPECint92. Sky-high ratios like these obscure the differences between systems and might indicate that some tests ran too fast to permit accurate measurement.
To keep the test ratios from soaring out of control (at least for a while), SPEC95 adopted the new SparcStation 10 baseline system. As a result, Digital says that the same Alpha 21164 chip that was estimated at 500 SPECint92 and 750 SPECfp92 now achieves 11 SPECint95 and 17 SPECfp95.
Another problem SPEC tackled was optimized compilers. SPEC had to abandon one floating-point test entirely when a new FORTRAN compiler knocked performance ratios right out of the park. The combination of faster systems and smarter compilers creates a problem for all types of benchmark programs.
How do you write a program that's complex enough to test a system acceptably well under all conditions but that also runs in a reasonable amount of time? SPEC's latest answer is that you can't. That's why if you run the full suite of SPEC95 tests on the new baseline machine, you won't get your results for two days.
New Requirements
The SPEC95 benchmark suite consists of programs culled from various sources, primarily academic and scientific. SPEC's first alteration was to replace some of SPEC92's small programs with more demanding ones. The goal was not only to create longer run times but also to present a more accurate picture of true performance by using larger, more resource-intensive programs. The SPEC95 code is portable and runs on just about any flavor of Unix. Soon there will also be a version available for Windows NT.
Of course, compiled programs measure the efficiency of a compiler as much as they measure the performance of a system. SPEC answers this criticism in two ways. First, SPEC acknowledges that compilers and optimizers can have a significant impact on the results. Second, SPEC now requires vendors to run the benchmarks with limited optimizations--no more than four optimization flags. Vendors must use the same flags for all tests and report their optimizations.
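The two constraints just described (no more than four flags, and the same flags for every test) are simple enough to check mechanically. The following is a minimal sketch, not SPEC's own tooling: the function name `check_baseline_flags`, the test names, and the flag strings are all hypothetical placeholders.

```python
MAX_BASELINE_FLAGS = 4  # limit as stated in the article

def check_baseline_flags(flags_by_test, max_flags=MAX_BASELINE_FLAGS):
    """Return a list of rule violations; an empty list means no problems found."""
    problems = []
    flag_sets = {name: tuple(flags) for name, flags in flags_by_test.items()}

    # Rule 1: no test may be compiled with more than max_flags flags.
    for name, flags in flag_sets.items():
        if len(flags) > max_flags:
            problems.append(
                f"{name}: {len(flags)} flags exceeds the limit of {max_flags}"
            )

    # Rule 2: every test must use the same flags.
    if len(set(flag_sets.values())) > 1:
        problems.append("flags differ between tests")

    return problems

# Placeholder flag names, made up for illustration only.
report = {
    "test_one": ["-opt1", "-opt2"],
    "test_two": ["-opt1", "-opt2"],
}
print(check_baseline_flags(report) or "no violations found")
```

A report in which one test sneaks in a fifth flag, or in which two tests are built differently, would come back with one message per violation.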
SPEC never intended for SPEC92 to measure I/O performance, but sometimes the larger tests overflowed a system's main memory, forcing it to use virtual memory. As a result, machines with faster disk I/O performed much better. To avoid this situation, SPEC95 requires the test system to have at least 64 MB of RAM (Windows NT systems, too).
By changing the test code and defining a new baseline, SPEC has made it almost impossible to devise a conversion formula that translates SPEC92 results into SPEC95 numbers. SPEC wisely discourages this because the two tests are not comparable. Major elements of the SPEC92 suite don't exist in SPEC95. The new benchmarks place less emphasis on floating-point math, because integer operations are more typical in real-world applications. And some tests in the SPEC95 suite run for a given period of time rather than for a given number of iterations, making comparisons with SPEC92 still more difficult.
Therefore, to obtain SPEC95 results for older systems, you have to run the SPEC95 suite on those machines. Unfortunately, many of them can't meet the minimum RAM requirement of 64 MB. That's why you won't see SPEC benchmarks for all six generations of the Intel x86 architecture going back to 1978. For those kinds of historical comparisons, we're still stuck with MIPS.
Looking for Respect
One of the best things about SPEC is that it's unbiased. Even though vendors such as IBM and Intel help define the benchmarks, SPEC is a nonprofit organization that makes everybody play by the same rules.
Limiting the compiler optimizations is just one example. There's also a whole set of "run rules" that govern compilation, testing, and system configuration. Vendors must follow yet another set of rules when publishing their test results.
Official SPEC95 test reports will have at least two numbers. SPECint_base95 measures integer performance with minimal compiler optimizations; SPECfp_base95 does the same for floating-point performance. These are probably the most trustworthy numbers because they obey the most stringent rules. However, it's likely that you'll see two additional results: SPECint95 and SPECfp95. These tests allow maximum compiler optimizations, which brings the compiler's performance into the mix.
However, some system vendors prefer to report SPECint_rate95, which is an entirely different result; it measures throughput ratios. Instead of measuring a machine's performance while running a single program, SPECint_rate95 is based on repeated tests that count how many iterations a machine performs within a fixed amount of time. Here, factors such as cache efficiency make a difference.
Vendors are free to use any compiler optimization that they want for the SPECint_rate95 test. If they decide to use minimal optimizations, then they can report the result as SPECint_rate_base95. The parallel floating-point equivalents for these two tests are SPECfp_rate95 and SPECfp_rate_base95.
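The difference between timing a single run and counting iterations in a fixed window is easy to see in code. What follows is a minimal sketch of the fixed-window idea as described above, not SPEC's actual harness; `count_iterations`, `throughput_ratio`, and `sample_workload` are hypothetical names, and the workload is a stand-in.

```python
import time

def count_iterations(workload, window_seconds):
    """Run workload repeatedly; count the runs started before the window expires."""
    deadline = time.perf_counter() + window_seconds
    completed = 0
    while time.perf_counter() < deadline:
        workload()
        completed += 1
    return completed

def throughput_ratio(tested_count, baseline_count):
    """Express a tested system's iteration count relative to a baseline count."""
    if baseline_count <= 0:
        raise ValueError("baseline count must be positive")
    return tested_count / baseline_count

def sample_workload():
    # Stand-in for a real test program: a small integer-only loop.
    sum(i * i for i in range(10_000))

iterations = count_iterations(sample_workload, window_seconds=0.2)
print(f"iterations in window: {iterations}")
```

The count depends on the machine, so no particular figure should be expected. Dividing a tested machine's count by the count from a reference machine, as `throughput_ratio` does, yields a ratio in the same spirit as the single-run scores.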
How to Use SPEC95
Anyone can purchase the SPEC benchmark suite on CD-ROM for $600. It includes all the tools you need to compile and run the programs. Vendors can submit results to SPEC, which reviews them. If vendors don't conform to all the run rules, their results don't get published in the SPEC newsletter. Of course, SPEC can't stop anyone from publishing numbers elsewhere.
When interpreting SPEC results, it's important to keep a few things in mind. First, although SPEC has attempted to devise a suite that closely mimics system behavior when running real applications, these are still synthetic benchmarks. Your mileage may vary.
Second, remember that SPEC95 does not test I/O performance. If your application is I/O-intensive--on-line transaction processing, for instance--SPEC95 probably won't be as meaningful as a disk-I/O benchmark. If the responsiveness of a GUI is important to you, SPEC95 isn't the best choice for that, either.
Finally, don't attempt to map specific SPEC95 test programs to your real-world applications. SPEC95 is a collection of programs that lets you compare one system's basic performance to another's. There's still no substitute for running real programs on the system you're trying to evaluate.
If you're going to base a major purchasing decision on SPEC results, you might want to compile and run the tests yourself. At the very least, obtain the benchmark results directly from SPEC. If these results are markedly worse than the vendor's published numbers, demand an explanation from the vendor. SPEC's new rules should make it more difficult for vendors to rig the tests with their own benchmark-specific optimizations. SPEC has put the world on notice that if it uncovers any such optimizations, it will change the suite to close the loophole.
SPEC deserves praise for its dedication to providing reliable test data. SPEC95 is a definite improvement over SPEC92.
WHERE TO FIND
Standard Performance Evaluation Corp.
National Computer Graphics Association
Fairfax, VA
Phone: (703) 698-9604
E-Mail: spec-ncga@cup.portal.com
It's based on a new suite of programs.
SPEC95 INTEGER TESTS
Game of Go                                              099.go
Motorola 88000 RISC CPU simulator                       124.m88ksim
GNU C compiler                                          126.gcc
File compression/decompression                          129.compress
LISP interpreter                                        130.li
JPEG compression/decompression                          132.ijpeg
String and integer manipulations in the Perl language   134.perl
Database                                                147.vortex
SPEC95 FLOATING-POINT TESTS
Mesh generator                                          101.tomcatv
Shallow-water model                                     102.swim
Quantum physics                                         103.su2cor
Astrophysics                                            104.hydro2d
Multigrid solver in 3-D potential field                 107.mgrid
Differential equations                                  110.applu
Simulated turbulence in a cube                          125.turb3d
Weather conditions and distribution of pollutants       141.apsi
Quantum chemistry                                       145.fpppp
Plasma physics                                          146.wave5
Because SPEC95 numbers are indexed to a different baseline system, they can't be compared directly to SPEC92 values. (Values shown are for a 150-MHz Pentium Pro.) Even the ratio of integer to floating-point performance might vary from the old benchmark, as this example for the Pentium Pro 150 shows.

                  SPEC95    SPEC92
Floating-point    5.41      220
Integer           6.08      276.3
SPECint/SPECfp    1.12      1.26
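The arithmetic behind the last row, and the reason no single conversion factor maps SPEC92 onto SPEC95, can be checked with a few lines of Python. This is only a worked example using the Pentium Pro figures from the table; the variable and function names are arbitrary.

```python
# Scores for the 150-MHz Pentium Pro, copied from the table above.
scores = {
    "SPEC95": {"fp": 5.41, "int": 6.08},
    "SPEC92": {"fp": 220.0, "int": 276.3},
}

def int_to_fp(suite_scores):
    """Ratio of integer score to floating-point score within one suite."""
    return suite_scores["int"] / suite_scores["fp"]

for suite, suite_scores in scores.items():
    print(f"{suite}: SPECint/SPECfp = {int_to_fp(suite_scores):.2f}")

# If one multiplier converted SPEC92 into SPEC95, these two factors would match.
int_scale = scores["SPEC92"]["int"] / scores["SPEC95"]["int"]
fp_scale = scores["SPEC92"]["fp"] / scores["SPEC95"]["fp"]
print(f"SPEC92/SPEC95 integer factor:        {int_scale:.1f}")
print(f"SPEC92/SPEC95 floating-point factor: {fp_scale:.1f}")
```

The integer factor comes out near 45.4 and the floating-point factor near 40.7, so even on this one machine the two suites don't scale by the same amount.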
Tom Yager is a freelance writer and an evangelist for the Matrox Video Products Group. He works from his research lab in North Texas. You can reach him on the Internet at tyager@maxx.net or on BIX c/o "editors."