Archives
 
 
 
  Special
 
 
 
  About Us
 
 
 

Newsletter
Free E-mail Newsletter from BYTE.com

 
    
           
Visit the home page Browse the four-year online archive Download platform-neutral CPU/FPU benchmarks Find information for advertisers, authors, vendors, subscribers Request free information on products written about or advertised in BYTE Submit a press release, or scan recent announcements Talk with BYTE's staff and readers about products and technologies

ArticlesBug-Free Benchmarks


March 1996 / Core Technologies / Bug-Free Benchmarks

Investigation into the erratic behavior of a benchmark leads to an unsettling discovery about memory allocation

Rick Grehan

If you've ever had a perfectly good machine go haywire on you--perhaps a new car that breaks down as you exit the dealer's lot--then you'll understand this. The BYTEmark, a program that we had worked on for better than a year--which had successfully run on CPUs ranging from PowerPCs to 286s, and on OSes from Unix to DOS--began cracking up. We were perturbed; Intel wasn't pleased, either, because the problem caused us to report erroneous numbers on the P6 processor in "BYTEmark Bug Bashed" (December 1995 BYTE).

The Patient

The specific benchmark compon ent involved was LU Decomposition, one of the three members of the floating-point portion of the BYTEmark. This benchmark test gets its name from the algorithm it uses, which breaks a single matrix into two matrices: one lower-triangular (all elements above and to the right of the diagonal are zero), the other upper-triangular (all elements below and to the right of the diagonal are zero).

The algorithm might sound odd--and even a little pointless--but as it turns out, this is an effective technique for solving linear equations. You will find the details of the algorithm in many texts on numerical algorithms, such as Numerical Recipes in Pascal , by Press, Flannery, Teukolsky, and Vetterling (Cambridge University Press, 1989). What's important for our purposes is the fact that LU Decomposition does a great deal of array manipulation. Furthermore, the elements of the arrays are floating-point double values. Remember that.

The Symptoms

Each of the tests in the benchmar k suite is run multiple times (at least five) so that the system can form a collection of results to calculate an average. This also allows the system to verify that the results are reasonably close to one another. If the results vary wildly, then either the benchmark has a problem or there's some outside force (say, a background task) running on the system that's affecting the program.

Such was the case with the LU Decomposition test. The test results exhibited a bimodal distribution: The score was either an unreasonably low number or a reasonably high number--never anything in between. We were frustrated in our attempts to locate the cause of this effect. For example, on one Windows NT-based system, we found that running the Performance Monitor in the background always produced low numbers. On another system, if we ran one of the other benchmark tests--Numeric Sort--prior to running LU Decomposition, regardless of how many intervening tests we ran, LU Decomposition yielded low numbers.

Furthermore, this behavior appeared only on Intel machines running Windows 95 or NT. We couldn't make it happen on these machines running a 32-bit DOS extender, nor would it occur on a PowerPC machine running NT.

Early Diagnosis

The first suspect was a temporary memory buffer allocated and released within the LU Decomposition test. This was the only case of an OS call within a timing routine, so we moved it outside the timing loop. The results were negative.

Next we investigated the possibility of a memory leak. Fortunately, Mike Spertus of Geodesic Systems was here at BYTE demonstrating his company's product, Great Circle (see "A Programmer Needs a Maid," January BYTE), which detects and corrects memory leaks. He came into the lab and linked his library into the benchmark code. Great Circle gave the benchmarks a clean bill of health--no memory leaks.

We were getting closer, however. Although Great Circle found no leaks, it made the problem disappear. It ha d something to do with memory, but what?

Proper Diagnosis

The answer came in an E-mail message from Rob Barris of QuickSilver Software. He suggested that it might be a problem with the alignment of memory allocated by malloc() . Sure enough, a quick patch to the program allowed us to see that the test returned a low score whenever the allocated memory was not aligned to an address evenly divisible by 8. (Returned addresses were always divisible by 4.)

It was all clear now. The LU Decomposition test used 8-byte floating-point doubles. Whenever malloc() returned an unaligned memory block, all fetches and stores were consequently much slower. Great Circle had cured the problem because it takes over malloc() , and its malloc() replacement always returns data aligned to 8-byte boundaries.

The Cure

The solution was apparent: Write a wrapper (an enclosing routine) around malloc() that provides aligned memory. Fortunately, the benchmark was written in the days when there was still a need to test 16-bit systems and all memory allocation went through a single routine, AllocateMemory() . In a sense, we already had our wrapper; we just had to make it smarter.

So, malloc() returns a memory-block pointer, which the wrapper routine examines and simply increments until the address is properly aligned, right? Well, not so fast. First, if you advance the memory block's pointer, you run the risk that a routine might overstep the block's end somewhere down the road. By advancing the address, you effectively shave bytes off the front of the memory block.

Second, when the program finishes with the memory block, the address is passed to the free() routine so that the C library can release the memory. Given that the wrapper has advanced the address of that pointer, what is returned to free() is not what was given to the application by malloc() . The program w ill crash.

You can eliminate the first problem by having the malloc() wrapper allocate slightly more memory than originally requested. This additional "buffer" lets the program advance the pointer with no danger of returning a memory block that's too short.

The fix for the second problem is a bit more difficult. One approach is to allocate an additional pointer's worth of memory at the front of the block. Once the pointer is adjusted appropriately, the real pointer is stored a pointer's distance back in the block. (We say "pointer's distance" because, although most pointers are 32 bits wide, 64-bit machines are already appearing on some desktops.) When it's time to free the memory, a wrapper around free() backs up to locate the original pointer. With this scheme, the malloc() and free() wrappers are stateless : They don't have to remember anything about the memory they've allocated; all the information they need is in the memory block. (See the figure "Stateless malloc() Knows Its Own Address" .)

Nevertheless, we chose a simpler way out. We set up a global array that associates the pointer returned by malloc() with the adjusted pointer passed to the routine. We could get away with this because the algorithms of the benchmarks tend to allocate a few big blocks rather than many smaller blocks, so we had no fear of overruning the global array.

The Vector

Now that we have our fix, it's time to ask a few rhetorical questions. Should we have had to do this in the first place? Are we simply bellyaching if we complain that a low-level detail such as the alignment of memory returned by malloc() is an issue we really shouldn't be concerned with? Or are we justified in complaining that the compiler vendors--who are, after all, providing products that are supposed to save programmers from the perils of dealing too closely with OS APIs and hardware details--should make better malloc () s?

Our copy of Standard C Programmer's Quick Reference Manual , by Plauger and Brodie (Microsoft Press, 1989), simply says of malloc() that "you can safely convert the return value to a data object pointer of any type" whose size is no bigger than the maximum allowed by malloc() (the maximum being implementation dependent). No mention is made of performance requirements.

However, in C: A Reference Manual , by Harbison and Steele (Prentice-Hall, 1995), we find a section on alignment restrictions: "The C programmer is not normally aware of alignment restrictions, because the compiler takes care to place data on the appropriate address boundaries." Just what is meant by "appropriate" here is unclear. We know that on some machines, inappropriate alignment can cause a crash. Is a 4-byte alignment appropriate for an Intel machine? Or is that merely sufficient? Is sufficient enough?

Daniel Boulet of Boulet Fermat Associates sent us Internet mail on this m atter. He told us that the Unix System V Interface Definition (the third edition of SVR4) man page on malloc() reads: "The function malloc() returns a pointer to a block suitably aligned for any purpose." Then Boulet makes this point: If performance implications don't play a part in the definition of "suitably aligned," then why not just implement malloc() to return arbitrarily aligned data on architectures that support it?

The Release

Boulet's question is a good one, but we're still unsure of where the blame--if you can call it that--lies. Should we have known better? If so, shame on us. Should the compiler vendors have known better? If so, shame on them.

For yourself, if your current programming project includes algorithms that manipulate large arrays, be warned--particularly if you're working with Intel machines and you deal with data types that are larger than the bit size of the OS. (We suspect that the reason malloc() on Win dows 95 and NT returned memory aligned to 4-byte boundaries is that those OSes are being touted as 32-bit OSes.) Although our experience was with an array of floating-point doubles, we suspect that an array of data structures might experience similar problems if one of the structure's elements fell on an inappropriate address boundary.

For ourselves, we've "fixed" the BYTEmark. And although it may cause us to sigh over our unnecessary work, we hope that the next release of our favorite 32-bit C++ compilers from Microsoft and Watcom will include a slightly modified version of the run-time library routine malloc() .

Editor's note: As of this writing, the source to the updated BYTEmark suite is available for downloading from the BYTE Web Site ( http://www.byte.com ).


Acknowledgments

Thanks to Mike Spertus, Rob Barris, and Daniel Boulet. Also, the compilers whose malloc() routine exhibited the behavior described were Visual C++ 2.0 from Microsoft and Watcom C++ 10.5 from Watcom. Other compilers may or may not produce similar results.


Stateless malloc() Knows Its Own Address

illustration_link (7 Kbytes)

A "stateless" malloc() wrapper. In (a) , malloc() returns a pointer (mptr) to a block, the size of which is purposefully set longer than needed.

In (b) , the wrap per locates the first aligned address within the block and stores the original pointer just "behind" it. The aligned pointer (aptr) is returned to the application. Later, a wrapper around free() can retrieve the original pointer by looking behind aptr.


Rick Grehan, who developed the BYTEmark benchmark suite, is a senior technical editor for BYTE reviews. You can reach him on the Internet or BIX at rick_g@bix.com .

Up to the Core Technologies section contentsGo to previous article: Go to next article: Internet-Aware ApplicationsSearchSend a comment on this articleSubscribe to BYTE or BYTE on CD-ROM  
Flexible C++
Matthew Wilson
My approach to software engineering is far more pragmatic than it is theoretical--and no language better exemplifies this than C++.

more...

BYTE Digest

BYTE Digest editors every month analyze and evaluate the best articles from Information Week, EE Times, Dr. Dobb's Journal, Network Computing, Sys Admin, and dozens of other CMP publications—bringing you critical news and information about wireless communication, computer security, software development, embedded systems, and more!

Find out more

BYTE.com Store

BYTE CD-ROM
NOW, on one CD-ROM, you can instantly access more than 8 years of BYTE.
 
The Best of BYTE Volume 1: Programming Languages
The Best of BYTE
Volume 1: Programming Languages
In this issue of Best of BYTE, we bring together some of the leading programming language designers and implementors...

Copyright © 2005 CMP Media LLC, Privacy Policy, Your California Privacy rights, Terms of Service
Site comments: webmaster@byte.com
SDMG Web Sites: BYTE.com, C/C++ Users Journal, Dr. Dobb's Journal, MSDN Magazine, New Architect, SD Expo, SD Magazine, Sys Admin, The Perl Journal, UnixReview.com, Windows Developer Network