The specific benchmark compon
ent involved was LU Decomposition, one of the three members of the floating-point portion of the BYTEmark. This benchmark test gets its name from the algorithm it uses, which breaks a single matrix into two matrices: one lower-triangular (all elements above and to the right of the diagonal are zero), the other upper-triangular (all elements below and to the right of the diagonal are zero).
The algorithm might sound odd--and even a little pointless--but as it turns out, this is an effective technique for solving linear equations. You will find the details of the algorithm in many texts on numerical algorithms, such as
Numerical Recipes in Pascal
, by Press, Flannery, Teukolsky, and Vetterling (Cambridge University Press, 1989). What's important for our purposes is the fact that LU Decomposition does a great deal of array manipulation. Furthermore, the elements of the arrays are floating-point double values. Remember that.
The Symptoms
Each of the tests in the benchmar
k suite is run multiple times (at least five) so that the system can form a collection of results to calculate an average. This also allows the system to verify that the results are reasonably close to one another. If the results vary wildly, then either the benchmark has a problem or there's some outside force (say, a background task) running on the system that's affecting the program.
Such was the case with the LU Decomposition test. The test results exhibited a
bimodal
distribution: The score was either an unreasonably low number or a reasonably high number--never anything in between. We were frustrated in our attempts to locate the cause of this effect. For example, on one Windows NT-based system, we found that running the Performance Monitor in the background always produced low numbers. On another system, if we ran one of the other benchmark tests--Numeric Sort--prior to running LU Decomposition, regardless of how many intervening tests we ran, LU Decomposition yielded low numbers.
Furthermore, this behavior appeared only on Intel machines running Windows 95 or NT. We couldn't make it happen on these machines running a 32-bit DOS extender, nor would it occur on a PowerPC machine running NT.
Early Diagnosis
The first suspect was a temporary memory buffer allocated and released within the LU Decomposition test. This was the only case of an OS call within a timing routine, so we moved it outside the timing loop. The results were negative.
Next we investigated the possibility of a memory leak. Fortunately, Mike Spertus of Geodesic Systems was here at BYTE demonstrating his company's product, Great Circle (see
"A Programmer Needs a Maid,"
January BYTE), which detects and corrects memory leaks. He came into the lab and linked his library into the benchmark code. Great Circle gave the benchmarks a clean bill of health--no memory leaks.
We were getting closer, however. Although Great Circle found no leaks, it made the problem disappear. It ha
d something to do with memory, but what?
Proper Diagnosis
The answer came in an E-mail message from Rob Barris of QuickSilver Software. He suggested that it might be a problem with the alignment of memory allocated by
malloc()
. Sure enough, a quick patch to the program allowed us to see that the test returned a low score whenever the allocated memory was
not
aligned to an address evenly divisible by 8. (Returned addresses were always divisible by 4.)
It was all clear now. The LU Decomposition test used 8-byte floating-point doubles. Whenever
malloc()
returned an unaligned memory block, all fetches and stores were consequently much slower. Great Circle had cured the problem because it takes over
malloc()
, and its
malloc()
replacement always returns data aligned to 8-byte boundaries.
The Cure
The solution was apparent: Write a
wrapper
(an enclosing routine) around
malloc()
that provides
aligned memory. Fortunately, the benchmark was written in the days when there was still a need to test 16-bit systems and all memory allocation went through a single routine,
AllocateMemory()
. In a sense, we already had our wrapper; we just had to make it smarter.
So,
malloc()
returns a memory-block pointer, which the wrapper routine examines and simply increments until the address is properly aligned, right? Well, not so fast. First, if you advance the memory block's pointer, you run the risk that a routine might overstep the block's end somewhere down the road. By advancing the address, you effectively shave bytes off the front of the memory block.
Second, when the program finishes with the memory block, the address is passed to the
free()
routine so that the C library can release the memory. Given that the wrapper has advanced the address of that pointer, what is returned to
free()
is not what was given to the application by
malloc()
. The program w
ill crash.
You can eliminate the first problem by having the
malloc()
wrapper allocate slightly more memory than originally requested. This additional "buffer" lets the program advance the pointer with no danger of returning a memory block that's too short.
The fix for the second problem is a bit more difficult. One approach is to allocate an additional pointer's worth of memory at the front of the block. Once the pointer is adjusted appropriately, the real pointer is stored a pointer's distance back in the block. (We say "pointer's distance" because, although most pointers are 32 bits wide, 64-bit machines are already appearing on some desktops.) When it's time to free the memory, a wrapper around
free()
backs up to locate the original pointer. With this scheme, the
malloc()
and
free()
wrappers are
stateless
: They don't have to remember anything about the memory they've allocated; all the information they need is in the memory block. (See the figure
"Stateless
malloc()
Knows Its Own Address"
.)
Nevertheless, we chose a simpler way out. We set up a global array that associates the pointer returned by
malloc()
with the adjusted pointer passed to the routine. We could get away with this because the algorithms of the benchmarks tend to allocate a few big blocks rather than many smaller blocks, so we had no fear of overruning the global array.
The Vector
Now that we have our fix, it's time to ask a few rhetorical questions. Should we have had to do this in the first place? Are we simply bellyaching if we complain that a low-level detail such as the alignment of memory returned by
malloc()
is an issue we really shouldn't be concerned with? Or are we justified in complaining that the compiler vendors--who are, after all, providing products that are supposed to save programmers from the perils of dealing too closely with OS APIs and hardware details--should make better
malloc
()
s?
Our copy of
Standard C Programmer's Quick Reference Manual
, by Plauger and Brodie (Microsoft Press, 1989), simply says of
malloc()
that "you can safely convert the return value to a data object pointer of any type" whose size is no bigger than the maximum allowed by
malloc()
(the maximum being implementation dependent). No mention is made of performance requirements.
However, in
C: A Reference Manual
, by Harbison and Steele (Prentice-Hall, 1995), we find a section on alignment restrictions: "The C programmer is not normally aware of alignment restrictions, because the compiler takes care to place data on the appropriate address boundaries." Just what is meant by "appropriate" here is unclear. We know that on some machines, inappropriate alignment can cause a crash. Is a 4-byte alignment appropriate for an Intel machine? Or is that merely sufficient? Is sufficient enough?
Daniel Boulet of Boulet Fermat Associates sent us Internet mail on this m
atter. He told us that the Unix System V Interface Definition (the third edition of SVR4)
man
page on
malloc()
reads: "The function
malloc()
returns a pointer to a block suitably aligned for any purpose." Then Boulet makes this point: If performance implications don't play a part in the definition of "suitably aligned," then why not just implement
malloc()
to return arbitrarily aligned data on architectures that support it?
The Release
Boulet's question is a good one, but we're still unsure of where the blame--if you can call it that--lies. Should we have known better? If so, shame on us. Should the compiler vendors have known better? If so, shame on them.
For yourself, if your current programming project includes algorithms that manipulate large arrays, be warned--particularly if you're working with Intel machines and you deal with data types that are larger than the bit size of the OS. (We suspect that the reason
malloc()
on Win
dows 95 and NT returned memory aligned to 4-byte boundaries is that those OSes are being touted as 32-bit OSes.) Although our experience was with an array of floating-point doubles, we suspect that an array of data structures might experience similar problems if one of the structure's elements fell on an inappropriate address boundary.
For ourselves, we've "fixed" the BYTEmark. And although it may cause us to sigh over our unnecessary work, we hope that the next release of our favorite 32-bit C++ compilers from Microsoft and Watcom will include a slightly modified version of the run-time library routine
malloc()
.
Editor's note:
As of this writing, the source to the updated BYTEmark suite is available for downloading from the BYTE Web Site (
http://www.byte.com
).