David F. Bacon and Peter Wayner
Computer buyers have long recognized added cache memory as a reasonable way to turbocharge a system's performance. Nowadays, however, the case for separate caches is weakening: newer microprocessors put more cache directly on the CPU die, and multitasking OSes fragment memory demands, eroding much of the performance advantage that cache memory is supposed to provide.
Recent generations of CPU chips have had enough silicon real estate to include a small on-chip cache. These caches have generally been in the range of 8 to 32 KB, which is too small to help many applications. As a result, many computer systems have been built with a larger, off-chip L2 (Level 2) cache to supplement the on-chip L1 cache.
However, the caches that ship with the processor are getting larger. Intel's newly announced P6, for example, has 256 KB of L2 cache built into the processor package, while Digital Equipment's Alpha 21164 has 96 KB of on-chip L2 cache memory. With large integrated caches like these, the complexity and expense of adding a separate L2 cache to a PC or workstation make less sense, so we can expect to see fewer machines with separate caches in years to come.
Large software packages and multitasking OSes like OS/2 Warp can destroy the value of a cache if it isn't large enough to hold all the code being executed. When the CPU switches between jobs, it can't find the information it needs in the cache, and it must request it from the substantially slower main memory. Users of Microsoft Windows, for example, may already notice this effect when they ask their system to print in the background. Many machines can't keep both the printing code and the Windows code in the cache simultaneously, so the constant switching makes the system run at the speed of main memory rather than the speed of the cache.
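To make the effect concrete, here is a minimal Python sketch of two jobs taking turns on one cache. It assumes a fully associative cache with least-recently-used replacement and measures sizes in cache lines; the class LRUCache, the function hit_rate, and all of the sizes are hypothetical illustrations, not a model of any particular CPU or of Windows.

```python
from collections import OrderedDict


class LRUCache:
    """Hypothetical fully associative cache with LRU replacement."""

    def __init__(self, capacity_lines):
        self.capacity = capacity_lines
        self.lines = OrderedDict()  # address -> True, oldest entry first
        self.hits = 0
        self.misses = 0

    def access(self, address):
        if address in self.lines:
            self.lines.move_to_end(address)  # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1
            self.lines[address] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)  # evict least recently used


def hit_rate(capacity_lines, job_lines, switches=50):
    """Alternate between two jobs, each sweeping its own working set once."""
    cache = LRUCache(capacity_lines)
    job_a = range(0, job_lines)                      # e.g. the foreground code
    job_b = range(1_000_000, 1_000_000 + job_lines)  # e.g. background printing
    for _ in range(switches):
        for job in (job_a, job_b):
            for address in job:
                cache.access(address)
    return cache.hits / (cache.hits + cache.misses)


big_enough = hit_rate(capacity_lines=256, job_lines=100)
too_small = hit_rate(capacity_lines=128, job_lines=100)
print(f"cache holds both jobs: {big_enough:.2f}")
print(f"cache holds only one:  {too_small:.2f}")
```

Under these assumptions, the 256-line cache holds both 100-line jobs at once and the hit rate comes out at 0.98. The 128-line cache holds only one job, and the hit rate drops to 0.00 because each job evicts the other's lines just before they are needed again.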
Look for innovations in cache design driven by the growing presence of multiprocessors. Multiprocessors are just beginning to break into the mainstream server market, and with the demands of desktop conferencing and high-end multimedia applications, multiprocessors are likely to become the platform of choice for power users before too long.
Cache design for multiprocessors is considerably more complicated. If processor A wishes to update a memory location cached by processor B, B's copy must be either invalidated or updated by A. Even worse, if B has already modified its copy, then before A can proceed, B's data must be either flushed back to main memory or transmitted directly to processor A. So far, we've seen two different approaches to solving this problem. Either all the processors monitor all the memory traffic, looking for potential conflicts with their locally cached data (a "snoopy" cache), or the main memory controller keeps track of which processors have cached which memory locations (a "directory-based" cache).
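The bookkeeping described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not any vendor's protocol: it uses the invalidate-and-flush options from the text, tracks copies in a directory, and also tallies how many tag checks a snoopy bus would have needed for the same traffic (one per other processor per transaction). The class DirectoryCoherence, its counters, and the address 0x40 are all hypothetical names chosen for illustration.

```python
class DirectoryCoherence:
    """Toy write-invalidate coherence model; caches never evict lines."""

    def __init__(self, num_procs):
        self.num_procs = num_procs
        self.memory = {}                          # address -> value
        self.caches = [{} for _ in range(num_procs)]  # addr -> [state, value]
        self.sharers = {}                         # address -> set of proc ids
        self.directory_messages = 0  # targeted flush/invalidate messages
        self.snoop_checks = 0        # tag checks a snoopy bus would need

    def _transaction(self):
        # On a shared bus, every other processor must check each transaction.
        self.snoop_checks += self.num_procs - 1

    def read(self, proc, addr):
        line = self.caches[proc].get(addr)
        if line is not None:
            return line[1]                        # local hit, no traffic
        self._transaction()
        for other in self.sharers.get(addr, set()):
            other_line = self.caches[other][addr]
            if other_line[0] == "M":              # dirty copy elsewhere
                self.memory[addr] = other_line[1]  # flush it to main memory
                other_line[0] = "S"
                self.directory_messages += 1
        value = self.memory.get(addr, 0)
        self.caches[proc][addr] = ["S", value]
        self.sharers.setdefault(addr, set()).add(proc)
        return value

    def write(self, proc, addr, value):
        line = self.caches[proc].get(addr)
        if line is not None and line[0] == "M":
            line[1] = value                       # already the sole owner
            return
        self._transaction()
        for other in self.sharers.get(addr, set()) - {proc}:
            other_line = self.caches[other].pop(addr)  # invalidate the copy
            if other_line[0] == "M":
                self.memory[addr] = other_line[1]      # flush before dropping
            self.directory_messages += 1
        self.caches[proc][addr] = ["M", value]
        self.sharers[addr] = {proc}


system = DirectoryCoherence(num_procs=8)
A, B = 0, 1
system.write(B, 0x40, 7)          # B modifies its cached copy
seen_by_a = system.read(A, 0x40)  # B's data is flushed before A proceeds
system.write(A, 0x40, 9)          # B's copy is invalidated
seen_by_b = system.read(B, 0x40)  # B misses and fetches A's flushed value
print(seen_by_a, seen_by_b, system.directory_messages, system.snoop_checks)
```

In this run A sees 7 and B later sees 9, so neither processor reads stale data. The directory sends 3 targeted messages, while the equivalent snoopy bus would have performed 28 tag checks across the eight processors.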
Each scheme has its advantages and disadvantages. Snoopy caches are generally easier to implement, but they require that all memory traffic go over a shared bus. Directory-based caches require extra memory to keep track of the outstanding copies, but they can be used with more sophisticated processor-interconnection networks that provide higher bandwidth and scale to a larger number of processors.
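How much extra memory a directory needs depends on its layout, which the discussion above leaves open. As a rough illustration, the following Python sketch assumes one common layout, a full bit map with one presence bit per processor plus one state bit for every memory block. The function full_map_overhead, the 32-byte block size, and the processor counts are assumptions made for the example.

```python
def full_map_overhead(num_procs, block_bytes, state_bits=1):
    """Directory bits per memory block divided by data bits per block."""
    directory_bits = num_procs + state_bits  # one presence bit per processor
    data_bits = block_bytes * 8
    return directory_bits / data_bits


for procs in (4, 16, 64, 256):
    print(f"{procs:4d} processors: {full_map_overhead(procs, 32):.1%} extra memory")
```

Under these assumptions the directory costs under 7 percent extra memory for 16 processors, but at 256 processors it is larger than the data it describes, which is why large machines tend to need more compact directory layouts.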
Multiprocessor systems have been the subject of research for the past 30 years, but it's only in the last five or 10 years that they have managed to capture a significant portion of the high-end supercomputing market. Now, as multiprocessors make their way into the high-volume PC and workstation businesses, that research will come face-to-face with the real world.