Page Coloring uses the operating system to control the mapping of virtual to physical address by controlling the physical memory page which virtual pages are allocated to. Coloring becomes useful when the page number is involved in cache indexing, i.e., for large, low associativity, first level caches: exactly the trends the industry is following.
Figure 3: Uncolored distribution.
Figure 4: Uncolored allocation.
For these large physically-indexed caches, the use of the low-order
bits of the physical page number (the color) in cache indexing involve
the virtual to physical translation in the location of data in the
cache. While the virtual address space is laid out very uniformly,
the lack of control over the virtual-to-physical mapping by most
operating systems causes the physical color (and therefore cache
placement modulo the page size) to be ``randomly'' selected. Thus,
depending on the order in which the physical pages are assigned, cache
miss rate can vary significantly. Figure 3
shows the distribution of miss rate of 200 executions with different
placements of virtual pages into physical pages. Note the factor of 4
between the best and worst miss rates. The small circles in the
plot (their vertical location is meaningless) indicate the mean of
this distribution, and the miss rate of a virtually-indexed cache
(where the virtual color indexes the cache, rather than the physical
color). The mean of the uncolored allocation is worse than for a
virtually-indexed cache, and the variation is very high, although
sometimes (about 30% of the time in this example) the uncolored cache
out-performs the virtually-indexed cache, as a placement of pages in
the cache superior to the location in virtual memory is happened upon.
Figure 4
shows the cycles-per-instruction (CPI) effect of the cache miss rate
on the overall system. An excess CPI of one due to the cache for a one-CPI
machine gives a total CPI of two, and a machine which is twice as
slow as one with no cache delays (misses). The black circles indicate
6 different page placements for each cache size, and the black line is
the mean of these 6 runs. The dashed line is the performance of a
virtually-indexed cache. A machine with a 64KB cache running the
example benchmark would run at
CPI with a
virtually-indexed cache, but with a mean of
CPI for
uncolored physically-indexed caches, or more than 10% slower.
Sometimes, the benchmarks would run fast, and sometimes as slow as
CPI or 25% slower than expected. Over the entire range of
caches shown, the uncolored physically-indexed cache performs
significantly worse than a virtually-indexed cache, and also have
significant variation, sometimes exceeding an order of magnitude in
performance difference!
Figure 5: Round robin coloring.
For physically indexed caches, coloring can greatly improve the average miss rate, and reduce the variation between runs. Figure 5 demonstrates the results of a round-robin coloring heuristic, which provides good performance and no variance. Note that the dashed line, representing the performance of virtually-indexed caches, is labeled `Perfect Static Coloring' in this graph. Static coloring, where the virtual and physical color are forced to be equal, allows the cache to be indexed virtually for speed, and physically for operating system simplicity. Essentially, the virtual-to-physical translation is not required before cache indexing, yet can proceed in parallel with it. Additionally, the cache miss rate is precisely that for a virtually-indexed cache, as the physical color is forced by the operating system to match the virtual color.
For virtually-indexed cache hardware, static coloring is an appealing operating system paradigm for solving virtually-indexed cache aliasing problems (converting it to a physically-indexed cache), and has been shown to allow hashing and aliasing to co-exist. Hashing is becoming more important as first-level cache sizes increase dramatically. Additionally, static coloring greatly simplifies the use of virtually-indexed caches in multi-processor systems, where physical addresses on the bus can index the processor without reverse address translation (physical to virtual).
Over 30,000 simulations of caches and virtual memory systems have been run to test uncolored (conventional) page allocation. The orthogonal effects several parameters such as line size, associativity, and page size were tested for each of six benchmarks. Several coloring heuristics have been tested, some of which involved potential hardware feedback. Full results were developed for unified, data, and instruction first-level caches.
A paper, ``The Effect of Page Allocation on Caches'' [15], was presented at Micro-25 in December. Another paper, ``Cache Improvements through Colored Page Allocation'' [14], has been submitted to ACM Transactions on Computer Systems. Additionally, a dissertation, ``The Interaction of Virtual Memory and Cache Memory,'' [16] nears publication.
Some of the contributions and conclusions of the research follow.
The desire for low effective-latency memory systems inexorably leads to caches with lower associativity for low latency and larger size for low miss rates. Lower associativity and larger size caches in turn inevitably lead to cache index sizes larger than the virtual-memory-system page size. Coloring requires that the address bits used to address a page and to address a cache have a fixed relationship.
There is a precise taxonomy for page coloring. Perfect coloring always holds; imperfect allows lapses. Static coloring allocates pages based on virtual address; dynamic coloring allocates pages based on run-time allocation history.
For physically-indexed caches, this larger index must mean delay during address translation, or perfect static page coloring to increase the number of effectively untranslated bits available for indexing. To reduce high variation in inter-run miss rates, and therefore overall system performance, page coloring becomes a requirement for physically-indexed cache systems regardless of latency considerations. Page coloring's necessity in such physically-indexed systems regardless of indexing requirements indicates that coloring should be used to also allow early cache indexing. Either perfect static coloring or round robin coloring reduce miss rate variation, but perfect static coloring would also allow virtual cache indexing, reducing overall latency.
Virtually-indexed caches, or perfect static colored physically-indexed caches, do not require page coloring for inter-run miss-rate consistency. However, as cache sizes increase, the compiler's allocation of data to fixed virtual addresses will require hashing to make effective use of the virtually-indexed cache. Perfect static coloring provides a framework within which virtual indexing and hashing can coexist without preventing aliasing. Perfect static coloring also simplifies cache snooping in multiprocessor systems because the first-level cache can be both virtually-indexed from the processor side and physically-indexed from the snooping (bus) side. Perfect static coloring also enhances inclusion in second level caches by providing more effectively untranslated address bits.
The distributions of miss rates for uncolored caches show significant variation. Data caches distributions were fairly wide, while instruction cache distributions had more instances near the mean but the few colorings producing miss rates away from the mean could be far from it. Perfect static page coloring and round robin page coloring both reduce the variation in miss rate to the minimum, i.e., none, and while providing good miss rate performance, did not necessarily provide optimal miss rate performance. For medium sized caches, perfect static coloring provided miss rate results superior to round robin coloring, but for small unified or data caches, or large unified caches, round robin coloring provided better colorings. For systems with reasonabe main memory sizes relative to their cache size, the page faults simulations reported herein showed little if any deterioration in page fault rate to oppose these early indexing and miss-rate consistency advantages.
A well-written operating system should easily implement either dynamic or static page coloring. Page colors must be checked only on page allocation and during any remapping of virtual to physical addresses.
Perfect static coloring provides significant advantages in aliasing and hashing while allowing the low latency of virtually-indexed caches. Multiprocessor systems also benefit through increased second-level inclusion, and physical-indexing of the first level cache for snooping. Considering these varied advantages, the low impact on page fault rate, and the ease of implementation, perfect static page coloring deserves serious consideration in any computer system memory design.