Question

In a recent CACM article [1], the authors present a way to improve the scalability of shared, coherent caches. The core ingredient is assuming the caches are inclusive, that is, higher-level caches (e.g. L3, one cache shared globally) contain all blocks that are stored in their descendant lower-level caches (e.g. L1, one cache per core).

Typically, a higher-level cache is larger than all of its descendant caches combined. For instance, some models of the Intel Core i7 series with four cores have an 8MB shared cache (L3) and 256KB private caches (L2), that is, the shared cache can hold eight times as many blocks as the four private caches together.

This seems to suggest that whenever the shared cache has to evict a block (in order to load a new block), it can find a block that is shared with none of the private caches² (pigeon-hole principle). However, the authors write:

[We] can potentially eliminate all recalls, but only if the associativity, or number of places in which a specific block may be cached, of the shared cache exceeds the aggregate associativity of the private caches. With sufficient associativity, [the shared cache] is guaranteed to find a nonshared block [...]. Without this worst-case associativity, a pathological cluster of misses could lead to a situation in which all blocks in a set of the shared cache are truly shared.

How is this possible? That is, how can, say, 1MB cover 8MB? Clearly I am missing some detail of how such cache hierarchies work. What does "associativity" mean here? The phrase "number of places in which a specific block may be cached" is not clear; the only interpretation I can come up with is that a block can be stored multiple times in each cache, but that would make no sense at all. What would such a "pathological cluster of misses" look like?


  1. Why On-Chip Cache Coherence Is Here to Stay by M. M. K. Martin, M. D. Hill, and D. J. Sorin (2012)
  2. Assuming the shared cache knows which blocks are shared where. This can be achieved with explicit eviction notifications and tracking bits, which is also discussed in [1].

Solution

Perhaps I can shed some light on associativity. These caches aren't just open blocks of memory, so don't think of them as a generic container. Each block of memory has a real address (whether this is physical or virtual doesn't matter; just assume it is a fixed address). At each level of the cache hierarchy, that memory can only be cached in a few very specific locations.

For example, memory address X might be storable in the L1 cache only at location Y0; that is a direct mapping (1-way). If it can also be stored at location Y1, the cache is 2-way associative (which is fairly common). Being storable in N locations makes it N-way associative. When the chip looks up a block or chooses a location to evict, it checks only these N places; it does not scan the entire cache (which would be too slow and complicated).
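
To make that mapping concrete, here is a minimal Python sketch of how an address is narrowed down to its N candidate slots; the block size, set count, and associativity are made-up example parameters, not values from the article or the question.

```python
# Minimal sketch of set-associative placement. All parameters here are
# made-up examples, not values from the article or the i7 in the question.
BLOCK_SIZE = 64    # bytes per cache block
NUM_SETS   = 64    # number of sets in this cache
WAYS       = 2     # 2-way associative: 2 candidate slots per block

def candidate_slots(address):
    """Return the (set, way) slots that may hold the block containing address."""
    block_number = address // BLOCK_SIZE    # which memory block the address is in
    set_index = block_number % NUM_SETS     # the single set it is allowed to use
    return [(set_index, way) for way in range(WAYS)]

# A lookup or eviction only ever inspects these WAYS slots;
# the hardware never scans the whole cache.
print(candidate_slots(0x1234))  # [(8, 0), (8, 1)]
```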

This should clarify what they mean by the shared cache needing a higher associativity than the aggregate of the private caches. If each L1 is 2-way, each L2 is 4-way, and there are 4 cores, the aggregate associativity is 4 × (2 + 4) = 24, so the shared cache must be more than 24-way associative (I don't have the article, but this sounds like what they mean). It then follows that at least one of the blocks in the relevant set of the shared cache cannot be cached in any of the private caches.
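
As a rough illustration of that counting argument (the core count and associativities are the hypothetical numbers above, not figures from the article): in the worst case, every private slot capable of holding a block that maps to one particular shared set is filled with a distinct block, and inclusion forces all of those blocks into that set.

```python
# Rough sketch of the counting argument, using the hypothetical numbers
# above (4 cores, 2-way L1, 4-way L2); they are not taken from the article.
CORES, L1_WAYS, L2_WAYS = 4, 2, 4

# Worst case: every private slot that can hold blocks mapping to one
# particular set of the shared cache holds a distinct block, and inclusion
# forces a copy of each of those blocks into that shared set.
aggregate_private_ways = CORES * (L1_WAYS + L2_WAYS)   # at most 24 shared blocks

for shared_ways in (16, 24, 25, 32):
    spare = shared_ways - aggregate_private_ways
    verdict = "guaranteed" if spare >= 1 else "not guaranteed"
    print(f"{shared_ways}-way shared set: nonshared victim {verdict} "
          f"({max(spare, 0)} slots cannot hold a privately cached block)")
```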

I can't say for sure what their "pathological" case is, but I can explain a common one, one that can affect things like binary search. Say the L1 cache is 1-way (direct-mapped) and 4KB in size; the location where a block can be stored is then likely just a simple mod of the address (X mod 4K). Now say you have a large block of data whose size is a power of 2 (which is common). In a binary search you always halve the size, so a search over a 256K block would have you check addresses 128K, 64K, 32K, 16K, 8K, 4K, and lower. You should see the problem: those first six addresses all map to the exact same cache location, so each access requires an eviction and a load, even though you are touching only a tiny amount of memory.
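
Here is a small sketch of that access pattern, assuming the direct-mapped 4KB cache and the "address mod 4K" placement described above (all numbers are illustrative):

```python
# Sketch of the binary-search conflict pattern, assuming the direct-mapped
# 4KB cache and "slot = address mod 4K" placement described above.
CACHE_SIZE = 4 * 1024   # 4KB, 1-way (direct-mapped)

def cache_slot(address):
    return address % CACHE_SIZE

# First probes of a binary search over a 256K power-of-two-sized block:
probes = [256 * 1024 // 2 ** k for k in range(1, 7)]   # 128K, 64K, ..., 4K
for addr in probes:
    print(f"address {addr:>6} -> slot {cache_slot(addr)}")
# All six probes map to slot 0, so each one evicts the previous probe's
# block even though only a handful of bytes are actually being used.
```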
