Question

Why is the L1 cache smaller than the L2 cache in most processors?


Solution

There are several reasons for this.

L2 exists in the system to speed up the case where there is an L1 cache miss. If L1 were the same size as or larger than L2, then L2 could not hold more cache lines than L1 and would not be able to deal with L1 cache misses. From the design/cost perspective, the L1 cache is bound to the processor and faster than L2. The whole idea of caches is that you speed up access to slower hardware by adding intermediate hardware that is higher-performing (and more expensive) than the slowest hardware, yet cheaper than the fastest hardware you have. Even if you decided to double the L1 cache, you would also increase L2, to speed up L1-cache misses.

So why is there an L2 cache at all? Well, L1 cache is usually higher-performing and more expensive to build, and it is bound to a single core. This means that increasing the L1 size by a fixed amount will have that cost multiplied by 4 in a dual-core processor, or by 8 in a quad-core one. L2 is usually shared by different cores -- depending on the architecture it can be shared between a pair of cores or all the cores in the processor -- so the cost of increasing L2 would be smaller even if the price of L1 and L2 were the same, which it is not.

Other tips

L1 is very tightly coupled to the CPU core and is accessed on every memory access, which is very frequent. Therefore, it has to return the data very fast (usually within a clock cycle). Latency and throughput (bandwidth) are both performance-critical for the L1 data cache (for example, a four-cycle latency, and supporting two reads and one write from the CPU core every clock cycle). It needs a lot of read/write ports to support this high access bandwidth. Building a large cache with these properties is impossible. Therefore, designers keep it small, e.g. 32 KB in most processors today.

L2 is accessed only on L1 misses, so accesses are less frequent (usually 1/20th of those to L1). Therefore, L2 can have higher latency (e.g. 10 to 20 cycles) and fewer ports. This allows designers to make it bigger.
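As a rough illustration, here is a minimal sketch of the average-memory-access-time arithmetic behind this argument. The numbers are illustrative assumptions only (a 4-cycle L1 hit, a 12-cycle L2 hit, about 200 cycles to DRAM, and the ~5% L1 miss rate implied by the 1/20 access ratio above), not figures for any particular CPU:

```c
#include <stdio.h>

int main(void) {
    /* Illustrative assumptions, not measurements of a specific processor. */
    double l1_hit = 4.0, l2_hit = 12.0, dram = 200.0;   /* latencies in cycles */
    double l1_miss_rate = 0.05, l2_miss_rate = 0.20;    /* per-level miss rates */

    /* Every access pays the L1 latency; L1 misses additionally pay L2,
     * and L2 misses additionally pay the trip to DRAM. */
    double amat = l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * dram);

    printf("average access time = %.2f cycles\n", amat);  /* ~6.6 cycles */
    return 0;
}
```

Even with a slow L2 and slower DRAM, the average stays close to the L1 hit time, because the small fast cache absorbs the vast majority of accesses.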


L1 and L2 play very different roles. If L1 is made bigger, it will increase the L1 access latency, which will drastically reduce performance because it makes all dependent loads slower and harder for out-of-order execution to hide. The L1 size is barely debatable.

If we removed L2, L1 misses would have to go to the next level, say memory. This means that a lot of accesses would go to memory, which would imply we need more memory bandwidth, which is already a bottleneck. Therefore, keeping the L2 around is favorable.

Experts often refer to L1 as a latency filter (since it makes the common case of L1 hits faster) and to L2 as a bandwidth filter, since it reduces memory bandwidth consumption.

Note: I have assumed a 2-level cache hierarchy in my argument to keep it simple. In many of today's multicore chips, there is an L3 cache shared between all the cores, while each core has its own private L1 and maybe L2. In these chips, the shared last-level cache (L3) plays the role of memory bandwidth filter. L2 plays the role of on-chip bandwidth filter, i.e. it reduces access to the on-chip interconnect and to L3. This allows designers to use a lower-bandwidth interconnect such as a ring, and a slow single-port L3, which lets them make L3 bigger.

It is perhaps worth mentioning that the number of ports is a very important design point because it affects how much chip area the cache consumes. Ports add wires to the cache, which consume a lot of chip area and power.

@Aater's answer explains some of the basics. I'll add some more details + examples of the real cache organization on Intel Haswell and AMD Piledriver, with latencies and other properties, not just size.

For some details on IvyBridge, see my answer on "How can cache be that fast?", with some discussion of the overall load-use latency including address-calculation time, and widths of the data busses between different levels of cache.


L1 needs to be very fast (latency and throughput), even if that means a limited hit-rate. L1d also needs to support single-byte stores on almost all architectures, and (in some designs) unaligned accesses. This makes it hard to use ECC (error correction codes) to protect the data, and in fact some L1d designs (Intel) just use parity, with better ECC only in outer levels of cache (L2/L3) where the ECC can be done on larger chunks for lower overhead.

It's impossible to design a single level of cache that could provide the low average request latency (averaged over all hits and misses) of a modern multi-level cache. Since modern systems have multiple very hungry cores all sharing a connection to the same relatively-high latency DRAM, this is essential.

Every core needs its own private L1 for speed, but at least the last level of cache is typically shared, so a multi-threaded program that reads the same data from multiple threads doesn't have to go to DRAM for it on each core. (And to act as a backstop for data written by one core and read by another). This requires at least two levels of cache for a sane multi-core system, and is part of the motivation for more than 2 levels in current designs. Modern multi-core x86 CPUs have a fast 2-level cache in each core, and a larger slower cache shared by all cores.

L1 hit-rate is still very important, so L1 caches are not as small / simple / fast as they could be, because that would reduce hit rates. Achieving the same overall performance would thus require higher levels of cache to be faster. If higher levels handle more traffic, their latency is a bigger component of the average latency, and they bottleneck on their throughput more often (or need higher throughput).

High throughput often means being able to handle multiple reads and writes every cycle, i.e. multiple ports. This takes more area and power for the same capacity as a lower-throughput cache, so that's another reason for L1 to stay small.


L1 also uses speed tricks that wouldn't work if it was larger. i.e. most designs use Virtually-Indexed, Physically Tagged (VIPT) L1, but with all the index bits coming from below the page offset so they behave like PIPT (because the low bits of a virtual address are the same as in the physical address). This avoids synonyms / homonyms (false hits or the same data being in the cache twice, and see Paul Clayton's detailed answer on the linked question), but still lets part of the hit/miss check happen in parallel with the TLB lookup. A VIVT cache doesn't have to wait for the TLB, but it has to be invalidated on every change to the page tables.

On x86 (which uses 4kiB virtual memory pages), 32kiB 8-way associative L1 caches are common in modern designs. The 8 tags can be fetched based on the low 12 bits of the virtual address, because those bits are the same in virtual and physical addresses (they're below the page offset for 4kiB pages). This speed-hack for L1 caches only works if they're small enough and associative enough that the index doesn't depend on the TLB result. 32kiB / 64B lines / 8-way associativity = 64 (2^6) sets. So the lowest 6 bits of an address select bytes within a line, and the next 6 bits index a set of 8 tags. This set of 8 tags is fetched in parallel with the TLB lookup, so the tags can be checked in parallel against the physical-page selection bits of the TLB result to determine which (if any) of the 8 ways of the cache hold the data.

Making a larger L1 cache would mean it had to either wait for the TLB result before it could even start fetching tags and loading them into the parallel comparators, or it would have to increase in associativity to keep log2(sets) + log2(line_size) <= 12. (More associativity means more ways per set => fewer total sets = fewer index bits). So e.g. a 64kiB cache would need to be 16-way associative: still 64 sets, but each set has twice as many ways. This makes increasing L1 size beyond the current size prohibitively expensive in terms of power, and probably even latency.
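Here is a small sketch of the index arithmetic described above, using the 32 KiB / 8-way / 64 B-line parameters from the previous paragraphs (the address itself is an arbitrary example):

```c
#include <stdio.h>
#include <stdint.h>

#define LINE_SIZE  64u             /* bytes per cache line */
#define WAYS        8u
#define CACHE_SIZE (32u * 1024u)   /* 32 KiB L1 */
#define SETS       (CACHE_SIZE / (LINE_SIZE * WAYS))   /* = 64 sets */

int main(void) {
    uint64_t vaddr = 0x7ffd12345678u;               /* arbitrary example address */

    unsigned offset = vaddr & (LINE_SIZE - 1);      /* bits [5:0]  : byte within the line */
    unsigned index  = (vaddr >> 6) & (SETS - 1);    /* bits [11:6] : set index */

    /* Bits [11:0] are the page offset for 4 KiB pages, so the set index is
     * identical in the virtual and physical address (VIPT behaves like PIPT).
     * A 64 KiB 8-way cache would have 128 sets and need bit 12, which is
     * translated; keeping 64 sets means going 16-way instead. */
    printf("byte offset = %u, set index = %u\n", offset, index);
    return 0;
}
```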

Spending more of your power budget on L1D cache logic would leave less power available for out-of-order execution, decoding, and of course L2 cache and so on. Getting the whole core to run at 4GHz and sustain ~4 instructions per clock (on high-ILP code) without melting requires a balanced design. See this article: Modern Microprocessors: A 90-Minute Guide!.

The larger a cache is, the more you lose by flushing it, so a large VIVT L1 cache would be worse than the current VIPT-that-works-like-PIPT. And a larger but higher-latency L1D would probably also be worse.

According to @PaulClayton, L1 caches often fetch all the data in a set in parallel with the tags, so it's there ready to be selected once the right tag is detected. The power cost of doing this scales with associativity, so a large highly-associative L1 would be really bad for power-use as well as die-area (and latency). (Compared to L2 and L3, it wouldn't be a lot of area, but physical proximity is important for latency. Speed-of-light propagation delays matter when clock cycles are 1/4 of a nanosecond.)

Slower caches (like L3) can run at a lower voltage / clock speed to make less heat. They can even use different arrangements of transistors for each storage cell, to make memory that's more optimized for power than for high speed.

There are a lot of power-use related reasons for multi-level caches. Power / heat is one of the most important constraints in modern CPU design, because cooling a tiny chip is hard. Everything is a tradeoff between speed and power (and/or die area). Also, many CPUs are powered by batteries or are in data-centres that need extra cooling.


L1 is almost always split into separate instruction and data caches. Instead of an extra read port in a unified L1 to support code-fetch, we can have a separate L1I cache tied to a separate I-TLB. (Modern CPUs often have an L2-TLB, which is a second level of cache for translations that's shared by the L1 I-TLB and D-TLB, NOT a TLB used by the regular L2 cache). This gives us 64kiB total of L1 cache, statically partitioned into code and data caches, for much cheaper (and probably lower latency) than a monster 64k L1 unified cache with the same total throughput. Since there is usually very little overlap between code and data, this is a big win.

L1I can be placed physically close to the code-fetch logic, while L1D can be physically close to the load/store units. Speed-of-light transmission-line delays are a big deal when a clock cycle lasts only 1/3rd of a nanosecond. Routing the wiring is also a big deal: e.g. Intel Broadwell has 13 layers of copper above the silicon.

Split L1 helps a lot with speed, but unified L2 is the best choice. Some workloads have very small code but touch lots of data. It makes sense for higher-level caches to be unified to adapt to different workloads, instead of statically partitioning into code vs. data. (e.g. almost all of L2 will be caching data, not code, while running a big matrix multiply, vs. having a lot of code hot while running a bloated C++ program, or even an efficient implementation of a complicated algorithm (e.g. running gcc)). Code can be copied around as data, not always just loaded from disk into memory with DMA.


Caches also need logic to track outstanding misses (since out-of-order execution means that new requests can keep being generated before the first miss is resolved). Having many misses outstanding means you overlap the latency of the misses, achieving higher throughput. Duplicating the logic and/or statically partitioning between code and data in L2 would not be good.

Larger lower-traffic caches are also a good place to put pre-fetching logic. Hardware pre-fetching enables good performance for things like looping over an array without every piece of code needing software-prefetch instructions. (SW prefetch was important for a while, but HW prefetchers are smarter than they used to be, so that advice in Ulrich Drepper's otherwise excellent What Every Programmer Should Know About Memory is out-of-date for many use cases.)
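For reference, explicit software prefetching looks roughly like the sketch below, using the GCC/Clang __builtin_prefetch builtin; the 16-element lookahead distance is an arbitrary illustrative choice, and on current CPUs the hardware prefetcher usually handles a simple sequential scan like this on its own:

```c
#include <stddef.h>

/* Sum an array while issuing a prefetch hint a fixed distance ahead.
 * The 16-element lookahead is an arbitrary tuning choice for illustration. */
long sum_with_prefetch(const long *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0, 3);   /* 0 = read, 3 = high temporal locality */
        sum += a[i];
    }
    return sum;
}
```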

Low-traffic higher level caches can afford the latency to do clever things like use an adaptive replacement policy instead of the usual LRU. Intel IvyBridge and later CPUs do this, to resist access patterns that get no cache hits for a working set just slightly too large to fit in cache. (e.g. looping over some data in the same direction twice means it probably gets evicted just before it would be reused.)
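A toy simulation of that access pattern makes the point. It assumes, for simplicity, a fully-associative 8-line cache with strict LRU (real caches are set-associative), and shows that scanning a working set just one line too large, twice in the same direction, gets zero hits:

```c
#include <stdio.h>
#include <string.h>

#define CACHE_LINES 8
#define WORKING_SET (CACHE_LINES + 1)   /* one line too big to fit */

int main(void) {
    int cache[CACHE_LINES], age[CACHE_LINES];
    memset(cache, -1, sizeof cache);    /* start with an empty cache */
    memset(age, 0, sizeof age);
    int hits = 0, accesses = 0;

    for (int pass = 0; pass < 2; pass++) {              /* scan the data twice */
        for (int line = 0; line < WORKING_SET; line++) {
            accesses++;
            int hit = -1, lru = 0;
            for (int i = 0; i < CACHE_LINES; i++) {
                age[i]++;                               /* everyone gets older */
                if (cache[i] == line) hit = i;
                if (age[i] > age[lru]) lru = i;         /* oldest entry = LRU victim */
            }
            if (hit >= 0) { hits++; age[hit] = 0; }
            else          { cache[lru] = line; age[lru] = 0; }
        }
    }
    printf("%d hits out of %d accesses\n", hits, accesses);   /* prints 0 hits */
    return 0;
}
```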


A real example: Intel Haswell. Sources: David Kanter's microarchitecture analysis and Agner Fog's testing results (microarch pdf). See also Intel's optimization manuals (links in the tag wiki).

Also, I wrote up a separate answer on: Which cache mapping technique is used in intel core i7 processor?

Modern Intel designs use a large inclusive L3 cache shared by all cores as a backstop for cache-coherence traffic. It's physically distributed between the cores, with 2048 sets * 16-way (2MiB) per core (with an adaptive replacement policy in IvyBridge and later).

The lower levels of cache are per-core.

  • L1: per-core 32kiB each instruction and data (split), 8-way associative. Latency = 4 cycles. At least 2 read ports + 1 write port. (Maybe even more ports to handle traffic between L1 and L2, or maybe receiving a cache line from L2 conflicts with retiring a store.) Can track 10 outstanding cache misses (10 fill buffers).
  • L2: unified per-core 256kiB, 8-way associative. Latency = 11 or 12 cycles. Read bandwidth: 64 bytes / cycle. The main prefetching logic prefetches into L2. Can track 16 outstanding misses. Can supply 64B per cycle to the L1I or L1D. Actual port counts unknown.
  • L3: unified, shared (by all cores) 8MiB (for a quad-core i7). Inclusive (of all the L2 and L1 per-core caches). 12 or 16 way associative. Latency = 34 cycles. Acts as a backstop for cache-coherency, so modified shared data doesn't have to go out to main memory and back.

Another real example: AMD Piledriver: (e.g. Opteron and desktop FX CPUs.) Cache-line size is still 64B, like Intel and AMD have used for several years now. Text mostly copied from Agner Fog's microarch pdf, with additional info from some slides I found, and more details on the write-through L1 + 4k write-combining cache on Agner's blog, with a comment that only L1 is WT, not L2.

  • L1I: 64 kB, 2-way, shared between a pair of cores (AMD's version of SMT has more static partitioning than Hyperthreading, and they call each one a core. Each pair shares a vector / FPU unit, and other pipeline resources.)
  • L1D: 16 kB, 4-way, per core. Latency = 3-4 c. (Notice that all 12 bits below the page offset are still used for index, so the usual VIPT trick works.) (throughput: two operations per clock, up to one of them being a store). Policy = Write-Through, with a 4k write-combining cache.
  • L2: 2 MB, 16-way, shared between two cores. Latency = 20 clocks. Read throughput 1 per 4 clocks. Write throughput 1 per 12 clocks.
  • L3: 0 - 8 MB, 64-way, shared between all cores. Latency = 87 clocks. Read throughput 1 per 15 clocks. Write throughput 1 per 21 clocks.

Agner Fog reports that with both cores of a pair active, L1 throughput is lower than when the other half of a pair is idle. It's not known what's going on, since the L1 caches are supposed to be separate for each core.

For those interested in this type of question, my university recommends Computer Architecture: A Quantitative Approach and Computer Organization and Design: The Hardware/Software Interface. Of course, if you don't have time for this, a quick overview is available on Wikipedia.

I think the main reason for this is that the L1 cache is faster and therefore more expensive.

The other answers here give specific and technical reasons why L1 and L2 are sized as they are, and while many of them are motivating considerations for particular architectures, they aren't really necessary: the underlying architectural pressure leading to increasing (private) cache sizes as you move away from the core is fairly universal and is the same as the reasoning for multiple caches in the first place.

The three basic facts are:

  1. The memory accesses for most applications exhibit a high degree of temporal locality, with a non-uniform distribution.
  2. Across a large variety of process technologies and designs, cache size and cache speed (latency and throughput) can be traded off against each other1.
  3. Each distinct level of cache involves incremental design and performance cost.

So at a basic level, you might be able to, say, double the size of the cache, but incur a latency penalty of 1.4x compared to the smaller cache.

So it becomes an optimization problem: how many caches should you have, and how large should they be? If memory access were totally uniform within the working set size, you'd probably end up with a single fairly large cache, or no cache at all. However, access is strongly non-uniform, so a small-and-fast cache can capture a large number of accesses, disproportionate to its size.

If fact 2 didn't exist, you'd just create a very big, very fast L1 cache within the other constraints of your chip and not need any other cache levels.

If fact 3 didn't exist, you'd end up with a huge number of fine-grained "caches", faster and smaller at the center, and slower and larger outside, or perhaps a single cache with variable access times: faster for the parts closest to the core. In practice, fact 3 means that each level of cache has an additional cost, so you usually end up with a few quantized levels of cache2. A back-of-the-envelope sketch of the resulting optimization is shown below.
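The numbers in this sketch are invented purely for illustration (a small cache that hits 90% of accesses in 4 cycles, a cache 8x larger that hits 97% but pays the 1.4x-per-doubling latency penalty from above, and 200-cycle memory); under those assumptions a small-plus-large hierarchy beats a single large cache:

```c
#include <stdio.h>

int main(void) {
    /* Invented illustrative numbers, not measurements of any real design. */
    double small_lat = 4.0,  small_hit = 0.90;     /* small, fast cache */
    double big_lat   = 4.0 * 1.4 * 1.4 * 1.4;      /* 8x larger: ~11 cycles */
    double big_hit   = 0.97;
    double mem       = 200.0;

    /* Option A: a single large cache. */
    double single = big_lat + (1.0 - big_hit) * mem;

    /* Option B: the small cache backed by the large one. Of the accesses that
     * miss the small cache, (1 - 0.97) / (1 - 0.90) = 30% also miss the large one. */
    double big_local_miss = (1.0 - big_hit) / (1.0 - small_hit);
    double two_level = small_lat
                     + (1.0 - small_hit) * (big_lat + big_local_miss * mem);

    printf("single large cache : %.1f cycles on average\n", single);     /* ~17.0 */
    printf("small + large      : %.1f cycles on average\n", two_level);  /* ~11.1 */
    return 0;
}
```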

Other Constraints

This gives a basic framework to understand cache count and cache sizing decisions, but there are secondary factors at work as well. For example, Intel x86 has 4K page sizes and their L1 caches use a VIPT architecture. VIPT means that the size of the cache divided by the number of ways cannot be larger3 than 4 KiB. So an 8-way L1 cache as used on the half dozen Intel designs can be at most 4 KiB * 8 = 32 KiB. It is probably no coincidence that that's exactly the size of the L1 cache on those designs! If it weren't for this constraint, it is entirely possible you'd have seen lower-associativity and/or larger L1 caches (e.g., 64 KiB, 4-way).
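Stated as a formula (a trivial sketch; the 4 KiB page size and the way counts are just the common x86 values), the largest L1 that can be indexed purely by page-offset bits is page_size * ways:

```c
/* cache_size / ways <= page_size  =>  max cache_size = page_size * ways.
 * With 4 KiB pages: 4-way -> 16 KiB, 8-way -> 32 KiB, 16-way -> 64 KiB. */
static unsigned max_vipt_l1_bytes(unsigned page_size, unsigned ways) {
    return page_size * ways;
}
```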


1 Of course, there are other factors involved in the tradeoff as well, such as area and power, but holding those factors constant the size-speed tradeoff applies, and even if not held constant the basic behavior is the same.

2 In addition to this pressure, there is a scheduling benefit to known-latency caches, like most L1 designs: an out-of-order scheduler can optimistically submit operations that depend on a memory load on the cycle that the L1 cache would return data, reading the result off the bypass network. This reduces contention and perhaps shaves a cycle of latency off the critical path. This puts some pressure on the innermost cache level to have uniform/predictable latency, and probably results in fewer cache levels.

3 In principle, you can use VIPT caches without this restriction, but only by requiring OS support (e.g., page coloring) or with other constraints. The x86 arch hasn't done that and probably can't start now.

Logically, the question answers itself.

If L1 were bigger than L2 (combined), then there would be no need for an L2 cache.

Why would you store your stuff on a tape drive if you can store all of it on an HDD?

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow