We’ve gotten used to measuring CPU cache performance parameters and RAM latencies, so why not do the same for GPUs? Like CPUs, GPUs have evolved multi-level cache hierarchies to address the growing gap between compute performance (the GPU) and memory (the VRAM), and just like on CPUs, we can use pointer chasing (in OpenCL) to measure a graphics card’s cache and VRAM latency.
VRAM Latency on Ampere and RDNA 2
The caches in AMD’s RDNA 2 graphics cards are very fast, and there are a lot of them. Compared to Ampere, latency is lower at every level; the Infinity Cache adds only around 20 ns on top of the L2 and still has lower latency than Ampere’s L2. Surprisingly, RDNA 2’s VRAM latency is roughly the same as NVIDIA Ampere’s, even though RDNA 2 checks two additional cache levels on the way to memory.
In contrast, NVIDIA sticks with a more conventional memory subsystem: only two levels of cache, and high latency on the L2. Going from the L1 private to each Ampere SM out to the L2 takes more than 100 ns, whereas RDNA 2’s L2 sits around 66 ns away from the L0, even with an L1 cache in between. Traversing the huge GA102 die appears to cost Ampere GPUs many cycles, penalizing their performance.
This could explain the excellent performance that AMD’s RDNA 2 graphics cards deliver at lower resolutions. RDNA 2’s low-latency L2 and L3 caches give it an advantage with smaller workloads, where occupancy is too low to hide latency. Ampere chips, by comparison, need more parallelism before they can shine.
If we compare CPU and GPU, we see a massacre
CPUs are designed to run serial workloads as fast as possible, while GPUs are designed to run massive workloads in parallel. Since the test is written in OpenCL, we can run it unmodified on a CPU to see how the two compare.
The example above uses a Haswell processor, whose cache and DRAM latencies are so low that a logarithmic scale was needed; otherwise its line would sit flat, well below the RDNA 2 figures. A Core i7-4770 with DDR3-1600 CL9 can do a full memory round trip in just 63 ns, while a Radeon RX 6900 XT with GDDR6 takes 226 ns to do the same, over 3.5 times longer.
From another perspective, though, the latency of the GDDR6 VRAM itself is not that bad. A CPU or GPU has to check its caches before going out to memory, so we can get a more ‘raw’ view of memory latency by comparing a load that hits the last-level cache with one that misses it and goes all the way to memory. That delta is 53.42 ns on Haswell and 123.2 ns on RDNA 2.
What about previous generation GPUs?
The Maxwell and Pascal architectures are very similar, and a GTX 980 Ti likely suffers from its larger die and lower clock speeds, so data takes longer to physically cross the chip. NVIDIA does not expose the L1 texture cache to OpenCL on any of these architectures, so unfortunately the first level visible in the results is the L2.
Turing begins to look more like Ampere: relatively low L1 latency, then the L2, and finally memory. Its L2 latency seems roughly in line with Pascal’s, while raw memory latency looks similar up to 32 MB and then climbs higher.
As for AMD, there is no clear explanation for the latency being so low below 32 KB. AMD says Terascale has an 8 KB L1 data cache, but the results don’t match; the test may be hitting some kind of vertex reuse cache (since memory loads are compiled into vertex fetch clauses).
GCN and RDNA 2 look as expected, and it is quite interesting to see AMD’s latency drop at every level from one generation to the next.