The second major change for this year's model is the move to a multi-die package. It's essentially the same chiplet approach AMD used in its recent Zen 2 and Zen 3 CPUs, applied to GPUs with some additional enhancements. The two CDNA 2 dies are linked by Infinity Fabric, with 25 Gbps links connecting the GPUs at up to 100 GBps of bidirectional bandwidth per link. There are eight links in the MI200 OAM (OCP Accelerator Module), which delivers 800 GBps of bandwidth between the two chips. In case you're wondering, that's a huge increase. Infinity Fabric on Zen 3 CPUs typically runs at the RAM clock of 1600 MHz, with high-end kits reaching 2000 MHz. At 16 bytes per transfer with double data rate, that's only 51.2 GBps of bidirectional bandwidth, which means MI200 has roughly 16 times the interlink bandwidth. The bridge technology is essentially AMD's answer to Intel's EMIB. MI200 also uses TSMC's N6 node, an improved version of N7. N6 is a relatively minor evolution in process technology, but it enables better clocks and efficiency. It's compatible with N7 design rules, so it's much easier for a company like AMD to port a design from N7 to N6. One surprising piece of information: the whole MI200 package has 58 billion transistors. That sounds like a lot, but Nvidia's A100 (the benchmark single GPU) has 54.2 billion transistors. Unless we have something wrong, the combined MI200 dies are about the same size as Nvidia's A100, yet AMD packs far more compute performance into that area. Per the specs released by AMD, MI200 clocks at 1.7 GHz, compared to 1.5 GHz on the MI100. The memory was upgraded to HBM2e running at 3.2 Gbps, and, combined with the dual-chiplet GPU layout, total bandwidth for MI200 increased from 1.2 TBps to 3.2 TBps. But that's just the start.
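The bandwidth figures above are easy to verify with a bit of arithmetic. Here's a quick sketch (just a back-of-the-envelope check using the numbers quoted in this article, not any official AMD tooling) that recomputes the Zen 3 Infinity Fabric bandwidth and compares it to the MI200's die-to-die link:

```python
# Zen 3 Infinity Fabric: runs at the RAM clock (typically 1600 MHz),
# moving 16 bytes per transfer at double data rate.
fclk_hz = 1600e6
bytes_per_transfer = 16
ddr = 2
cpu_if_gbps = fclk_hz * bytes_per_transfer * ddr / 1e9  # GB/s
print(f"Zen 3 Infinity Fabric: {cpu_if_gbps:.1f} GBps")  # 51.2 GBps

# MI200 OAM: eight Infinity Fabric links between the two dies,
# each good for up to 100 GBps of bidirectional bandwidth.
links = 8
per_link_gbps = 100
mi200_gbps = links * per_link_gbps
print(f"MI200 die-to-die: {mi200_gbps} GBps")  # 800 GBps

# 800 / 51.2 comes out to ~15.6x, i.e. the "roughly 16 times" figure.
print(f"Ratio: {mi200_gbps / cpu_if_gbps:.1f}x")
```

The math confirms the article's claim: 800 GBps against 51.2 GBps is a factor of about 15.6, which rounds to the quoted 16x.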
While many of MI200's core functional units are similar to MI100's, with the vector FP32 rate and the matrix FP16 and FP32 rates unchanged per clock, FP64 support has been expanded: both the vector and matrix units now support FP64, and the vector FP64 rate has been doubled. When it launched, the MI100 was the first GPU to provide more than 10 TFLOPS of FP64 compute. Thanks to the higher clock, dual GPU dies, and doubled FP64 rate, the MI200 has a vector rate of 47.9 TFLOPS, and AMD is quick to point out that this represents a 4.9X increase over the Nvidia A100's FP64 vector rate. MI200 also adds FP64 support to its matrix units, with a peak rate that is double the vector unit rate: 95.7 TFLOPS. For comparison, the Nvidia A100's FP64 matrix (tensor) performance is 19.5 TFLOPS. That's on paper, of course, so we'll have to see how it translates to real life. AMD claims performance around three times faster than the A100 in several jobs, though it's difficult to say whether that holds outside those specific workloads. On the FP16 side, the lead isn't quite as large. Nvidia's A100 delivers 312 TFLOPS of FP16/BF16 compute, versus 383 TFLOPS for the MI200, but Nvidia also has sparsity. Basically, the GPU can skip some operations, e.g. multiplications by zero (which, as my math teacher taught me, always equal zero). The A100's throughput can be doubled with sparsity, so in workloads where sparsity applies, Nvidia maintains the lead. A few key pieces of information are still missing, such as power requirements. The Nvidia A100 has a TDP of 400W for the SXM variant, a direct competitor for the MI250. Rumors say the MI250 OAM could have a TDP of up to 550W. For connectivity, Nvidia uses NVLink and AMD uses its Infinity Fabric; the OAM form factor likely supports at least six-way configurations. The image above comes from AMD's slide deck and shows what appears to be a single node of Frontier, the supercomputer at Oak Ridge.
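The TFLOPS claims can also be sanity-checked with the standard peak-rate formula: compute units x FP64 lanes per CU x 2 ops per FMA x clock. The 220 CU and 64-lane figures below are AMD's published MI250X specifications rather than numbers stated in the paragraph above, so treat this as an illustrative cross-check:

```python
# Peak-rate cross-check using AMD's published MI250X figures
# (220 CUs across both dies, 64 FP64 lanes per CU) and the
# 1.7 GHz clock mentioned earlier.
cus, lanes, ops_per_fma, clock_ghz = 220, 64, 2, 1.7

fp64_vector = cus * lanes * ops_per_fma * clock_ghz / 1000  # TFLOPS
fp64_matrix = fp64_vector * 2  # matrix rate is double the vector rate

print(f"FP64 vector: {fp64_vector:.1f} TFLOPS")  # ~47.9
print(f"FP64 matrix: {fp64_matrix:.1f} TFLOPS")  # ~95.7

# Versus the A100's 9.7 TFLOPS FP64 vector rate:
print(f"vs A100: {fp64_vector / 9.7:.1f}x")  # ~4.9x
```

The formula lands on 47.9 and 95.7 TFLOPS, matching AMD's quoted numbers, and dividing by the A100's 9.7 TFLOPS FP64 vector rate reproduces the 4.9X claim.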
Assuming the picture is accurate, there will be six MI200 GPUs paired with dual EPYC CPUs. ORNL director Thomas Zacharia says a single MI200 GPU provides more compute performance than an entire node in the previous Summit supercomputer. Frontier is currently being installed and will be available to researchers starting next year. AMD currently has two models planned for the MI200 OAM. The higher performance model, the MI250X, which we've used for most of this discussion, has 110 compute units per die, while the slightly trimmed MI250 has 104 per die. That is the only real change, so the MI250 gets about 5% less compute performance. A PCIe version of the MI200 is also planned for the future. There's still plenty of information to digest from AMD's Accelerated Data Center Premiere keynote, and we've covered the EPYC Genoa and Bergamo CPUs elsewhere. Even if Intel's Alder Lake CPUs provide stiff competition for AMD's existing consumer line, AMD's data center offerings still look very potent. The full slide deck from the MI200 section is shown in the gallery below.
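The roughly 5% gap between the two OAM models follows directly from the compute unit counts. A minimal sketch, assuming the CU counts are the only spec difference as stated above:

```python
# MI250X vs MI250: 110 vs 104 CUs per die (220 vs 208 total).
mi250x_cus = 110 * 2
mi250_cus = 104 * 2

deficit = 1 - mi250_cus / mi250x_cus
print(f"MI250 compute deficit: {deficit:.1%}")  # ~5.5%
```

The exact figure is about 5.5%, in line with the "about 5% less compute performance" quoted above.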