Challenges to Closing the AI Memory Gap
By Steven Woo, Fellow & Distinguished Inventor, Rambus Inc.
In the race to accelerate AI application performance, several critical challenges await chip and system designers as the semiconductor industry strives to meet the demand for more performance and better power efficiency. Among the most critical of these challenges are memory system and data movement bottlenecks that serve as a stark reminder of the processor-memory gap that has impacted processor design for the last few decades. Utilizing a variety of techniques, specialized AI processors have demonstrated tremendous improvements in performance and power efficiency on these workloads. But they have also exacerbated the decades-old processor-memory gap, creating an “AI-memory gap” that threatens the continued progress of AI silicon.
The Roofline Model is a modern tool that helps hardware and software developers understand how their code is working in terms of data movement and computational demands. The model also reveals useful insights into hardware bottlenecks by assessing the code performance in relation to peak theoretical performance. Analysis using the Roofline Model can pinpoint when an application is bound by limitations on memory bandwidth, compute resources, or other factors.
Figure 1. The Roofline Model plots application performance in relation to available system resources.
Figure 1 shows a Roofline Model, where the Y axis represents performance as measured in operations per second, and the X axis represents arithmetic intensity, or the number of operations performed per byte of data moved from memory to the processor. The Roofline itself consists of two parts: one sloped and one flat. The sloped part is set by the memory bandwidth available to the processor, and the flat part by its computational resources (e.g., number of compute engines). Together, these two segments define the peak performance achievable by the processor when running applications that are limited by memory bandwidth and those that are limited by computational resources. Different applications can be plotted against a processor’s Roofline, with their position depending on how much their performance depends on memory bandwidth or on compute resources. Applications nearer to the sloped part of the Roofline are limited more by memory bandwidth, and those nearer to the flat part are limited more by the available compute resources.
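The two segments of the Roofline can be written as a single expression: attainable performance is the smaller of the compute roof and the product of memory bandwidth and arithmetic intensity. The following Python sketch illustrates this. The processor figures used here (100 trillion operations per second and 256GByte/s) are hypothetical values chosen for illustration and do not describe any specific device.

```python
def attainable_performance(arithmetic_intensity, peak_ops_per_s, bandwidth_bytes_per_s):
    """Upper bound on performance (ops/s) at a given arithmetic intensity (ops/byte)."""
    # Sloped segment: performance limited by how fast data can be delivered.
    bandwidth_roof = bandwidth_bytes_per_s * arithmetic_intensity
    # Flat segment: performance limited by the available compute engines.
    return min(peak_ops_per_s, bandwidth_roof)


def ridge_point(peak_ops_per_s, bandwidth_bytes_per_s):
    """Arithmetic intensity (ops/byte) at which the sloped and flat segments meet."""
    return peak_ops_per_s / bandwidth_bytes_per_s


# Hypothetical processor, for illustration only.
PEAK_OPS_PER_S = 100e12          # 100 trillion operations per second
BANDWIDTH_BYTES_PER_S = 256e9    # 256 GByte/s

print("ridge point (ops/byte):", ridge_point(PEAK_OPS_PER_S, BANDWIDTH_BYTES_PER_S))
for intensity in (10, 100, 1000):
    bound = attainable_performance(intensity, PEAK_OPS_PER_S, BANDWIDTH_BYTES_PER_S)
    print(f"intensity {intensity:>4} ops/byte -> at most {bound:.3g} ops/s")
```

Under these assumed numbers, workloads below roughly 390 operations per byte fall on the sloped segment, and anything above that is capped by the flat segment.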
Figure 2. Mapping applications against a roofline shows when workloads are limited by memory bandwidth or compute resources.
Figure 2 illustrates how different applications can map to a processor’s roofline. In this figure, Application 3 lies close to the flat part of the roofline, indicating that it is limited more by the available compute resources in the processor. This application could likely achieve even higher performance if more compute resources were available. Application 1 lies closer to the sloped part of the roofline, indicating that it is limited more by lack of memory bandwidth. This application could likely achieve higher performance if more memory bandwidth were available to the processor. And finally, Application 2 lies close to both the sloped part of the roofline and the flat part of the roofline, indicating that it is limited by both memory bandwidth and compute resources. It has been shown that for specialized AI processors, many applications are sitting closer to the sloped part of the curve, indicating that lack of memory bandwidth is increasingly limiting AI application performance. The growing deficiency in memory bandwidth compared to computation performance is resulting in an AI memory gap.
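As a rough illustration of this mapping, the Python sketch below classifies three workloads by comparing their arithmetic intensity with the point where the sloped and flat segments meet, and then checks what doubling memory bandwidth would do to each bound. The processor figures, the arithmetic intensities assigned to Applications 1 to 3, and the 25% tolerance used to call a workload "balanced" are all assumptions made for this example; they are not taken from Figure 2.

```python
def roofline_bound(intensity, peak_ops_per_s, bandwidth_bytes_per_s):
    """Attainable performance (ops/s): the lower of the compute and bandwidth roofs."""
    return min(peak_ops_per_s, bandwidth_bytes_per_s * intensity)


def classify(intensity, peak_ops_per_s, bandwidth_bytes_per_s, tolerance=0.25):
    """Label a workload by where its arithmetic intensity sits relative to the ridge."""
    ridge = peak_ops_per_s / bandwidth_bytes_per_s
    if abs(intensity - ridge) <= tolerance * ridge:
        return "balanced"          # close to both segments
    return "memory-bound" if intensity < ridge else "compute-bound"


# Hypothetical processor and workloads, for illustration only.
PEAK = 100e12        # operations per second
BANDWIDTH = 256e9    # bytes per second
APPLICATIONS = {"Application 1": 50, "Application 2": 400, "Application 3": 2000}

labels = {}
speedups = {}
for name, intensity in APPLICATIONS.items():
    labels[name] = classify(intensity, PEAK, BANDWIDTH)
    before = roofline_bound(intensity, PEAK, BANDWIDTH)
    after = roofline_bound(intensity, PEAK, 2 * BANDWIDTH)
    speedups[name] = after / before
    print(f"{name}: {labels[name]}, bound grows {speedups[name]:.2f}x with 2x bandwidth")
```

With these assumed values, only the memory-bound workload sees its bound double when bandwidth doubles; the compute-bound workload gains nothing, which is the behavior the roofline predicts.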
Memory-Bandwidth Challenges: Autonomous Driving
Closing this memory gap is extremely important to enable future AI and machine-learning applications in the cloud and at the edge, if future generations of services are to deliver the response times users will expect and need. The challenges presented by autonomous driving, as systems and services evolve quickly from advanced driver-assistance modes to fully self-driving vehicles, offer a complex and demanding example.
The services emerging to support higher levels of autonomous driving are bringing together intelligent systems from the cloud to the edge. In-vehicle hardware and software platforms must quickly execute intensive calculations and implement decisions in near real-time. Applications in the cloud will handle data collection and management, analytics, and the generation of new and updated services including the training of new neural networks to be deployed over the air. As sensors in and around autonomous vehicles gather more and more data, interactions between the cloud, the edge, and autonomous vehicles will grow, with data-management, analysis, simulation, and deep learning taking place in the cloud, and low-latency computation performed on intelligent edge platforms and in the cars themselves. Micron estimates that SAE Level 3 and 4 autonomous-driving systems will demand 512GB/s – 1024GB/s of memory bandwidth, with fully autonomous vehicles projected to require even higher levels.
Closing the Gap
Memory systems are struggling to keep up with the ever-increasing demands of modern processors. While GPU throughput has increased by a factor of about 32 in the past nine years, memory system bandwidth has only increased 13-fold, causing the processor-memory gap to grow. To restore balance between processing and data-delivery speeds, and thus improve application performance and efficiency, a much larger increase in memory performance is needed. Two newer memory technologies that can be considered to meet these challenges in AI and high-performance systems are high-bandwidth memory (HBM) and the latest-generation graphics memory, GDDR6.
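The size of the gap follows directly from the two growth factors quoted above. The short Python sketch below computes how much faster compute has grown than bandwidth over the period, along with the equivalent yearly growth rates. The yearly rates assume smooth compound growth over the nine years, which is a simplification introduced here and not a claim from the underlying data.

```python
YEARS = 9
GPU_THROUGHPUT_GROWTH = 32.0     # factor quoted in the text
MEMORY_BANDWIDTH_GROWTH = 13.0   # factor quoted in the text


def annual_rate(total_growth, years):
    """Equivalent yearly growth factor, assuming smooth compounding (an assumption)."""
    return total_growth ** (1.0 / years)


# How much further compute has pulled ahead of memory bandwidth.
gap_growth = GPU_THROUGHPUT_GROWTH / MEMORY_BANDWIDTH_GROWTH
gpu_rate = annual_rate(GPU_THROUGHPUT_GROWTH, YEARS)
memory_rate = annual_rate(MEMORY_BANDWIDTH_GROWTH, YEARS)

print(f"compute-to-bandwidth ratio grew by {gap_growth:.2f}x")
print(f"GPU throughput: about {100 * (gpu_rate - 1):.0f}% per year")
print(f"memory bandwidth: about {100 * (memory_rate - 1):.0f}% per year")
```

In other words, each byte of bandwidth now has to feed roughly 2.5 times as much compute as it did nine years earlier.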
HBM improves memory performance through die stacking and tighter integration: the stacked DRAM dies are connected with through-silicon vias (TSVs), and the stack is placed close to the processor to reduce the distance between components. Second-generation HBM2 memories can contain up to eight DRAM dies per stack, resulting in memory density of 8-16GB per package. A wide 1024-bit interface and data-transfer speeds of up to 2GT/s result in memory bandwidth of 256GByte/s per DRAM device. HBM offers extremely good power efficiency compared to other memory technologies.
While solutions based on HBM can achieve very high performance and power efficiency, this comes at the expense of higher cost and increased design complexity. Stacking components is more complicated than placing them directly onto PCBs, the more familiar solution available to the industry. Additional components such as an interposer and an additional substrate, which can be seen in Figure 3, are needed to physically interconnect HBM DRAMs with the processor and the rest of the system. This calls for newer and more expensive fabrication and assembly methods.
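The 256GByte/s figure is the product of interface width and transfer rate, divided by eight to convert bits to bytes. A minimal Python sketch of that arithmetic follows; the function name is illustrative, and the result is a peak theoretical figure that ignores controller and protocol overheads.

```python
def peak_bandwidth_gbyte_per_s(interface_width_bits, transfer_rate_gt_per_s):
    """Peak bandwidth in GByte/s: bits per transfer x giga-transfers per second / 8."""
    return interface_width_bits * transfer_rate_gt_per_s / 8


# HBM2 figures from the text: 1024-bit interface at up to 2GT/s.
hbm2_bandwidth = peak_bandwidth_gbyte_per_s(1024, 2)
print(f"HBM2: {hbm2_bandwidth:.0f} GByte/s per device")  # prints "HBM2: 256 GByte/s per device"
```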
Figure 3. HBM delivers high bandwidth and good power-efficiency in a compact form factor using newer stacking and assembly technologies that increase cost.
Figure 4. GDDR6 memory leveraging standard PCB fabrication and PHY IP core as hard macro.
Alternatively, GDDR6 delivers a good trade-off between bandwidth, power efficiency, cost, and reliability, which are all key concerns in the automotive space. Performance is more than five times that of DDR4, and GDDR6 chips are compatible with standard PCB fabrication, as can be seen in Figure 4.
GDDR6 is the successor to GDDR5 and has a maximum data rate of up to 16Gbit/s per pin, which is twice that of GDDR5, while operating at the same 1.35V external voltage. The interface supports two independent 16-bit channels, resulting in a data width of 32 bits and I/O bandwidth of 64GByte/s per DRAM. As shown in Figure 5, only four GDDR6 chips are needed to achieve system memory bandwidth of 256GByte/s, equivalent to that of a single HBM2 device.
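The same arithmetic gives a quick estimate of how many devices a bandwidth target requires. The Python sketch below uses the per-device peak figures from the text and applies them to the 256GByte/s example and to the 512GByte/s to 1024GByte/s range cited earlier for Level 3 and 4 systems. It is a back-of-the-envelope estimate only: it assumes peak bandwidth is fully usable and ignores capacity, PHY count, and board-level constraints.

```python
import math

# Peak per-device bandwidth in GByte/s, from the figures in the text.
GDDR6_GBYTE_PER_S = 32 * 16 / 8    # 32-bit interface at 16Gbit/s per pin -> 64.0
HBM2_GBYTE_PER_S = 1024 * 2 / 8    # 1024-bit interface at 2GT/s -> 256.0


def devices_needed(target_gbyte_per_s, per_device_gbyte_per_s):
    """Smallest whole number of devices whose combined peak bandwidth meets the target."""
    return math.ceil(target_gbyte_per_s / per_device_gbyte_per_s)


for target in (256, 512, 1024):
    gddr6_count = devices_needed(target, GDDR6_GBYTE_PER_S)
    hbm2_count = devices_needed(target, HBM2_GBYTE_PER_S)
    print(f"{target} GByte/s: {gddr6_count} GDDR6 chips or {hbm2_count} HBM2 devices")
```

For the 256GByte/s target this reproduces the four-chip GDDR6 configuration described above.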
Figure 5. 256GByte/s bandwidth can be achieved with only four GDDR6 chips.
Rambus offers both a JEDEC-compliant GDDR6 PHY IP core and a JEDEC-compliant HBM2 PHY IP core, which are delivered as hard macros ready for SoC integration. These PHYs enable processors to achieve high performance and power-efficiency at different cost points and implementation complexities that fit a range of product needs.
Autonomous driving and driver-assistance systems are just one of the areas in which AI is being deployed to analyze large amounts of data in near real-time. More generally, there is clear and growing demand for faster application performance to support a range of newer smart services needed to run homes, offices, factories, infrastructure, and cities.
The roofline model, developed to illustrate the limitations of applications running on different processors, shows that memory bandwidth is often the limiting factor in AI applications. Closing the gap between processor performance and memory performance is vital in areas like AI and high-performance computing (HPC), requiring newer memory types that achieve much higher performance than mainstream DDR4 memory. HBM2 and GDDR6 are more modern technologies capable of much higher bandwidths, helping to close the performance gap. To choose one or the other, system designers must evaluate trade-offs between component count, power consumption, PCB-fabrication complexity, and cost to determine which solution is best for their application.