The Roofline Constraint: Why Compute Throughput Alone Cannot Fix HBM3E Bottlenecks
The most fundamental cause of HBM3E bottlenecks in AI accelerator chips is the persistent imbalance between peak compute throughput and the memory bandwidth available to feed that compute. When an AI model’s arithmetic intensity falls below the accelerator’s “balance point” (the ratio of effective peak compute performance to effective memory bandwidth), the HBM interface becomes the limiting resource, regardless of how many compute cores are provisioned. This is formally captured by the Roofline model and is explicitly identified as the primary bottleneck type for many AI models in a 2025 patent from South China University of Technology on AI accelerator performance fault diagnosis.
This mismatch is exacerbated at large model scale. A 2025 patent from Shencun Technology (Wuxi) Ltd. on large model computation accelerator chip architecture notes that models at the scale of hundreds of billions of parameters, such as Llama3-405B and DeepSeek-V3-671B, must rely entirely on HBM for storage. Integrating high-density HBM onto a single accelerator imposes significant process-technology constraints that drive up cost. While the H200/H20 141 GB HBM variants partially alleviate capacity constraints, the manufacturing complexity of integrating dense HBM continues to constrain scalability. Standards bodies such as JEDEC define the HBM3E specification, but real-world effective bandwidth is consistently lower than the rated peak.
At the operator level, a 2025 patent from Shanghai Yunsui Technology describes how AI chip operators must balance data-movement time across memory hierarchy levels — from L1 and L2 caches through to HBM — against compute time. A bottleneck occurs whenever data cannot be fetched from HBM fast enough to keep compute units active. The patent models “data movement latency hiding” strategies where simultaneous DMA transfers across cache levels can partially mask HBM latency — but only up to the physical bandwidth ceiling of the HBM interface itself.
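The limit on latency hiding can be illustrated with a minimal sketch. It assumes an idealised double-buffered operator in which compute and data movement overlap perfectly; the function names and all numbers are hypothetical and are not taken from the patent.

```python
def transfer_time_s(bytes_moved: float, bandwidth_bytes_per_s: float) -> float:
    """Time to move a tile from HBM at a given effective bandwidth."""
    return bytes_moved / bandwidth_bytes_per_s


def serial_time_s(compute_s: float, transfer_s: float) -> float:
    """No overlap: the compute units wait for every transfer to finish."""
    return compute_s + transfer_s


def overlapped_time_s(compute_s: float, transfer_s: float) -> float:
    """Ideal overlap (assumed): the slower of the two stages sets the pace."""
    return max(compute_s, transfer_s)


# Hypothetical operator: 8 GB of traffic, 4 TB/s effective HBM bandwidth,
# 1 ms of pure compute. These values are illustrative assumptions.
transfer = transfer_time_s(8e9, 4e12)     # about 2 ms
compute = 1e-3

serial = serial_time_s(compute, transfer)          # about 3 ms
overlapped = overlapped_time_s(compute, transfer)  # about 2 ms, set by HBM
```

Under these assumptions, overlap removes the compute time from the critical path, but the operator still cannot finish faster than the HBM transfer itself.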
The Roofline model characterises whether a workload is compute-bound or memory-bandwidth-bound by comparing its arithmetic intensity (operations per byte of memory traffic) against the hardware’s balance point. For most LLM inference operators, arithmetic intensity falls below this balance point, making HBM3E bandwidth — not TOPS — the binding constraint on performance.
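The Roofline bound described above can be sketched in a few lines. The accelerator figures below are hypothetical round numbers rather than any product specification, and the intensity of roughly 1 FLOP/byte is an assumed approximation for a decode-style matrix-vector product with fp16 weights (about 2 FLOPs and 2 bytes per weight).

```python
def balance_point(peak_flops: float, mem_bandwidth: float) -> float:
    """FLOP/byte at which the compute roof and the bandwidth roof intersect."""
    return peak_flops / mem_bandwidth


def attainable_flops(intensity: float, peak_flops: float, mem_bandwidth: float) -> float:
    """Roofline bound: the lower of the compute roof and intensity * bandwidth."""
    return min(peak_flops, intensity * mem_bandwidth)


def is_memory_bound(intensity: float, peak_flops: float, mem_bandwidth: float) -> bool:
    """True when the workload sits to the left of the balance point."""
    return intensity < balance_point(peak_flops, mem_bandwidth)


# Hypothetical accelerator, NOT a real product spec.
PEAK_FLOPS = 1.0e15      # 1000 TFLOP/s effective peak compute
HBM_BANDWIDTH = 4.0e12   # 4 TB/s effective HBM bandwidth

# Assumed decode-style intensity: roughly 1 FLOP per byte of weight traffic.
DECODE_INTENSITY = 1.0

if __name__ == "__main__":
    print(balance_point(PEAK_FLOPS, HBM_BANDWIDTH))  # 250.0
    base = attainable_flops(DECODE_INTENSITY, PEAK_FLOPS, HBM_BANDWIDTH)
    doubled = attainable_flops(DECODE_INTENSITY, 2 * PEAK_FLOPS, HBM_BANDWIDTH)
    print(base == doubled)  # True: doubling compute does not help
```

In this sketch the workload reaches only a small fraction of the compute roof, and doubling peak compute leaves attainable throughput unchanged, which is the sense in which bandwidth rather than TOPS is the binding constraint.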
When an AI model’s arithmetic intensity falls below the accelerator’s balance point — the ratio of effective peak compute performance to effective memory bandwidth — HBM3E bandwidth becomes the limiting resource regardless of how many compute cores are provisioned, as identified in a 2025 patent from South China University of Technology on AI accelerator performance fault diagnosis.
Physical Integration Limits: Chiplet Topology, TSVs, and Thermal Feedback
HBM3E bottlenecks extend beyond the raw bandwidth figure into the physical architecture of how HBM stacks are integrated with compute dies. In modular HBM chiplet systems, memory access requests must sometimes be forwarded across die-to-die (D2D) channels to subsequent HBM chiplets in a chain — and latency compounds along that chain. Backpressure on the D2D channel creates stalls that reduce effective bandwidth utilisation below the rated HBM3E specification. Samsung Electronics’ 2025 patent on modular HBM chiplet architecture discloses this mechanism directly, making clear that the physical topology of the chiplet interconnect is itself a bottleneck source, separate from peak pin bandwidth.
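One way to see why added hop latency reduces usable bandwidth is Little's law: with a fixed number of requests in flight, sustained bandwidth is bounded by bytes in flight divided by latency. The sketch below is a toy model under that assumption; the latencies, queue depth, and request size are hypothetical and do not come from the Samsung patent.

```python
def chain_latency_ns(base_ns: float, hop_ns: float, hops: int) -> float:
    """Access latency when a request is forwarded across `hops` D2D links."""
    return base_ns + hops * hop_ns


def latency_bound_bandwidth_gbps(outstanding: int, request_bytes: int,
                                 latency_ns: float) -> float:
    """Little's law bound: bytes in flight / latency. Bytes per ns equals GB/s."""
    return outstanding * request_bytes / latency_ns


# Hypothetical per-channel parameters (illustrative assumptions only).
BASE_NS = 100.0      # latency to the first HBM chiplet
HOP_NS = 20.0        # extra latency per forwarded D2D hop
OUTSTANDING = 64     # requests the requester can keep in flight
REQUEST_BYTES = 64   # bytes per request

# Bandwidth bound for chiplets 0..3 hops down the chain.
bounds = [
    latency_bound_bandwidth_gbps(
        OUTSTANDING, REQUEST_BYTES, chain_latency_ns(BASE_NS, HOP_NS, hops)
    )
    for hops in range(4)
]
```

With these assumed values the bound falls at every hop, even though nothing about the DRAM arrays themselves has changed.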
“The physical topology of HBM chiplet interconnects is itself a bottleneck source — latency compounds along the D2D channel chain, reducing effective bandwidth below rated HBM3E specifications even when individual DRAM arrays are operating correctly.”
Thermal effects compound these physical limits in a feedback loop that is unique to 3D stacked memory. Tsinghua University’s 2020 patent on 3D stacked memory optimisation for neural network accelerator chips identifies that 3D stacked memory concentrates power dissipation in physically small volumes. Frequent DRAM refresh operations in high-utilisation zones raise die temperatures, which degrades retention time and requires more frequent refreshes — consuming bandwidth for non-productive memory maintenance operations. The patent proposes mapping data with different lifetime characteristics to different physical partitions and tuning refresh frequency per partition, underscoring that thermal effects on HBM arrays directly translate into reduced effective bandwidth for compute workloads. Semiconductor thermal management standards published by organisations such as IEEE confirm that power density in 3D-stacked DRAM is among the most constrained thermal environments in modern packaging.
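The bandwidth cost of refresh can be approximated with a simple ratio: the fraction of time a bank is unavailable is roughly the refresh cycle time divided by the refresh interval. The sketch below uses placeholder timings that are assumptions for illustration, not values from the JEDEC HBM3E specification or the Tsinghua patent, and it models a single step of the loop rather than the full thermal dynamics.

```python
def refresh_overhead(t_rfc_ns: float, t_refi_ns: float) -> float:
    """Approximate fraction of time spent refreshing instead of serving requests."""
    return t_rfc_ns / t_refi_ns


def usable_bandwidth(peak_gbps: float, t_rfc_ns: float, t_refi_ns: float) -> float:
    """Peak bandwidth scaled by the time left over after refresh."""
    return peak_gbps * (1.0 - refresh_overhead(t_rfc_ns, t_refi_ns))


# Placeholder timings (assumed, illustrative only).
T_RFC_NS = 350.0            # duration of one refresh command
T_REFI_NORMAL_NS = 3900.0   # refresh interval at normal temperature
T_REFI_HOT_NS = 1950.0      # assumed halved interval in a hot partition

normal = refresh_overhead(T_RFC_NS, T_REFI_NORMAL_NS)
hot = refresh_overhead(T_RFC_NS, T_REFI_HOT_NS)
```

Halving the refresh interval doubles the overhead in this model. Per-partition tuning of the kind the patent proposes would amount to using a different interval for each partition instead of applying the worst case everywhere.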
At multi-stack system scale, the binding constraint shifts further. A 2024 patent from Sungkyunkwan University on DNN task scheduling in multi-HMC PIM systems demonstrates that inter-HMC link bandwidth, rather than individual stack bandwidth, determines aggregate system throughput for deep neural network task partitioning. Communication delay variance due to HBM interconnection characteristics governs how effectively data-level parallelism can be exploited across multiple stacks. This is directly analogous to HBM3E multi-stack configurations, where the chip-to-chip interconnect, not the DRAM array bandwidth, becomes the bottleneck as accelerator scale increases.
In 3D stacked HBM, thermal-induced DRAM instability triggers a refresh frequency feedback loop: higher utilisation → higher temperature → shorter retention time → more frequent refreshes → less bandwidth available for compute. Tsinghua University’s patent proposes per-partition refresh tuning as a mitigation, but the feedback loop is a structural property of dense HBM stacking.
In modular HBM chiplet architectures, latency compounds along the die-to-die (D2D) channel chain when memory access requests are forwarded to subsequent HBM chiplets, reducing effective bandwidth utilisation below the rated HBM3E specification — a finding disclosed in Samsung Electronics’ 2025 patent on modular HBM chiplet architecture.
Irregular Access Patterns: How Transformer Workloads Underutilise HBM3E Bandwidth
Even when HBM3E’s rated bandwidth is nominally sufficient, AI accelerator workloads — particularly autoregressive LLM inference — generate access patterns that prevent efficient utilisation of that bandwidth. The root cause is the KV-cache access pattern inherent to attention mechanisms: at inference time, each generated token must retrieve key and value vectors for all previous tokens, creating memory accesses that grow linearly with sequence length and are highly irregular in memory address space. This produces poor row-buffer locality in HBM arrays and drives effective bandwidth far below the rated peak, as documented in a 2026 patent from Southeast University on 3D NAND flash-based LLM accelerator architecture.
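The linear growth of KV-cache traffic follows from the standard size formula for a decoder-only transformer: two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below applies it to a hypothetical model configuration; the layer count, head count, and head dimension are assumptions chosen for round numbers.

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int, head_dim: int,
                   bytes_per_elem: int = 2, batch: int = 1) -> int:
    """Bytes of keys plus values held for `seq_len` tokens (factor 2 = K and V)."""
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem


# Hypothetical model configuration with fp16 cache entries (assumed values).
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128

per_token = kv_cache_bytes(1, N_LAYERS, N_KV_HEADS, HEAD_DIM)
at_4k = kv_cache_bytes(4096, N_LAYERS, N_KV_HEADS, HEAD_DIM)

if __name__ == "__main__":
    print(per_token)        # 131072 bytes, i.e. 128 KiB per cached token
    print(at_4k // 2**20)   # 512 MiB read for one decode step at 4096 tokens
```

Because each decode step reads the whole cache for its sequence, the bytes fetched per generated token scale with context length, independent of how the accesses are laid out in HBM.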
The embedding table access problem is structurally analogous. A 2025 patent from Tata Consultancy Services on heterogeneous memory deployment for embedding tables states explicitly that “the performance bottleneck lies in the latency of embedding access.” Embedding lookups are sparse and irregular — each access pulls a small amount of data from a random HBM address — yielding poor row-buffer locality and low effective bandwidth even when nominal HBM3E bandwidth is high. Research published through ACM on recommendation system hardware has similarly identified embedding table access as the dominant memory bottleneck in production ML inference.
Non-contiguous data layout in HBM adds a further dimension to the access-pattern problem. A 2025 patent from Tencent Technology (Beijing) on inference acceleration describes how multi-round inference creates fragmented KV-cache memory layouts that reduce L2 cache hit rates and force higher-frequency HBM accesses. The patent identifies that 64-byte alignment matching GPU L2 cache line size is critical for maximising cache hit rates, and that non-contiguous KV-cache data created by multi-round inference reduces memory access efficiency and creates HBM bottlenecks. This is particularly severe in Hopper-architecture accelerators where the tensor memory accelerator (TMA) requires contiguous data layouts to achieve maximum HBM bandwidth.
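The effect of alignment on cache-line traffic can be shown with simple address arithmetic. The sketch below takes the 64-byte line size from the text above as given; the helper names and example addresses are hypothetical, and it counts lines touched rather than modelling a real GPU cache.

```python
def align_up(n_bytes: int, alignment: int = 64) -> int:
    """Round a size or address up to the next multiple of `alignment`."""
    return (n_bytes + alignment - 1) // alignment * alignment


def cache_lines_touched(start_addr: int, n_bytes: int, line: int = 64) -> int:
    """Number of distinct cache lines covered by a contiguous read (n_bytes > 0)."""
    first = start_addr // line
    last = (start_addr + n_bytes - 1) // line
    return last - first + 1


# A 256-byte KV vector: aligned start versus a start offset by 32 bytes.
aligned_lines = cache_lines_touched(0, 256)      # 4 lines
misaligned_lines = cache_lines_touched(32, 256)  # 5 lines

# Padding a 100-byte entry so that the next entry starts on a line boundary.
padded = align_up(100)                           # 128
```

In this toy case the misaligned read touches one extra line for the same payload. Fragmented layouts multiply that kind of waste across many small reads.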
Multi-round LLM inference creates non-contiguous KV-cache memory layouts in HBM that reduce GPU L2 cache hit rates and force higher-frequency HBM accesses; 64-byte alignment matching GPU L2 cache line size is required to restore efficiency, as shown in Tencent Technology’s 2025 inference acceleration patent.
On-Chip Capacity Constraints: Weight Residency, MoE Routing, and Activation Recomputation
HBM3E bottlenecks are also driven by insufficient on-chip SRAM capacity relative to the working set of AI model parameters, which forces repeated weight reloads from HBM during inference. When model weights do not fit in on-chip caches, every matrix multiplication requires fetching weights from HBM, making the operation memory-bound by definition. Google LLC’s 2022 patent on neural network accelerators with parameters resident on chip addresses this directly, disclosing a second memory bank sized to “store a sufficient amount of neural network parameters in the computing unit to allow a throughput level or higher and a latency level or lower.” The design philosophy is to keep weights on-chip to eliminate HBM traffic for weight fetches — but for large models, this is not feasible, and HBM traffic becomes unavoidable.
Mixture-of-Experts (MoE) architectures — the dominant design for parameter-efficient LLMs — create a particularly acute form of this capacity bottleneck. Only a subset of expert weights is activated per token, but all expert weights must reside in HBM, and the routing pattern requires loading different expert weights per forward pass. A 2025 patent from the University of Science and Technology of China on MoE inference offloading quantifies the scale of this problem: Mixtral 8×7B requires 94 GB of memory at half precision, far exceeding consumer GPU memory. When experts cannot fit in HBM, they must be offloaded to DRAM or SSD, and the resulting load latency reaches approximately 3.9 seconds per token in one cited case — a direct translation of HBM capacity limits into inference latency. Research published through USENIX on LLM serving systems has documented similar latency spikes from MoE weight offloading in production deployments.
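The footprint figure can be checked with back-of-envelope arithmetic. The sketch below assumes the commonly reported total of about 46.7 billion parameters for Mixtral 8×7B, which is an assumption not stated in the patent summary above, and uses a hypothetical host link bandwidth to show how offloaded expert bytes turn into load time. It does not attempt to reproduce the 3.9-second figure.

```python
def weight_bytes(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed to hold the weights at a given precision (2 = fp16/bf16)."""
    return n_params * bytes_per_param


def load_time_s(n_bytes: float, link_bytes_per_s: float) -> float:
    """Time to stream offloaded weights back over a host link."""
    return n_bytes / link_bytes_per_s


# Assumed total parameter count for Mixtral 8x7B (approximate, public figure).
MIXTRAL_PARAMS = 46.7e9

footprint_gb = weight_bytes(MIXTRAL_PARAMS) / 1e9   # about 93.4 GB at half precision

# Hypothetical offload scenario: 16 GB of expert weights over a 32 GB/s link.
reload_s = load_time_s(16e9, 32e9)                   # 0.5 s
```

The estimate lands close to the 94 GB quoted above, and the reload calculation shows why any expert that misses HBM costs time on the order of its size divided by the slower link's bandwidth.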
The recalculation versus reloading tradeoff is analysed in Huawei Technologies’ 2025 patent on AI model generation, which describes activation recomputation as a method to reduce HBM memory occupancy. Full recomputation discards intermediate activations and recomputes them during backpropagation, consuming 30–40% additional compute but freeing HBM capacity. This tradeoff illustrates that HBM capacity is itself a bottleneck: when HBM fills up with activations, the system must either halt forward passes — creating stalls — or pay a significant compute penalty through recomputation.
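The tradeoff can be framed as a capacity check plus a time penalty. The sketch below is a deliberately simple model: the memory figures are hypothetical, and the overhead fraction is taken from the 30–40% range quoted above rather than measured.

```python
def fits_in_hbm(weights_gb: float, activations_gb: float, hbm_gb: float) -> bool:
    """True when weights plus retained activations fit in HBM capacity."""
    return weights_gb + activations_gb <= hbm_gb


def step_time_with_recompute(base_step_s: float, overhead_fraction: float) -> float:
    """Training step time after adding the recomputation overhead."""
    return base_step_s * (1.0 + overhead_fraction)


# Hypothetical training configuration (all values are illustrative assumptions).
HBM_GB = 141.0
WEIGHTS_AND_STATE_GB = 100.0
ACTIVATIONS_FULL_GB = 60.0        # all intermediate activations retained
ACTIVATIONS_RECOMPUTE_GB = 5.0    # only checkpointed inputs retained

without = fits_in_hbm(WEIGHTS_AND_STATE_GB, ACTIVATIONS_FULL_GB, HBM_GB)
with_recompute = fits_in_hbm(WEIGHTS_AND_STATE_GB, ACTIVATIONS_RECOMPUTE_GB, HBM_GB)

# Step time range for a 1.0 s baseline at 30% and 40% extra compute.
low = step_time_with_recompute(1.0, 0.30)
high = step_time_with_recompute(1.0, 0.40)
```

In this example the configuration only fits once activations are discarded, and the price is a step that takes roughly 1.3 to 1.4 times as long.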
Activation recomputation in large model training trades 30–40% additional compute cost to free HBM capacity by discarding intermediate activations and recomputing them during backpropagation — a tradeoff described in Huawei Technologies’ 2025 AI Model Generation Method patent, illustrating that HBM capacity is itself a binding constraint.
Key Players and the Convergence Toward Processing-in-Memory
Analysis of the 60+ patent dataset reveals a clear convergence: the dominant structural response to HBM3E bandwidth and capacity limits is processing-in-memory (PIM) — moving computation into the HBM stack itself to eliminate the data-movement bottleneck at its source. Samsung Electronics, SK Hynix, KAIST, and Google LLC are the leading institutional innovators, each approaching the problem from a different layer of the stack.
Samsung Electronics
Samsung is the most prominent hardware-level innovator, with patents spanning modular HBM chiplet architectures (2025) and PIM-capable memory modules that execute neural network operations inside HBM stacks to reduce data movement (2024). Samsung’s approach addresses HBM3E bottlenecks by moving compute to memory rather than moving data to compute — a fundamental architectural inversion relative to conventional accelerator design.
SK Hynix
SK Hynix contributes at the PIM-architecture level with a 2025 patent disclosing PIM devices that store and process key/value vectors for attention operations directly within HBM banks, targeting the KV-cache access bottleneck at its root. This is a direct architectural response to the irregular access pattern problem described in the previous section.
KAIST
Korea Advanced Institute of Science and Technology (KAIST) is the most prolific academic contributor, with multiple patents on NPU-PIM heterogeneous acceleration for batched LLM inference (2025), in-memory computing for CNN operations (2023), and super-pipelined PIM accelerator architectures with local error prediction (2025). KAIST’s work spans from the circuit level to system-level scheduling, making it the broadest academic contributor in the dataset.
Google LLC
Google addresses HBM3E bottlenecks at the system and compiler level, with patents on hardware-optimised neural architecture search (2022), application-specific ML accelerator tuning (2023), and on-chip memory arbitration for neural network compute tiles (2025). Google’s approach is to co-design the model architecture and hardware to reduce the arithmetic intensity mismatch that drives the Roofline bottleneck — a software-hardware co-optimisation strategy that complements the hardware-level PIM approaches of Samsung and SK Hynix. The Semiconductor Industry Association has identified PIM and chiplet-based integration as among the highest-priority research directions for post-Moore’s Law AI hardware.
Chinese Academic Institutions
Tsinghua University, Peking University, Southeast University, and Fudan University collectively focus on hybrid memory hierarchies, 3D NAND-based accelerators, DRAM-PIM speculative inference, and RRAM/SRAM in-memory training — addressing HBM bottlenecks by reducing reliance on external HBM through near-memory and in-memory computation. Secondary trends across the dataset include hardware-aware operator scheduling, adaptive memory hierarchy management, and model compression techniques that reduce HBM traffic at the algorithmic level.
Both Samsung Electronics (2024) and SK Hynix (2025) have patented processing-in-memory (PIM) architectures that execute neural network operations — including attention key/value processing — directly within HBM stacks, representing the primary structural industry response to HBM3E data-movement bottlenecks in AI accelerator chips.