Compute-in-Memory for Transformer Inference — PatSnap Eureka
How Compute-in-Memory Eliminates Data Movement Energy in Transformer Inference
In large language model inference, data movement between compute units and memory can account for more than 80% of total system energy. Compute-in-memory (CIM) architectures perform multiply-accumulate operations directly within the memory array, eliminating the dominant energy cost at its source.
Why Transformer Inference Hits the Memory Wall
Transformer models impose a uniquely challenging memory access pattern on conventional hardware. The self-attention mechanism requires computing scaled dot-product attention over queries, keys, and values — operations whose memory bandwidth demands scale quadratically with sequence length. Conventional accelerators must continuously load weight matrices from off-chip DRAM through bandwidth-limited memory buses, causing the majority of system energy to be spent on data movement rather than arithmetic.
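To make the quadratic scaling concrete, the minimal sketch below counts the bytes needed to hold one layer's full attention score matrix. The function name, the 32-head configuration, and the 2-byte score width are illustrative assumptions, not figures from the cited patents.

```python
def attention_score_bytes(seq_len: int, num_heads: int, bytes_per_score: int = 2) -> int:
    """Bytes to hold the full seq_len x seq_len score matrix for every head.

    Assumed model: one stored score per (query, key) pair per head.
    """
    return num_heads * seq_len * seq_len * bytes_per_score


if __name__ == "__main__":
    # Hypothetical configuration: 32 heads, 16-bit scores.
    for n in (1024, 2048, 4096):
        mib = attention_score_bytes(n, num_heads=32) / 2**20
        print(f"seq_len={n}: {mib:.0f} MiB of attention scores per layer")
```

Under these assumptions, each doubling of the sequence length quadruples the score storage (64, 256, and 1024 MiB), which is the traffic a conventional accelerator would have to move if the scores were spilled to external memory.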
As documented in a 2026 Shanghai Jiao Tong University patent on heterogeneous LLM inference chiplet simulation, in large language model inference the repeated movement of weights and activations between compute units and multi-level caches can account for more than 80% of total system energy, severely constraining both energy efficiency and scalability.
Existing processors designed for convolutional neural networks cannot efficiently handle the unique computation flow of scaled dot-product attention, as noted in Zhejiang University's 2025 ReRAM-based process-in-memory patent. The fundamental mismatch is architectural: the von Neumann bottleneck was tolerable for CNN workloads but becomes crippling for transformer models where the KV-cache must be accessed at every token generation step.
CIM technology addresses this by embedding multiply-accumulate (MAC) units directly within memory arrays — whether SRAM, RRAM, or eDRAM — so that matrix-vector multiplications are executed in-situ, eliminating the dominant source of energy dissipation.
How CIM Architectures Eliminate Data Movement at the Hardware Level
Six distinct hardware mechanisms have been patented for transformer inference acceleration, each targeting a different layer of the data movement problem.
Proximity Cell MAC-Tree in 3T-eDRAM
KAIST's Digital 3T-eDRAM CIM Macro stores weight data in sub-arrays that perform simultaneous read operations. An operator built as a "proximity cell MAC tree" performs multiply-accumulate operations directly on weight data read from the array and activation data injected through a macro controller. By keeping weight data resident in the eDRAM array and performing MAC operations at the array periphery rather than after a full memory read cycle, the architecture avoids repeated round-trips to external DRAM during transformer layer computations.
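The general idea of a MAC tree can be sketched in software as an element-wise multiply followed by a pairwise adder-tree reduction. This is a behavioural illustration only; the function name and the zero-padding of odd levels are assumptions, and it does not model the actual circuit in the KAIST macro.

```python
def mac_tree(weights, activations):
    """Multiply each weight by its activation, then reduce with a pairwise adder tree."""
    if len(weights) != len(activations):
        raise ValueError("weights and activations must have the same length")
    # Leaf level: one product per (weight, activation) pair.
    level = [w * a for w, a in zip(weights, activations)]
    # Each pass halves the number of partial sums, like one stage of an adder tree.
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)  # pad odd levels so every adder has two inputs
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0] if level else 0


if __name__ == "__main__":
    print(mac_tree([1, 2, 3], [4, 5, 6]))  # 4 + 10 + 18 = 32
```

A tree of this shape needs about log2(N) adder stages for N products, which is one reason the reduction can sit next to the cells that supply the weights.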
In-array MAC at periphery
Voltage-Capacitance Dual Coding with Charge Reuse
KAIST's energy-efficient in-memory computation processor performs analog computation through voltage-capacitance dual coding, computation word line charge reuse, and a signal amplification pipeline culminating in time-to-digital conversion. The charge reuse mechanism is particularly significant: instead of fully resetting and recharging memory bitlines for every MAC operation, residual charge from prior computations is reused, directly reducing the dynamic switching energy that dominates analog CIM power budgets.
Charge reuse reduces switching energy
ReRAM Crossbar with In-Situ Matrix Decomposition
Zhejiang University's ReRAM-based CIM architecture decomposes transformer weight matrices using the Re-Transformer algorithm to reduce compute and write operands before mapping them onto ReRAM crossbar arrays. Because ReRAM cells store weights as physical conductance states, no energy is spent fetching weights from external DRAM during inference — the MAC operation occurs as current flows through the crossbar in response to applied input voltages. A hybrid softmax unit based on resistive RAM selection-and-comparison logic further reduces power by avoiding multi-pass digital softmax implementations.
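The crossbar MAC described above follows from Ohm's law and Kirchhoff's current law: each cell contributes a current equal to its conductance times the applied row voltage, and each column wire sums those currents. The sketch below is an idealized model under stated assumptions (no wire resistance, no device nonlinearity, no ADC quantization); the function and variable names are hypothetical.

```python
def crossbar_mvm(conductances, voltages):
    """Ideal crossbar: column current I_j = sum_i V_i * G[i][j].

    conductances[i][j] is the cell at input row i, output column j (siemens).
    voltages[i] is the voltage applied to row i (volts).
    """
    if len(conductances) != len(voltages):
        raise ValueError("one voltage is required per crossbar row")
    num_cols = len(conductances[0]) if conductances else 0
    currents = [0.0] * num_cols
    for row, v in zip(conductances, voltages):
        for j, g in enumerate(row):
            currents[j] += v * g  # Ohm's law per cell, summed on the column wire
    return currents


if __name__ == "__main__":
    G = [[1.0, 2.0],
         [3.0, 4.0]]
    V = [1.0, 0.5]
    print(crossbar_mvm(G, V))  # [2.5, 4.0]
```

In this model the weights never leave the array: only the input voltages and the summed column currents cross the array boundary.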
Conductance-state weight storage
Floating-Point Outlier Segregation
Transformer activations frequently exhibit outlier values with large dynamic range that are poorly served by fixed-point arithmetic in CIM arrays. KAIST's hybrid processor classifies input data into inliers and outliers. Inlier data undergoes fixed-point operations in a CIM operator, while outlier data is routed to a parallel NPU operator that performs floating-point arithmetic. An aggregation core then sums both result streams, capturing CIM energy benefits for the majority of computations while preserving numerical accuracy for the tail distribution of outlier activations.
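The inlier/outlier split can be illustrated with a dot product that sends small-magnitude activations through an integer (fixed-point) accumulator and large-magnitude activations through a floating-point accumulator, then sums the two. The magnitude threshold, the scale factor, and the function name are assumptions made for this sketch; the source does not specify how the classification is done.

```python
def hybrid_dot(activations, weights, threshold=8.0, scale=16):
    """Dot product with a fixed-point inlier path and a floating-point outlier path."""
    inlier_acc = 0      # integer accumulator, standing in for the CIM operator
    outlier_acc = 0.0   # float accumulator, standing in for the NPU operator
    for a, w in zip(activations, weights):
        if abs(a) <= threshold:
            # Quantize both operands to integers with `scale` steps per unit.
            inlier_acc += round(a * scale) * round(w * scale)
        else:
            outlier_acc += a * w
    # Aggregation step: rescale the fixed-point sum and add the outlier sum.
    return inlier_acc / (scale * scale) + outlier_acc


if __name__ == "__main__":
    acts = [0.5, 100.0, -0.25]   # 100.0 is treated as an outlier
    wts = [0.5, 0.25, 1.0]
    print(hybrid_dot(acts, wts))  # 25.0
```

The design choice being illustrated is that the fixed-point range only has to cover the inliers, so a single large activation does not force a coarser quantization step on every other value.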
Inlier CIM + outlier NPU parallel path
Selective Bit-Cell Disabling for Zero-Valued Inputs
In Qualcomm's sparsity-aware CIM design, when the sparsity of input data exceeds a threshold, individual bit cells within the CIM array are selectively disabled prior to processing, preventing unnecessary switching activity. A compensation value is then applied to the output to correct for the effect of disabled cells. This mechanism extends energy savings beyond data movement to the MAC computation itself and can be applied to transformer attention activations where many query-key inner products produce near-zero attention weights.
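A toy model of threshold-triggered gating with output compensation is sketched below. The compensation model is an assumption made purely for illustration: each enabled cell is taken to add a fixed offset to the accumulated result, so disabling a cell removes that offset and the output must be corrected by the same amount. The source does not describe how the compensation value is derived, and all names and default values here are hypothetical.

```python
def gated_mac(inputs, weights, sparsity_threshold=0.5, cell_offset=1):
    """Return (result, active_cells) for a MAC with optional zero-input gating."""
    n = len(inputs)
    zeros = sum(1 for x in inputs if x == 0)
    sparsity = zeros / n if n else 0.0
    gate = sparsity > sparsity_threshold  # only gate when the input is sparse enough
    acc = 0
    active = 0
    for x, w in zip(inputs, weights):
        if gate and x == 0:
            continue  # cell disabled: no switching activity for this position
        acc += x * w + cell_offset  # assumed per-cell offset of an enabled cell
        active += 1
    # Assumed compensation: restore the offsets the disabled cells would have added.
    compensation = (n - active) * cell_offset
    return acc + compensation, active


if __name__ == "__main__":
    x = [0, 0, 0, 5]
    w = [1, 2, 3, 4]
    print(gated_mac(x, w))                          # gated: (24, 1)
    print(gated_mac(x, w, sparsity_threshold=0.9))  # not gated: (24, 4)
```

Both calls return the same result, but the gated call activates one cell instead of four, which is the energy saving the mechanism targets.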
Zero-input cell gating
Bit-Inversion Minimization for Nonvolatile Write Energy
An often-overlooked data movement energy cost is writing updated weights into nonvolatile CIM arrays. Yonsei University's hybrid memory CIM architecture remaps neural network layer weights across first and second CIM arrays to minimize the number of bit inversions during sequential weight storage. By selecting the weight array ordering that minimizes bit transitions, the architecture reduces both write energy and the endurance degradation of nonvolatile memory cells — meaningful for transformer models whose weights change between fine-tuning tasks.
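The ordering choice can be illustrated by counting bit flips (Hamming distance) between what each array currently holds and what it would hold under each candidate assignment, then picking the cheaper one. This is a simplified two-array, two-layer sketch; the function names and the word-level data layout are assumptions, not the Yonsei University method itself.

```python
def bit_flips(old_words, new_words):
    """Number of bits that change when new_words overwrite old_words."""
    return sum(bin(o ^ n).count("1") for o, n in zip(old_words, new_words))


def choose_mapping(array_a, array_b, layer_x, layer_y):
    """Pick the layer-to-array assignment that causes the fewest bit inversions."""
    straight = bit_flips(array_a, layer_x) + bit_flips(array_b, layer_y)
    swapped = bit_flips(array_a, layer_y) + bit_flips(array_b, layer_x)
    if straight <= swapped:
        return "straight", straight
    return "swapped", swapped


if __name__ == "__main__":
    array_a = [0b1111, 0b0000]   # current contents of the first array
    array_b = [0b0000, 0b1111]   # current contents of the second array
    layer_x = [0b0000, 0b1110]   # next weights to store
    layer_y = [0b1111, 0b0001]
    print(choose_mapping(array_a, array_b, layer_x, layer_y))  # ('swapped', 2)
```

In this example the straightforward assignment would flip 14 bits, while the swapped assignment flips 2, so fewer cells are rewritten.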
Write energy via bit-inversion reduction
CIM Innovation Landscape: Key Data Points
Patent filing trends and assignee activity derived from over 50 CIM patents filed 2020–2026, analysed via PatSnap Eureka.
CIM Patent Filings by Assignee (2020–2026)
KAIST leads with ~10 distinct filings; Qualcomm, IBM, Zhejiang and Tsinghua follow as the next tier of active assignees.
CIM Co-Optimisation Strategies in Transformer Patents
Sparsity exploitation, quantization, attention fusion, and 3D integration are the four most frequently cited strategies layered atop base CIM approaches.
Beyond Generic CIM: Architectures Built for Attention
The most impactful patents in this corpus go beyond applying CIM to generic DNN layers — they redesign the computation graph of self-attention itself to eliminate intermediate data movement.
Attention Fusion: QK × Softmax × SV in One In-Memory Pass
KAIST's Attention Fusion PIM Architecture (2025) merges query-key multiplication, softmax normalization, and value-weighted summation into a single contiguous in-memory computation, eliminating the need to write intermediate attention scores back to external memory between operations. Each attention-PIM cluster contains PIM engines for matrix multiplication, a vector operator for post-processing, and an attention memory for intermediate result storage — all on-die.
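A software analogue of fusing the three attention steps is the streaming ("online") softmax, which processes one key/value pair at a time and never materializes the full score vector. The sketch below handles a single query row. It is offered as an illustration of why fusion removes the intermediate write-back, not as the computation order used in the KAIST architecture; the function name is hypothetical.

```python
import math


def fused_attention_row(q, keys, values):
    """Attention output for one query, computed in a single pass over (key, value) pairs."""
    if not keys:
        raise ValueError("at least one key/value pair is required")
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf          # largest score seen so far, for numerical stability
    denom = 0.0                      # running softmax denominator
    out = [0.0] * len(values[0])     # running weighted sum of values
    for k, v in zip(keys, values):
        score = scale * sum(qi * ki for qi, ki in zip(q, k))
        new_max = max(running_max, score)
        correction = math.exp(running_max - new_max)  # rescales earlier partial results
        weight = math.exp(score - new_max)
        denom = denom * correction + weight
        out = [o * correction + weight * vi for o, vi in zip(out, v)]
        running_max = new_max
    return [o / denom for o in out]


if __name__ == "__main__":
    # Two identical keys give equal weights, so the output is the mean of the values.
    print(fused_attention_row([1.0, 1.0], [[1.0, 1.0], [1.0, 1.0]], [[0.0], [2.0]]))  # [1.0]
```

Only the running maximum, the running denominator, and the output vector are kept between steps, so the working state stays small enough to live in a local buffer.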
Triple Sparsity Handling: Weight, Activation, and Attention Score
The same KAIST attention fusion system simultaneously exploits weight sparsity, activation sparsity, and attention score sparsity to gate unnecessary MAC operations. This triple sparsity approach is uniquely suited to transformer models where sparse attention patterns emerge naturally from the softmax distribution — many query-key inner products produce near-zero attention weights that can be gated without accuracy loss.
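Attention-score gating can be sketched as a value-weighted sum that skips any position whose softmax probability falls below a cutoff. Only the attention-score case is shown; weight and activation gating would follow the same skip-if-zero pattern. The cutoff `eps` and the function name are assumptions, and in this toy model the skipped terms introduce a small approximation error bounded by `eps` times the value magnitude per skipped position.

```python
def gated_weighted_sum(probs, values, eps=1e-3):
    """Return (output, skipped) where positions with probability < eps are not accumulated."""
    out = [0.0] * len(values[0]) if values else []
    skipped = 0
    for p, v in zip(probs, values):
        if p < eps:
            skipped += 1  # gate the whole row of MACs for this key position
            continue
        for j, vj in enumerate(v):
            out[j] += p * vj
    return out, skipped


if __name__ == "__main__":
    probs = [0.9995, 0.0005]              # second position is near-zero after softmax
    values = [[1.0, 2.0], [100.0, 100.0]]
    out, skipped = gated_weighted_sum(probs, values)
    print(out, skipped)
```

Here half of the value rows are skipped, and the count of skipped rows gives a rough proxy for the MAC operations saved.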
KV-Cache PIM: Processing Key and Value Vectors Where They Live
SK Hynix's Neural Network Architecture for Multi-Head Attention (2025) distributes multi-head attention computation across PIM-enabled memory banks. Each PIM device contains memory banks storing key vectors and value vectors, with co-located processing units that execute attention operations using those locally resident vectors. Key vectors and value vectors are stored using different access patterns, each tuned to how that vector type is read during the attention computation. This directly targets the KV-cache access bottleneck in autoregressive transformer inference.
SRAM vs. eDRAM vs. ReRAM: Tradeoffs for Transformer CIM
The dominant hardware substrate choices across the patent corpus are SRAM, embedded DRAM (eDRAM), and resistive RAM (ReRAM/RRAM), each offering distinct tradeoffs in speed, endurance, and in-situ computation density. The choice of memory technology fundamentally shapes what CIM operations are practical and what energy savings are achievable.
eDRAM is favoured by KAIST's Digital 3T-eDRAM CIM Macro for its higher density than SRAM and its suitability for the proximity cell MAC-tree architecture. The 3T cell structure allows simultaneous read operations across sub-arrays, enabling the MAC tree to operate at the array periphery without a full memory read cycle. The tradeoff is refresh overhead and slightly lower speed compared to SRAM.
ReRAM/RRAM is the substrate of choice for Zhejiang University and Tsinghua University because weights are stored as physical conductance states — a nonvolatile representation that requires zero energy to maintain and zero fetch energy during inference. The MAC operation occurs as current flows through the crossbar in response to applied input voltages. The Re-Transformer matrix decomposition algorithm reduces the number of write operands before they reach the ReRAM array, compounding the energy savings. The endurance limitation of ReRAM cells motivates Yonsei University's bit-inversion minimization technique for weight remapping.
Hybrid approaches such as IBM's 2D mesh architecture combine analog CIM tiles (for high-efficiency matrix-vector multiplication, MVM) with digital compute cores (for non-linear functions like softmax and layer normalization). Princeton University's scalable array architecture pairs an in-memory compute array for MVM operations with a near-memory compute SIMD unit for element-wise operations, a dual-mode capability critical for transformer layers that alternate between weight-dominated linear projections and activation-dominated operations.
Key Assignees Driving CIM for Transformer Inference
The patent corpus spans filings from academia, semiconductor companies, and memory vendors — each approaching the data movement problem from a different vantage point.
KAIST (Korea Advanced Institute of Science and Technology)
The most prolific single assignee in this dataset, with patents spanning eDRAM-based CIM macros for transformer matrix multiplication, energy-efficient analog CIM processors with charge reuse, hybrid floating-point/fixed-point CIM for outlier handling, attention-fusion PIM with triple sparsity, hybrid sparse-dense transformer accelerators, and end-to-end on-device training PIM accelerators. KAIST filings consistently target the specific computational graph of transformer self-attention rather than generic DNN acceleration.
Transformer self-attention specialist
Qualcomm Incorporated
Holds sparsity-aware CIM filings in multiple jurisdictions (US 2025, CN 2024, IN 2024) alongside CIM architectures for depthwise convolution, signaling a focus on edge deployment of machine learning across mobile and IoT hardware platforms. Qualcomm's sparsity-aware approach selectively disables CIM bit cells when input sparsity exceeds a threshold, extending energy savings beyond data movement to the MAC computation itself.
Edge ML · mobile & IoT focus
International Business Machines Corporation (IBM)
Holds the two-dimensional mesh CIM accelerator architecture in both US and WO jurisdictions. IBM's approach is distinctive in combining analog CIM tiles with digital compute cores in a hybrid fabric, targeting large-scale DNN inference where analog-domain MAC operations provide high energy efficiency and digital cores handle non-linear functions. The 2D mesh topology enables weight matrices too large for a single tile to be partitioned across adjacent tiles with localized partial-sum accumulation.
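Partitioning a weight matrix across a grid of tiles with partial-sum accumulation can be sketched as a blocked matrix-vector product. Each (row block, column block) pair stands in for one tile, and tiles in the same row block add their partial sums into the same outputs. The tile sizes and function name are hypothetical; the sketch does not model IBM's analog tiles or mesh routing.

```python
def tiled_matvec(matrix, vector, tile_rows, tile_cols):
    """Matrix-vector product computed tile by tile with partial-sum accumulation."""
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    result = [0] * n_rows
    for r0 in range(0, n_rows, tile_rows):          # tile row index in the mesh
        for c0 in range(0, n_cols, tile_cols):      # tile column index in the mesh
            r1 = min(r0 + tile_rows, n_rows)
            c1 = min(c0 + tile_cols, n_cols)
            # One tile: partial dot products over its slice of the input vector.
            for r in range(r0, r1):
                partial = sum(matrix[r][c] * vector[c] for c in range(c0, c1))
                result[r] += partial                # accumulate across neighbouring tiles
    return result


if __name__ == "__main__":
    W = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
    print(tiled_matvec(W, [1, 1, 1], tile_rows=2, tile_cols=2))  # [6, 15, 24]
```

The result matches an untiled product for any tile size, which is what allows a matrix larger than one tile to be spread over several.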
Analog-digital hybrid mesh
Zhejiang University
Pursues ReRAM-based CIM for transformer self-attention, with both Chinese and US continuations of the same matrix-decomposition architecture, signaling intent to build international IP around this specific algorithmic-hardware co-design. The Re-Transformer algorithm reduces the number of compute and write operands before mapping onto ReRAM crossbar arrays, compounding the in-situ energy savings with algorithmic reduction of the operand count itself.
ReRAM + algorithmic co-design
CIM Patent Activity: Acceleration in 2023–2026
The bulk of active CIM patents for transformer inference are concentrated in the 2023–2026 window, reflecting the rapid maturation of LLM deployment as a commercial priority.
CIM Patent Filing Activity by Year (Indexed, 2020–2026)
Filing activity accelerated sharply from 2022 onward, coinciding with the commercial deployment of large language models and the emergence of transformer-specific CIM architectures.
Compute-in-Memory for Transformer Inference — key questions answered
In large language model inference, the repeated movement of weights and activations between compute units and multi-level caches can account for more than 80% of total system energy, severely constraining both energy efficiency and scalability. The self-attention mechanism requires computing scaled dot-product attention over queries, keys, and values — operations whose memory bandwidth demands scale quadratically with sequence length.
CIM architectures eliminate DRAM access energy for weight data by performing multiply-accumulate (MAC) operations directly within or adjacent to the memory array — whether SRAM, RRAM, or eDRAM — so that matrix-vector multiplications are executed in-situ, eliminating the dominant source of energy dissipation. Compared to traditional CMOS-based CNN accelerators, CIM accelerators for neural network inference can improve energy efficiency by two to three orders of magnitude.
Attention fusion merges the query-key multiplication, softmax, and value-weighted summation steps into a single contiguous in-memory computation, eliminating the need to write intermediate attention scores back to external memory between operations. This is the transformer-specific CIM optimization demonstrated by KAIST's Attention Fusion Processing-in-Memory Architecture (2025), which also simultaneously exploits weight sparsity, activation sparsity, and attention score sparsity to gate unnecessary MAC operations.
The dominant hardware substrate choices are SRAM, embedded DRAM (eDRAM), and resistive RAM (ReRAM/RRAM), each offering distinct tradeoffs in speed, endurance, and in-situ computation density. KAIST uses 3T-eDRAM for proximity MAC-tree computation; Zhejiang University uses ReRAM crossbar arrays where weights are stored as physical conductance states; Tsinghua University stacks RRAM-CIM and CFET 2T0C-CIM layers in monolithic 3D integration.
When the sparsity of input data exceeds a threshold, individual bit cells within the CIM array are selectively disabled prior to processing, preventing unnecessary switching activity. A compensation value is then applied to the output to correct for the effect of disabled cells. This mechanism, formalized by Qualcomm in their Sparsity-Aware Compute-in-Memory patent (US, 2025), extends energy savings beyond data movement to the MAC computation itself.
The dominant assignees by patent count are Korea Advanced Institute of Science and Technology (KAIST) with approximately 10 distinct filings, followed by Qualcomm Incorporated, International Business Machines Corporation (IBM), Zhejiang University, and Tsinghua University. Secondary contributors include SK Hynix, Samsung SDS, Yonsei University, Princeton University, and Nanjing University. The patent corpus spans filings from 2020 through 2026, with the bulk of active patents concentrated in the 2023–2026 window.
References
- Digital 3T-eDRAM Based CIM Macro for Accelerating Matrix Multiplications in Transformer Model with High-Accuracy and High Compute-Efficiency. KAIST, 2025.
- Energy-Efficient In-Memory Computation Processor and Method Using Neural Network Data Distribution. KAIST, 2025.
- Process-in-Memory Architecture Based on Resistive Random Access Memory and Matrix Decomposition Acceleration Algorithm. Zhejiang University, 2025.
- In-Memory Architecture Based on Resistive Random Access Memory and Matrix Decomposition Acceleration Algorithm [基于阻变存储器和矩阵分解加速算法的存内架构]. Zhejiang University, 2024.
- Attention Fusion Processing-in-Memory Architecture for Transformer Acceleration with Triple Sparsity-Handling. KAIST, 2025.
- Sparsity-Aware Compute-in-Memory. Qualcomm Incorporated, 2025.
- Sparsity-Aware Compute-in-Memory [稀疏性感知的存算一体]. Qualcomm, 2024.
- Sparsity-Aware In-Memory Computing. Qualcomm Incorporated, 2024.
- Two-Dimensional Mesh for Compute-in-Memory Accelerator Architecture. IBM, 2023.
- Two-Dimensional Mesh for Compute-in-Memory Accelerator Architecture. IBM, 2025.
- Two-Dimensional Mesh for Compute-in-Memory Accelerator Architecture [用于存储器内计算加速器架构的二维网格]. IBM, 2024.
- Transformer Accelerator Architecture Based on Monolithic 3D Integration [基于单片三维集成的Transformer加速器架构]. Tsinghua University, 2024.
- Three-Dimensional Heterogeneous Integration Compute-in-Memory Processing Method and Apparatus [三维异构集成存算一体处理方法及装置]. Northern Integrated Circuit Technology Innovation Center, 2025.
- Neural Network Architecture for Multi-Head Attention Operation Based on Transformer. SK Hynix, 2025.
- Multi-Chip-Module CIM Based Hybrid Sparse-Dense CIM Transformer Accelerator with Transpose Macro. KAIST, 2024.
- Apparatus for Calculating Deep Neural Network for Energy-Efficient Floating Point Calculation. KAIST, 2025.
- CIM Based on Hybrid Memory and Method for Storing Weight Thereof. Yonsei University, 2025.
- Generative AI Accelerator Apparatus Using In-Memory Compute Chiplet Devices for Transformer Workloads. D-Matrix Corporation, 2023.
- Scalable Array Architecture for In-Memory Computing. Princeton University, 2022.
- Heterogeneous Chiplet Architecture Simulation and Search Method and System for LLM Inference [面向LLM推理的异构芯粒架构仿真与搜索方法及系统]. Shanghai Jiao Tong University, 2026.
- Automatic Synthesis Method for Convolutional Neural Network Accelerator Architectures for Compute-in-Memory [面向存内计算的卷积神经网络加速器架构的自动综合方法]. Institute of Computing Technology, CAS, 2025.
- Method, Device, System for Processing-in-Memory Computation Offloading for AI Model Inference. Samsung SDS, 2025.
- All-Digital Compute-in-Memory Accelerator and Method for Energy-Efficient Attention Computation [面向高能效注意力计算的全数字存内计算加速器及方法]. Chongqing University, 2025.
- Attention Is All You Need. Vaswani et al., 2017 (foundational transformer architecture reference).
- JEDEC: DRAM and memory interface standards body.
- IEEE: Institute of Electrical and Electronics Engineers (ReRAM endurance standards and publications).
All data and statistics on this page are sourced from the references above and from PatSnap's proprietary innovation intelligence platform. Patent analysis conducted via PatSnap Eureka.