Compute-in-Memory for Transformer Inference — PatSnap Eureka

CIM · Transformer Inference

How Compute-in-Memory Eliminates Data Movement Energy in Transformer Inference

In large language model inference, data movement between compute units and memory can account for more than 80% of total system energy. CIM architectures perform multiply-accumulate operations directly within the memory array — eliminating the dominant energy cost at its source.

Figure: LLM inference energy breakdown. In conventional von Neumann LLM inference, data movement between compute units and multi-level caches consumes more than 80% of total system energy, leaving less than 20% for arithmetic. CIM architectures target the data movement share by performing MAC operations in-situ within the memory array. Source: PatSnap Eureka patent analysis, Shanghai Jiao Tong University patent (2026).
>80%: of LLM inference energy spent on data movement
1000×: energy efficiency improvement possible with CIM vs. CMOS
50+: patents analysed across KR, US, CN, EP (2020–2026)
~10: KAIST filings, the most prolific single assignee
The Root Cause

Why Transformer Inference Hits the Memory Wall

Transformer models impose a uniquely challenging memory access pattern on conventional hardware. The self-attention mechanism requires computing scaled dot-product attention over queries, keys, and values — operations whose memory bandwidth demands scale quadratically with sequence length. Conventional accelerators must continuously load weight matrices from off-chip DRAM through bandwidth-limited memory buses, causing the majority of system energy to be spent on data movement rather than arithmetic.
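
The quadratic scaling mentioned above can be made concrete with a few lines of arithmetic. The sketch below counts the entries of the query-key score matrices for one layer when a whole sequence is processed at once. The function names, the head count, and the 2-byte element size are illustrative assumptions, not figures from the patents.

```python
def attention_score_elements(seq_len: int, num_heads: int) -> int:
    """Entries in the query-key score matrices of one layer.

    Each head produces one seq_len x seq_len matrix, so the count grows
    with the square of the sequence length.
    """
    return num_heads * seq_len * seq_len


def attention_score_bytes(seq_len: int, num_heads: int, bytes_per_element: int = 2) -> int:
    """Bytes needed to hold those score matrices (2 bytes per entry assumed)."""
    return attention_score_elements(seq_len, num_heads) * bytes_per_element


if __name__ == "__main__":
    for n in (1024, 2048, 4096):
        mib = attention_score_bytes(n, num_heads=32) / 2**20
        print(f"seq_len={n}: {mib:.0f} MiB of attention scores per layer")
```

Doubling the sequence length quadruples the volume of intermediate scores, which is the traffic that a conventional accelerator has to push through its memory hierarchy.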

As documented in a 2026 Shanghai Jiao Tong University patent on heterogeneous LLM inference chiplet simulation, in large language model inference the repeated movement of weights and activations between compute units and multi-level caches can account for more than 80% of total system energy, severely constraining both energy efficiency and scalability.

Existing processors designed for convolutional neural networks cannot efficiently handle the unique computation flow of scaled dot-product attention, as noted in Zhejiang University's 2025 ReRAM-based process-in-memory patent. The fundamental mismatch is architectural: the von Neumann bottleneck was tolerable for CNN workloads but becomes crippling for transformer models where the KV-cache must be accessed at every token generation step.

CIM technology addresses this by embedding multiply-accumulate (MAC) units directly within memory arrays — whether SRAM, RRAM, or eDRAM — so that matrix-vector multiplications are executed in-situ, eliminating the dominant source of energy dissipation.

>80%: system energy consumed by data movement in LLM inference
2–3 orders of magnitude: energy efficiency gain, CIM vs. traditional CMOS CNN accelerators
O(n²): bandwidth scaling of self-attention with sequence length
2020–2026: patent filing window analysed; bulk concentrated in 2023–2026

  • MAC units embedded directly within SRAM, eDRAM, or ReRAM arrays
  • Weight data never leaves the memory die during inference
  • KV-cache vectors processed where they are stored
  • Quadratic attention bandwidth cost contained in-situ
Core CIM Mechanisms

How CIM Architectures Eliminate Data Movement at the Hardware Level

Six distinct hardware mechanisms have been patented for transformer inference acceleration, each targeting a different layer of the data movement problem.

eDRAM · KAIST 2025

Proximity Cell MAC-Tree in 3T-eDRAM

KAIST's Digital 3T-eDRAM CIM Macro stores weight data in sub-arrays that perform simultaneous read operations. An operator built as a "proximity cell MAC tree" performs multiply-accumulate operations directly on weight data read from the array and activation data injected through a macro controller. By keeping weight data resident in the eDRAM array and performing MAC operations at the array periphery rather than after a full memory read cycle, the architecture avoids repeated round-trips to external DRAM during transformer layer computations.

In-array MAC at periphery
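
As a functional illustration of a MAC tree, the sketch below multiplies weights by activations pairwise and then reduces the products with a balanced adder tree. It models only the arithmetic, not the 3T-eDRAM circuit, and the function name `mac_tree` is a hypothetical label rather than anything from the KAIST filing.

```python
def mac_tree(weights, activations):
    """Multiply pairwise, then sum the products with a balanced adder tree.

    Returns (dot_product, tree_depth). The depth is the number of adder
    stages a hardware tree of this width would need.
    """
    if len(weights) != len(activations):
        raise ValueError("weights and activations must have the same length")
    level = [w * a for w, a in zip(weights, activations)]
    depth = 0
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)  # pad odd-width levels with a zero operand
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        depth += 1
    return (level[0] if level else 0), depth


if __name__ == "__main__":
    row = [1, 2, 3, 4]          # one row of stored weights
    activation = [5, 6, 7, 8]   # activations injected by the controller
    print(mac_tree(row, activation))
```

A tree of width N needs about log2(N) adder stages, which is why the reduction can sit next to the array instead of passing every product through a separate datapath.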
Analog CIM · KAIST 2025

Voltage-Capacitance Dual Coding with Charge Reuse

KAIST's energy-efficient in-memory computation processor performs analog computation through voltage-capacitance dual coding, computation word line charge reuse, and a signal amplification pipeline culminating in time-to-digital conversion. The charge reuse mechanism is particularly significant: instead of fully resetting and recharging memory bitlines for every MAC operation, residual charge from prior computations is reused, directly reducing the dynamic switching energy that dominates analog CIM power budgets.

Charge reuse reduces switching energy
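
A first-order way to see why charge reuse helps is the textbook expression for the energy a supply delivers when it charges a capacitor from a starting voltage to the supply voltage: E = C · V_target · (V_target − V_start). The sketch below applies it to a single line. The capacitance and voltages are hypothetical values chosen for illustration, and this is a simplified model, not the KAIST circuit.

```python
def recharge_energy(capacitance_f: float, v_target: float, v_start: float = 0.0) -> float:
    """Energy (joules) drawn from a supply at v_target to charge a
    capacitor from v_start up to v_target."""
    if not 0.0 <= v_start <= v_target:
        raise ValueError("expected 0 <= v_start <= v_target")
    return capacitance_f * v_target * (v_target - v_start)


def reuse_saving(v_target: float, v_residual: float) -> float:
    """Fraction of supply energy saved when the line keeps v_residual
    instead of being reset to 0 V before the next operation."""
    full = recharge_energy(1.0, v_target)
    reused = recharge_energy(1.0, v_target, v_residual)
    return 1.0 - reused / full


if __name__ == "__main__":
    c_line = 50e-15  # 50 fF line capacitance (hypothetical)
    print(recharge_energy(c_line, 0.8))        # full reset and recharge
    print(recharge_energy(c_line, 0.8, 0.3))   # recharge from residual charge
    print(f"{reuse_saving(0.8, 0.3):.1%} saved")
```

In this simplified model the saving equals the ratio of residual voltage to target voltage, so the benefit depends on how much charge survives from the previous operation.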
ReRAM · Zhejiang University 2024–25

ReRAM Crossbar with In-Situ Matrix Decomposition

Zhejiang University's ReRAM-based CIM architecture decomposes transformer weight matrices using the Re-Transformer algorithm to reduce compute and write operands before mapping them onto ReRAM crossbar arrays. Because ReRAM cells store weights as physical conductance states, no energy is spent fetching weights from external DRAM during inference — the MAC operation occurs as current flows through the crossbar in response to applied input voltages. A hybrid softmax unit based on resistive RAM selection-and-comparison logic further reduces power by avoiding multi-pass digital softmax implementations.

Conductance-state weight storage
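
The in-crossbar MAC described above follows from Ohm's law and Kirchhoff's current law: each cell passes a current equal to its input voltage times its conductance, and the currents on a shared column add. The sketch below is an idealized numerical model (no wire resistance, noise, or ADC quantization); the conductance and voltage values are illustrative assumptions.

```python
def crossbar_mvm(conductances, voltages):
    """Ideal crossbar read: column current j = sum_i voltages[i] * conductances[i][j].

    conductances: one list per input row, in siemens.
    voltages: one input voltage per row, in volts.
    Returns the column currents in amperes.
    """
    if len(conductances) != len(voltages):
        raise ValueError("one input voltage is required per crossbar row")
    num_cols = len(conductances[0])
    currents = [0.0] * num_cols
    for row, v in zip(conductances, voltages):
        for j in range(num_cols):
            currents[j] += v * row[j]  # Ohm's law per cell, summed on the column
    return currents


if __name__ == "__main__":
    g = [[1e-6, 2e-6],   # weights stored as conductance states
         [3e-6, 4e-6]]
    v_in = [0.1, 0.2]    # activations applied as row voltages
    print(crossbar_mvm(g, v_in))
```

The weights never move in this model: only the input voltages and the resulting column currents cross the array boundary.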
Hybrid CIM-NPU · KAIST 2025

Floating-Point Outlier Segregation

Transformer activations frequently exhibit outlier values with large dynamic range that are poorly served by fixed-point arithmetic in CIM arrays. KAIST's hybrid processor classifies input data into inliers and outliers. Inlier data undergoes fixed-point operations in a CIM operator, while outlier data is routed to a parallel NPU operator that performs floating-point arithmetic. An aggregation core then sums both result streams, capturing CIM energy benefits for the majority of computations while preserving numerical accuracy for the tail distribution of outlier activations.

Inlier CIM + outlier NPU parallel path
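
The inlier/outlier split can be sketched as a dot product with two accumulators. In the sketch below, the magnitude threshold, the quantization scale, and the 8-bit clamp are hypothetical choices made for illustration; the source does not specify how the KAIST design classifies its data.

```python
INT8_MAX = 127


def hybrid_dot(weights, activations, threshold, scale):
    """Dot product with inliers on an integer path and outliers on a float path."""
    inlier_acc = 0      # integer accumulator, standing in for the CIM operator
    outlier_acc = 0.0   # float accumulator, standing in for the NPU operator
    for w, a in zip(weights, activations):
        if abs(a) <= threshold:
            inlier_acc += w * round(a / scale)
        else:
            outlier_acc += w * a
    return inlier_acc * scale + outlier_acc  # aggregation step


def fixed_point_only(weights, activations, scale):
    """Baseline: force every activation through a clamped 8-bit integer path."""
    acc = 0
    for w, a in zip(weights, activations):
        q = max(-INT8_MAX, min(INT8_MAX, round(a / scale)))
        acc += w * q
    return acc * scale


if __name__ == "__main__":
    w = [2, -1, 3, 1]
    a = [0.5, -0.25, 40.0, 0.75]   # 40.0 is the outlier
    print(hybrid_dot(w, a, threshold=1.0, scale=0.25))   # matches the exact result
    print(fixed_point_only(w, a, scale=0.25))            # outlier is clipped
```

With these values the hybrid path reproduces the exact dot product, while the clamped fixed-point baseline loses most of the outlier's contribution.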
Sparsity Gating · Qualcomm 2025

Selective Bit-Cell Disabling for Zero-Valued Inputs

When the sparsity of input data exceeds a threshold, individual bit cells within the CIM array are selectively disabled prior to processing, preventing unnecessary switching activity. A compensation value is then applied to the output to correct for the effect of disabled cells. This mechanism extends energy savings beyond data movement to the MAC computation itself and can be applied to transformer attention activations where many query-key inner products produce near-zero attention weights.

Zero-input cell gating
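
One setting in which a compensation value arises naturally is XNOR-popcount arithmetic, where weights are ±1 and the readout assumes every cell in the row took part. The sketch below uses that setting purely as an illustrative assumption; it is not taken from the Qualcomm patent, and the function names are hypothetical. Only the gated path is modelled.

```python
def sparsity(inputs):
    """Fraction of zero-valued inputs."""
    return sum(1 for x in inputs if x == 0) / len(inputs)


def gated_xnor_dot(weights, inputs):
    """Dot product of +/-1 weights with {-1, 0, +1} inputs.

    Cells whose input is zero are disabled, so they add nothing to the
    popcount. The fixed readout 2 * popcount - n then treats each
    disabled cell as a -1, and the compensation adds that back.
    Returns (result, number_of_disabled_cells).
    """
    n = len(weights)
    popcount = 0
    disabled = 0
    for w, x in zip(weights, inputs):
        if x == 0:
            disabled += 1        # gated cell: no switching activity
        elif w == x:
            popcount += 1        # XNOR is 1 when the signs agree
    raw = 2 * popcount - n       # readout that assumes n active cells
    compensation = disabled
    return raw + compensation, disabled


if __name__ == "__main__":
    w = [1, -1, 1, 1, -1, 1, -1, 1]
    x = [1, 0, 0, -1, 0, 0, -1, 0]
    if sparsity(x) > 0.5:        # hypothetical gating threshold
        print(gated_xnor_dot(w, x))
```

Here five of the eight cells stay idle, and the compensated result equals the ordinary dot product of the two vectors.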
Weight Remapping · Yonsei University 2025

Bit-Inversion Minimization for Nonvolatile Write Energy

An often-overlooked data movement energy cost is writing updated weights into nonvolatile CIM arrays. Yonsei University's hybrid memory CIM architecture remaps neural network layer weights across first and second CIM arrays to minimize the number of bit inversions during sequential weight storage. By selecting the weight array ordering that minimizes bit transitions, the architecture reduces both write energy and the endurance degradation of nonvolatile memory cells — meaningful for transformer models whose weights change between fine-tuning tasks.

Write energy via bit-inversion reduction
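
The idea of choosing a weight ordering that minimizes bit inversions can be sketched with XOR and a popcount. The example below tries every assignment of incoming layers to arrays and keeps the one with the fewest flipped bits. The exhaustive search, the 8-bit words, and the function names are illustrative assumptions rather than the Yonsei method itself.

```python
from itertools import permutations


def bit_flips(stored_words, new_words):
    """Number of bit inversions needed to overwrite stored_words with new_words."""
    return sum(bin(old ^ new).count("1") for old, new in zip(stored_words, new_words))


def best_assignment(arrays, layers):
    """Pick which layer goes into which array so total bit flips are minimal.

    Returns (order, cost), where order[slot] is the index of the layer
    written into arrays[slot].
    """
    def cost(order):
        return sum(bit_flips(arrays[slot], layers[layer]) for slot, layer in enumerate(order))

    order = min(permutations(range(len(layers))), key=cost)
    return order, cost(order)


if __name__ == "__main__":
    arrays = [[0b11110000, 0b10101010],   # current contents of the first array
              [0b00001111, 0b01010101]]   # current contents of the second array
    layers = [[0b00001110, 0b01010111],   # incoming weights, layer A
              [0b11110001, 0b10101000]]   # incoming weights, layer B
    print(best_assignment(arrays, layers))
```

Writing layer A into the first array would flip 28 bits in total, while the swapped ordering flips 4, so the write energy and cell wear differ substantially for the same stored result.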
Patent Intelligence

CIM Innovation Landscape: Key Data Points

Patent filing trends and assignee activity derived from over 50 CIM patents filed 2020–2026, analysed via PatSnap Eureka.

CIM Patent Filings by Assignee (2020–2026)

KAIST leads with ~10 distinct filings; Qualcomm, IBM, Zhejiang and Tsinghua follow as the next tier of active assignees.

Figure: CIM patent filings by assignee, 2020–2026 (approximate counts): KAIST ~10, Qualcomm ~4, IBM ~3, Zhejiang University ~2, Tsinghua University ~2, SK Hynix ~1, Princeton ~1. KAIST's filings cover eDRAM, analog CIM, sparsity, and attention fusion architectures. Source: PatSnap Eureka, 50+ CIM patents, 2020–2026.

CIM Co-Optimisation Strategies in Transformer Patents

Sparsity exploitation, quantization, attention fusion, and 3D integration are the four most frequently cited strategies layered atop base CIM approaches.

Figure: Relative citation frequency of co-optimisation strategies across the 50+ patent corpus: sparsity exploitation (most cited), quantization (high), attention fusion (transformer-specific), 3D integration (growing), weight remapping (emerging). Sparsity exploitation and quantization appear most broadly across assignees; 3D integration targets residual inter-module data movement. Source: PatSnap Eureka.

Transformer-Specific Innovations

Beyond Generic CIM: Architectures Built for Attention

The most impactful patents in this corpus go beyond applying CIM to generic DNN layers — they redesign the computation graph of self-attention itself to eliminate intermediate data movement.


Attention Fusion: QK × Softmax × SV in One In-Memory Pass

KAIST's Attention Fusion PIM Architecture (2025) merges query-key multiplication, softmax normalization, and value-weighted summation into a single contiguous in-memory computation, eliminating the need to write intermediate attention scores back to external memory between operations. Each attention-PIM cluster contains PIM engines for matrix multiplication, a vector operator for post-processing, and an attention memory for intermediate result storage — all on-die.
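
To show how the three attention steps can run without storing the full score vector, the sketch below uses the well-known online-softmax formulation for a single query: scores, normalization, and the value-weighted sum are updated together as each key/value pair is visited. This is a software technique swapped in for illustration and is not claimed to be the KAIST hardware dataflow.

```python
import math


def fused_attention(query, keys, values):
    """Single-query attention computed in one pass over the keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        score = scale * sum(q * x for q, x in zip(query, k))
        new_max = max(running_max, score)
        correction = math.exp(running_max - new_max)  # rescales earlier partial sums
        weight = math.exp(score - new_max)
        denom = denom * correction + weight
        acc = [a * correction + weight * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]


def three_pass_attention(query, keys, values):
    """Reference version that materializes every score before the softmax."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * x for q, x in zip(query, k)) for k in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [sum(e * v[j] for e, v in zip(exps, values)) / total
            for j in range(len(values[0]))]


if __name__ == "__main__":
    q = [1.0, 0.0]
    ks = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]]
    vs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(fused_attention(q, ks, vs))
    print(three_pass_attention(q, ks, vs))
```

Both functions return the same output up to rounding, but the fused version only keeps a running maximum, a running denominator, and one accumulator vector, which is the kind of state that can stay in a small on-die memory.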

Triple Sparsity Handling: Weight, Activation, and Attention Score

The same KAIST attention fusion system simultaneously exploits weight sparsity, activation sparsity, and attention score sparsity to gate unnecessary MAC operations. This triple sparsity approach is uniquely suited to transformer models where sparse attention patterns emerge naturally from the softmax distribution — many query-key inner products produce near-zero attention weights that can be gated without accuracy loss.


KV-Cache PIM: Processing Key and Value Vectors Where They Live

SK Hynix's Neural Network Architecture for Multi-Head Attention (2025) distributes multi-head attention computation across PIM-enabled memory banks. Each PIM device contains memory banks storing key vectors and value vectors, with co-located processing units that execute attention operations using those locally resident vectors. Key and value vectors are stored in different layouts, each matched to the way the attention computation accesses it, which directly targets the KV-cache access bottleneck in autoregressive transformer inference.
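
A rough byte count shows what is at stake when key and value vectors are processed where they are stored. The sketch below compares per-token bus traffic for one layer under two idealized models: a conventional accelerator that reads the whole KV-cache, and a PIM arrangement in which each head's cache stays in its bank. The accounting, the model dimensions, and the function names are simplifying assumptions, not figures from the SK Hynix patent.

```python
def conventional_bytes_per_token(cached_tokens, num_heads, head_dim, bytes_per_element=2):
    """Read every cached key and value vector, then write the new key and value."""
    vectors_moved = 2 * cached_tokens + 2
    return num_heads * head_dim * bytes_per_element * vectors_moved


def pim_bytes_per_token(num_heads, head_dim, bytes_per_element=2):
    """Send the new query, key, and value in; read the attention output back."""
    vectors_moved = 4
    return num_heads * head_dim * bytes_per_element * vectors_moved


if __name__ == "__main__":
    conv = conventional_bytes_per_token(cached_tokens=4096, num_heads=32, head_dim=128)
    pim = pim_bytes_per_token(num_heads=32, head_dim=128)
    print(conv, pim, conv / pim)
```

Under these assumptions the conventional traffic grows linearly with the number of cached tokens, while the PIM traffic stays constant, so the gap widens as the context gets longer.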

Memory Substrate Choices

SRAM vs. eDRAM vs. ReRAM: Tradeoffs for Transformer CIM

The dominant hardware substrate choices across the patent corpus are SRAM, embedded DRAM (eDRAM), and resistive RAM (ReRAM/RRAM), each offering distinct tradeoffs in speed, endurance, and in-situ computation density. The choice of memory technology fundamentally shapes what CIM operations are practical and what energy savings are achievable.

eDRAM is favoured by KAIST's Digital 3T-eDRAM CIM Macro for its higher density than SRAM and its suitability for the proximity cell MAC-tree architecture. The 3T cell structure allows simultaneous read operations across sub-arrays, enabling the MAC tree to operate at the array periphery without a full memory read cycle. The tradeoff is refresh overhead and slightly lower speed compared to SRAM.

ReRAM/RRAM is the substrate of choice for Zhejiang University and Tsinghua University because weights are stored as physical conductance states. This nonvolatile representation needs no power to retain the weights and no weight fetches from external DRAM during inference. The MAC operation occurs as current flows through the crossbar in response to applied input voltages. The Re-Transformer matrix decomposition algorithm reduces the number of write operands before they reach the ReRAM array, compounding the energy savings. The endurance limitation of ReRAM cells motivates Yonsei University's bit-inversion minimization technique for weight remapping.

Hybrid approaches such as IBM's 2D mesh architecture combine analog CIM tiles (for high-efficiency matrix-vector multiplication, MVM) with digital compute cores (for non-linear functions like softmax and layer normalization). Princeton University's scalable array architecture pairs an in-memory compute array for MVM operations with a near-memory compute SIMD unit for element-wise operations. This dual-mode capability is critical for transformer layers that alternate between weight-dominated linear projections and activation-dominated operations.

Memory   | Key Advantage                                | Key Assignee
eDRAM    | High density; proximity MAC tree             | KAIST
ReRAM    | Conductance-state weights; zero fetch energy | Zhejiang / Tsinghua
SRAM     | Speed; digital CIM compatibility             | Qualcomm / IBM
DRAM-PIM | KV-cache colocation; memory vendor approach  | SK Hynix
Competitive Landscape

Key Assignees Driving CIM for Transformer Inference

The patent corpus spans filings from academia, semiconductor companies, and memory vendors — each approaching the data movement problem from a different vantage point.

~10 filings · Korea

KAIST (Korea Advanced Institute of Science and Technology)

The most prolific single assignee in this dataset, with patents spanning eDRAM-based CIM macros for transformer matrix multiplication, energy-efficient analog CIM processors with charge reuse, hybrid floating-point/fixed-point CIM for outlier handling, attention-fusion PIM with triple sparsity, hybrid sparse-dense transformer accelerators, and end-to-end on-device training PIM accelerators. KAIST filings consistently target the specific computational graph of transformer self-attention rather than generic DNN acceleration.

Transformer self-attention specialist
Multi-jurisdiction · US / CN / IN

Qualcomm Incorporated

Has filed sparsity-aware CIM patents in multiple jurisdictions (US 2025, CN 2024, IN 2024), along with CIM architectures for depthwise convolution, signaling a focus on edge deployment of machine learning across mobile and IoT hardware platforms. Qualcomm's sparsity-aware approach selectively disables CIM bit cells when input sparsity exceeds a threshold, extending energy savings beyond data movement to the MAC computation itself.

Edge ML · mobile & IoT focus
US + WO + CN · 2023–25

International Business Machines Corporation (IBM)

Holds the two-dimensional mesh CIM accelerator architecture in the US, WO, and CN jurisdictions. IBM's approach is distinctive in combining analog CIM tiles with digital compute cores in a hybrid fabric, targeting large-scale DNN inference where analog-domain MAC operations provide high energy efficiency and digital cores handle non-linear functions. The 2D mesh topology enables weight matrices too large for a single tile to be partitioned across adjacent tiles with localized partial-sum accumulation.

Analog-digital hybrid mesh
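
Partitioning a large weight matrix across tiles and accumulating partial sums can be sketched as a blocked matrix-vector product. The tile dimensions and the function name below are hypothetical, and the model ignores the analog behaviour of the tiles and the routing between them.

```python
def tiled_mvm(matrix, vector, tile_rows, tile_cols):
    """Matrix-vector product computed tile by tile.

    Each tile holds a tile_rows x tile_cols block of the matrix and
    produces partial sums for its output rows. Partial sums from tiles
    that share the same rows are accumulated into the final output.
    Returns (output, tiles_used).
    """
    n_out, n_in = len(matrix), len(vector)
    output = [0] * n_out
    tiles_used = 0
    for r0 in range(0, n_out, tile_rows):
        for c0 in range(0, n_in, tile_cols):
            tiles_used += 1
            for r in range(r0, min(r0 + tile_rows, n_out)):
                partial = sum(matrix[r][c] * vector[c]
                              for c in range(c0, min(c0 + tile_cols, n_in)))
                output[r] += partial  # partial-sum accumulation across tiles
    return output, tiles_used


if __name__ == "__main__":
    weights = [[1, 2, 3, 4],
               [5, 6, 7, 8],
               [9, 10, 11, 12]]
    x = [1, 0, -1, 2]
    print(tiled_mvm(weights, x, tile_rows=2, tile_cols=2))
```

The result matches an untiled product, and only the per-tile partial sums need to travel between neighbouring tiles rather than the weights themselves.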
CN + US continuations · 2024–25

Zhejiang University

Pursues ReRAM-based CIM for transformer self-attention, with both Chinese and US continuations of the same matrix-decomposition architecture, signaling intent to build international IP around this specific algorithmic-hardware co-design. The Re-Transformer algorithm reduces the number of compute and write operands before mapping onto ReRAM crossbar arrays, compounding the in-situ energy savings with algorithmic reduction of the operand count itself.

ReRAM + algorithmic co-design
Filing Velocity

CIM Patent Activity: Acceleration in 2023–2026

The bulk of active CIM patents for transformer inference are concentrated in the 2023–2026 window, reflecting the rapid maturation of LLM deployment as a commercial priority.

CIM Patent Filing Activity by Year (Indexed, 2020–2026)

Filing activity accelerated sharply from 2022 onward, coinciding with the commercial deployment of large language models and the emergence of transformer-specific CIM architectures.

Figure: Indexed CIM patent filing activity by year, 2020–2026: 2020 low, 2021 low, 2022 moderate, 2023 rising, 2024 high, 2025 peak, 2026 active (partial year). Source: PatSnap Eureka, 50+ CIM transformer inference patents, 2020–2026.


References

  1. Digital 3T-eDRAM Based CIM Macro for Accelerating Matrix Multiplications in Transformer Model with High-Accuracy and High Compute-Efficiency. KAIST, 2025
  2. Energy-Efficient In-Memory Computation Processor and Method Using Neural Network Data Distribution. KAIST, 2025
  3. Process-in-Memory Architecture Based on Resistive Random Access Memory and Matrix Decomposition Acceleration Algorithm. Zhejiang University, 2025
  4. In-Memory Architecture Based on Resistive Random Access Memory and a Matrix Decomposition Acceleration Algorithm (基于阻变存储器和矩阵分解加速算法的存内架构). Zhejiang University, 2024
  5. Attention Fusion Processing-in-Memory Architecture for Transformer Acceleration with Triple Sparsity-Handling. KAIST, 2025
  6. Sparsity-Aware Compute-in-Memory. Qualcomm Incorporated, 2025
  7. Sparsity-Aware Compute-in-Memory (稀疏性感知的存算一体). Qualcomm, 2024
  8. Sparsity-Aware In-Memory Computing. Qualcomm Incorporated, 2024
  9. Two-Dimensional Mesh for Compute-in-Memory Accelerator Architecture. IBM, 2023
  10. Two-Dimensional Mesh for Compute-in-Memory Accelerator Architecture. IBM, 2025
  11. Two-Dimensional Mesh for Compute-in-Memory Accelerator Architecture (用于存储器内计算加速器架构的二维网格). IBM, 2024
  12. Transformer Accelerator Architecture Based on Monolithic 3D Integration (基于单片三维集成的Transformer加速器架构). Tsinghua University, 2024
  13. Processing Method and Apparatus for 3D Heterogeneously Integrated Compute-in-Memory (三维异构集成存算一体处理方法及装置). Northern Integrated Circuit Technology Innovation Center, 2025
  14. Neural Network Architecture for Multi-Head Attention Operation Based on Transformer. SK Hynix, 2025
  15. Multi-Chip-Module CIM Based Hybrid Sparse-Dense CIM Transformer Accelerator with Transpose Macro. KAIST, 2024
  16. Apparatus for Calculating Deep Neural Network for Energy-Efficient Floating Point Calculation. KAIST, 2025
  17. CIM Based on Hybrid Memory and Method for Storing Weight Thereof. Yonsei University, 2025
  18. Generative AI Accelerator Apparatus Using In-Memory Compute Chiplet Devices for Transformer Workloads. D-Matrix Corporation, 2023
  19. Scalable Array Architecture for In-Memory Computing. Princeton University, 2022
  20. Simulation and Search Method and System for Heterogeneous Chiplet Architectures for LLM Inference (面向LLM推理的异构芯粒架构仿真与搜索方法及系统). Shanghai Jiao Tong University, 2026
  21. Automatic Synthesis Method for Convolutional Neural Network Accelerator Architectures for In-Memory Computing (面向存内计算的卷积神经网络加速器架构的自动综合方法). Institute of Computing Technology, CAS, 2025
  22. Method, Device, System for Processing-in-Memory Computation Offloading for AI Model Inference. Samsung SDS, 2025
  23. All-Digital Compute-in-Memory Accelerator and Method for Energy-Efficient Attention Computation (面向高能效注意力计算的全数字存内计算加速器及方法). Chongqing University, 2025
  24. Attention Is All You Need. Vaswani et al., 2017 (foundational transformer architecture reference)
  25. JEDEC: DRAM and memory interface standards body
  26. IEEE: Institute of Electrical and Electronics Engineers (ReRAM endurance standards and publications)

All data and statistics on this page are sourced from the references above and from PatSnap's proprietary innovation intelligence platform. Patent analysis conducted via PatSnap Eureka.
