

Sparse Attention Mechanism: Reducing Quadratic Complexity — PatSnap Insights
AI & Machine Learning

Standard transformer self-attention scales as O(n²) with sequence length — a bottleneck that makes long-context inference increasingly impractical. Drawing on approximately 60 patent filings from Microsoft, Google, Tsinghua University, Peking University, Samsung, and others, this article maps the four dominant strategies researchers are patenting to break that quadratic barrier.

PatSnap Insights Team · Innovation Intelligence Analysts · 9 min read
Reviewed by the PatSnap Insights editorial team

Why O(n²) Complexity Breaks Long-Context Inference

Standard transformer self-attention computes pairwise interactions between every token and every other token in a sequence, producing a computational cost that scales as O(n²) with sequence length n. As documented across the patent dataset analysed here, this means that doubling a context window quadruples the computation — and as context windows grow to tens or hundreds of thousands of tokens, inference latency rises sharply and memory consumption increases significantly. The University of Electronic Science and Technology of China states this directly in its patent filings: the self-attention mechanism’s computational complexity grows quadratically with context length.

~60 patent records and technical disclosures analysed
4 dominant technical strategy categories identified
4 jurisdictions covered: CN, US, KR, PCT
O(n²) complexity of standard self-attention being replaced

The dataset examined here encompasses approximately 60 patent records and technical disclosures spanning China, the United States, South Korea, and international PCT filings. The dominant assignees include Microsoft Technology Licensing LLC, Chinese academic institutions such as Tsinghua University, Harbin Institute of Technology, Peking University, and the University of Electronic Science and Technology of China, as well as Google LLC and Samsung Electronics. Together, these filings map four broad solution categories: structured and dynamic sparsity masking applied to attention heads; token-level compression and reduction; hierarchical local-global attention decomposition; and anchor-based or block-level attention approximation.

Standard transformer self-attention requires O(n²) computation proportional to the square of sequence length, causing inference latency to rise sharply and memory consumption to increase significantly as context windows grow — a bottleneck documented across approximately 60 patent filings from institutions including Microsoft, Google, Tsinghua University, and Samsung Electronics.

What is quadratic attention complexity?

In standard self-attention, each of the n tokens in a sequence must compute a similarity score against all other n tokens. This produces an n×n attention matrix, meaning both memory and computation scale as n². For a sequence of 10,000 tokens, this is 100 million pairwise operations; at 100,000 tokens, it is 10 billion — making long-context inference computationally prohibitive without algorithmic intervention.
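The arithmetic above can be verified directly. A minimal sketch (the function name `pairwise_ops` is illustrative, not from any filing):

```python
def pairwise_ops(n: int) -> int:
    """Number of query-key similarity scores in dense self-attention."""
    return n * n

assert pairwise_ops(10_000) == 100_000_000       # 100 million at 10k tokens
assert pairwise_ops(100_000) == 10_000_000_000   # 10 billion at 100k tokens
# Doubling the context quadruples the cost:
assert pairwise_ops(2 * 4096) == 4 * pairwise_ops(4096)
```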

The core empirical insight motivating sparse attention research is that attention matrices are highly sparse in practice: only a small fraction of token pairs carry meaningful attention weights. This structural sparsity can be exploited systematically, and the patent filings in this dataset represent the current frontier of how that exploitation is being engineered and protected across four distinct technical paradigms.

Figure 1 — Four dominant sparse attention strategy categories identified across the patent dataset
Approximate filing counts by category: sparsity masking ~15; token compression ~12; global-local decomposition ~18; anchor/block and implicit approximation ~15 patent filings.
Approximate distribution of ~60 patent filings across the four dominant sparse attention strategy categories identified in the dataset; global-local decomposition represents the most active area of filing activity.

Masking, Pruning, and Dynamic Sparsity Patterns

Sparse attention masks reduce quadratic complexity by restricting each token’s attention to a subset of other tokens, so that only non-masked elements of the attention matrix are computed. The key empirical foundation is that attention matrices are highly sparse in practice — only a small fraction of token pairs carry meaningful weights — making systematic mask application both accurate and computationally effective.

Microsoft Technology Licensing’s patent filings introduce a two-stage pipeline for head-level sparsification. A calibration stage conducts a sparsity pattern search across all attention heads in the transformer layers; a subsequent inferencing stage then masks each head with the head-specific pattern discovered during calibration. By pre-filling context using precomputed sparse masks rather than dense attention scores, the system computes only the non-masked elements of the attention matrix — directly reducing the O(n²) computation to a sparser regime without incurring per-inference pattern search overhead. This calibration-then-inference separation is a deliberate engineering choice that makes the approach practical for deployment.
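To illustrate the calibration-then-inference split, the sketch below precomputes a hypothetical per-head mask (a causal sliding window plus a few global columns; these specific shapes are assumptions for illustration, not the patterns the calibration stage would actually discover) and reuses it at inference time:

```python
import numpy as np

def calibrate_head_mask(n: int, window: int, n_global: int) -> np.ndarray:
    """Hypothetical calibration result for one head: a causal sliding
    window plus a few always-attended 'global' columns. The real pipeline
    searches per-head patterns; this fixed shape is only an illustration."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    mask = (j <= i) & (i - j < window)    # causal local band
    mask |= (j < n_global) & (j <= i)     # global columns, still causal
    return mask

def masked_attention(q, k, v, mask):
    """Inference stage: reuse the precomputed mask. A deployed kernel
    computes only the unmasked score entries; dense masking is shown
    here for clarity."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)   # masked pairs contribute zero weight
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

Calibration runs once; every subsequent forward pass reuses `mask`, so no pattern search happens at inference time.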

Microsoft Technology Licensing’s sparse attention pipeline discovers head-specific sparsity patterns during a one-time calibration stage and reuses them at inference time, avoiding per-inference pattern search overhead while directly reducing the O(n²) attention computation to a sparser regime.

An alternative approach avoids a calibration pipeline entirely by deriving structural patterns from the attention matrix itself. A method from Sogang University (서강대학교산학협력단) first performs a full attention operation for an initial number of steps to compute baseline attention scores, then applies convolution patterns to detect diagonal and vertical structural regularities in those scores. The extracted positional information is encoded into a sparse pattern P, which is applied from the next step onward — reducing the total number of attention operations without requiring expensive per-inference pattern search. This convolution-derived approach is particularly well-suited to deployment scenarios where calibration data is unavailable.
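A rough sketch of pattern extraction in this spirit, using diagonal and column means as a simple stand-in for the patent's convolution filters (the `top_diags`/`top_cols` heuristics are assumptions, not the filing's method):

```python
import numpy as np

def extract_sparse_pattern(scores: np.ndarray, top_diags: int = 2,
                           top_cols: int = 2) -> np.ndarray:
    """Detect diagonal and vertical regularities in baseline attention
    scores and encode them as a boolean pattern P. Illustrative stand-in
    for convolution-based structure detection."""
    n = scores.shape[0]
    # Mean score along each causal diagonal offset: diagonal regularities.
    diag_strength = np.array([np.diagonal(scores, -off).mean() for off in range(n)])
    keep_diags = np.argsort(diag_strength)[-top_diags:]
    # Mean score down each column: vertical ('sink token') regularities.
    keep_cols = np.argsort(scores.mean(axis=0))[-top_cols:]
    P = np.zeros((n, n), dtype=bool)
    for off in keep_diags:
        idx = np.arange(off, n)
        P[idx, idx - off] = True
    P[:, keep_cols] = True
    return P
```

From the next decoding step onward, attention would be computed only where `P` is true, avoiding per-step pattern search.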

“By generating masks ahead of the actual computation — early mask generation — the method adjusts the data flow within the transformer, accelerating matrix operations while reducing computational load across independently optimised prefill and decode phases.”

Harbin Institute of Technology’s multi-stage dynamic sparse optimisation method determines an input matrix, then generates mask matrices to guide matrix multiplication independently at each stage of the transformer inference pipeline. Because the prefill and decode phases have fundamentally different computational profiles, this multi-stage design lets each be optimised on its own terms. For attention head importance-based pruning, a method from Shanghai Tianshu Zhixin Semiconductor computes an importance index for each attention head with respect to the current token before the attention module processes it, retains only the target heads, and performs inference solely over the retained set. The same filing applies sparsification to feed-forward neurons as well, creating a two-level sparsity strategy across the attention and MLP sublayers that compounds the reduction in total FLOPs per inference step. According to IEEE research on efficient neural network inference, combining attention and MLP sparsification is among the most effective routes to end-to-end inference acceleration.
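A toy sketch of head-importance pruning, with stated assumptions: the importance index used here (alignment between the current token's query and each head's mean cached key) is a hypothetical proxy computed before attention runs, not the filing's actual formula:

```python
import numpy as np

def select_heads(Q, K, keep: int) -> list:
    """Hypothetical importance index, computed before attention runs.
    Q: (H, d) current-token query per head; K: (H, n, d) cached keys."""
    importance = np.abs(np.einsum('hd,hd->h', Q, K.mean(axis=1)))
    return sorted(np.argsort(importance)[::-1][:keep].tolist())

def pruned_attention(Q, K, V, heads):
    """Run single-token attention only over the retained heads."""
    outs = []
    for h in heads:
        s = K[h] @ Q[h] / np.sqrt(Q.shape[-1])   # (n,) scores for head h
        w = np.exp(s - s.max())
        w /= w.sum()
        outs.append(w @ V[h])                    # (d,) head output
    return np.stack(outs)
```

Pruned heads never touch the score computation, so the per-step cost scales with the retained head count rather than the full head count.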

Explore the full patent landscape for sparse attention mechanisms in PatSnap Eureka.

Explore Patent Data in PatSnap Eureka →

Token Compression and Block-Based Attention: Reducing n Before the Quadratic Hits

Token compression approaches attack the quadratic problem from a different angle: rather than masking the attention matrix, they reduce the effective sequence length presented to the attention mechanism. If n tokens can be compressed to m representative tokens where m is much less than n, the resulting attention computation scales as O(m²) instead of O(n²) — a reduction that compounds across layers.

Zhejiang Lab’s anchor compression method inserts “anchor compression modules” into intermediate transformer layers. These modules perform importance analysis on the full input sequence to dynamically select a set of representative anchor positions. The anchors then serve as query vectors and, together with the original sequence, generate a compressed representation through sparse attention. Because the number of anchor positions is much smaller than the full sequence length, both attention computation complexity and GPU memory usage are reduced. Critically, the anchor compression module uses a plug-in architecture that requires no retraining of the base model, making it deployable on pre-trained models such as LLaMA, GPT, BERT, and Qwen.
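A minimal sketch of the anchor idea, assuming a simple norm-based importance proxy for anchor selection (the real module performs a learned importance analysis):

```python
import numpy as np

def anchor_compress(X, m: int):
    """Pick the top-m positions as anchors and let them attend over the
    full sequence to form a compressed representation. X: (n, d).
    The norm-based importance score is an illustrative assumption."""
    importance = np.linalg.norm(X, axis=1)        # hypothetical importance proxy
    anchors = np.sort(np.argsort(importance)[-m:])
    Q = X[anchors]                                # (m, d) anchor queries
    scores = Q @ X.T / np.sqrt(X.shape[1])        # (m, n): O(m·n), not O(n²)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ X                                  # (m, d) compressed sequence
```

Downstream attention over the compressed sequence then costs O(m²): at n = 64 and m = 8, that is 64 pairwise scores instead of 4,096.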

Zhejiang Lab’s anchor compression approach reduces transformer attention complexity by selecting representative anchor positions — far fewer than the full sequence length — and uses a plug-in architecture requiring no retraining, making it compatible with pre-trained models including LLaMA, GPT, BERT, and Qwen.

Nankai University’s token fusion method exploits the empirical observation that many tokens in each layer’s input and output are highly similar to one another. By grouping tokens according to similarity, generating a grouping mask matrix, and fusing similar tokens into a compressed representation before feeding them to the FFN module, the effective matrix dimensions entering subsequent computation are reduced. The method focuses specifically on the FFN sublayer — where parameters constitute a large share of total model parameters — as the primary beneficiary of this compression strategy.
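The grouping-and-fusing step can be sketched as a greedy cosine-similarity pass; the threshold value and mean-based fusion rule are illustrative choices, not the filing's exact procedure:

```python
import numpy as np

def fuse_similar_tokens(X, threshold: float = 0.95):
    """Assign each token to the first earlier group whose representative
    is cosine-similar above `threshold`, then fuse each group by averaging
    before the FFN sees it. X: (n, d). Returns (fused tokens, groups)."""
    normed = X / np.linalg.norm(X, axis=1, keepdims=True)
    reps, groups = [], []
    for i, x in enumerate(normed):
        for g, r in enumerate(reps):
            if float(x @ r) >= threshold:   # similar enough: join group g
                groups[g].append(i)
                break
        else:                               # no match: start a new group
            reps.append(x)
            groups.append([i])
    fused = np.stack([X[idx].mean(axis=0) for idx in groups])
    return fused, groups
```

The FFN then processes `fused` (fewer rows) instead of `X`, shrinking the matrix dimensions of the most parameter-heavy sublayer.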

Key finding: Precise-plus-fuzzy block attention

A method from Zhongshu (Xiamen) Information Technology partitions the key-value matrix into blocks, computes block-level averages, and uses correlation scores between the query and block averages — via top-k nucleus sampling logic — to determine which regions receive exact (precise) attention versus approximate (fuzzy) attention. Local neighbour regions receive precise computation; distant global regions receive block-averaged approximations. This combined strategy avoids the full quadratic cost while retaining both local and global contextual information.
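A single-query sketch of the precise-plus-fuzzy split, with plain top-k block selection standing in for the patent's sampling logic:

```python
import numpy as np

def precise_fuzzy_attention(q, K, V, block: int, local_blocks: int, topk: int):
    """Attend exactly to the `local_blocks` nearest key blocks, and to the
    `topk` best-matching distant blocks only via their block averages.
    q: (d,); K, V: (n, d) with n divisible by `block`."""
    n, d = K.shape
    nb = n // block
    Kb = K.reshape(nb, block, d)
    Vb = V.reshape(nb, block, d)
    K_avg = Kb.mean(axis=1)                        # (nb, d) block summaries
    corr = K_avg @ q                               # query-block correlation scores
    distant = np.arange(nb - local_blocks)         # all but the nearest blocks
    fuzzy = distant[np.argsort(corr[distant])[-topk:]]
    # Effective keys/values: exact local tokens + averaged distant blocks.
    K_eff = np.vstack([Kb[-local_blocks:].reshape(-1, d), K_avg[fuzzy]])
    V_eff = np.vstack([Vb[-local_blocks:].reshape(-1, d), Vb[fuzzy].mean(axis=1)])
    s = K_eff @ q / np.sqrt(d)
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V_eff
```

The query sees every region of the sequence, but only the local neighbourhood at token granularity; everything else enters at block granularity.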

The block-level analog to token fusion is described in CHAOS INDUSTRIES’ “Divide and attend long range block attention” method, which divides an input sequence of length n into N blocks, each shorter than n. Attention is computed within each block sequentially, with the output of the first block feeding as input for the computation of the next block’s output. This sequential block chaining provides long-range dependency modelling while keeping per-block attention computation sub-quadratic relative to the full sequence length — an approach that aligns with established ACM findings on sliding-window attention architectures for efficient sequence modelling.

Figure 2 — Sparse attention complexity reduction: from O(n²) full attention to sub-quadratic regimes
Relative compute cost by paradigm: full attention O(n²), very high (baseline); sparse masking, reduced; token compression O(m²) with m << n, medium; implicit/block methods, lowest (sub-quadratic).
Illustrative comparison of compute cost across the four sparse attention paradigms: token compression reduces complexity to O(m²) where m is much less than n, while implicit and block-based methods achieve the lowest relative cost by restructuring the computation graph entirely.

Global-Local Decomposition and Hierarchical Architectures

Global-local attention decomposition splits the attention computation into two components: dense local attention within a bounded window, and a compressed global representation computed at sub-quadratic cost. This preserves both fine-grained local precision and long-range contextual awareness without applying full O(n²) attention across the entire sequence.

Microsoft Technology Licensing’s SSM-enhanced transformer implements this with a two-layer encoder architecture. A global layer computes a global self-attention vector for each token in a local input sequence drawn from a larger global input. A local layer then computes local self-attention over each local sequence and combines it with the global vector through addition and normalisation. The encoder representation thus includes both global context — captured at lower cost via the global layer’s aggregation over sub-sequences — and local precision, enabling long-range modelling without O(n²) cost over the full global sequence. This architecture is protected via filings in both the United States and South Korea.
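A toy version of the global-local split, assuming mean-pooled window summaries as the cheap global representation (the actual filing uses an SSM-enhanced global layer, not mean pooling):

```python
import numpy as np

def softmax_rows(s):
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)

def global_local_encode(X, window: int):
    """Global component: every token attends over cheap window summaries.
    Local component: exact attention inside each window. The two are
    combined by addition and row normalisation. X: (n, d), n divisible
    by `window`."""
    n, d = X.shape
    nw = n // window
    summaries = X.reshape(nw, window, d).mean(axis=1)            # (nw, d)
    g = softmax_rows(X @ summaries.T / np.sqrt(d)) @ summaries   # (n, d) global vector
    out = np.empty_like(X)
    for b in range(nw):                                          # exact local attention
        Xl = X[b * window:(b + 1) * window]
        out[b * window:(b + 1) * window] = softmax_rows(Xl @ Xl.T / np.sqrt(d)) @ Xl
    combined = out + g                                           # addition
    mu = combined.mean(axis=1, keepdims=True)
    sd = combined.std(axis=1, keepdims=True)
    return (combined - mu) / (sd + 1e-6)                         # normalisation
```

With window size w, the local part costs O(n·w) and the global part O(n·n/w); choosing w near √n keeps both components sub-quadratic.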

Peking University’s instance-adaptive sparse pattern selection mechanism takes a different hierarchical approach. Shallow transformer layers process all input with standard attention; a mode selector attached to the last shallow layer analyses the output hidden vectors and assigns weights to multiple pre-defined sparse patterns. Each deep transformer layer then executes sparse attention weighted by the pattern importance scores. This instance-level adaptation enables the model to select different sparse patterns for different inputs, outperforming static single-pattern sparsification by tailoring computational allocation to individual input characteristics. Research published by Nature on adaptive neural architectures supports the finding that input-conditional computation allocation consistently outperforms fixed-budget approaches.
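A sketch of the mode-selector idea, with a random linear probe standing in for the trained selector and two toy patterns (both are assumptions for illustration):

```python
import numpy as np

def select_pattern_weights(hidden, patterns):
    """Pool the shallow layers' hidden states, score each pre-defined
    sparse pattern with a linear probe, and return softmax weights plus
    the weighted mask mixture for the deep layers.
    hidden: (tokens, d); patterns: list of (n, n) boolean masks."""
    rng = np.random.default_rng(0)
    d = hidden.shape[-1]
    probe = rng.standard_normal((d, len(patterns)))  # stand-in for the trained selector
    logits = hidden.mean(axis=0) @ probe             # one score per pattern
    w = np.exp(logits - logits.max())
    w /= w.sum()
    mixed = sum(wi * p.astype(float) for wi, p in zip(w, patterns))
    return w, mixed
```

Because the weights depend on the pooled hidden state, different inputs receive different pattern mixtures, which is the instance-level adaptation the filing describes.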

Peking University’s instance-adaptive sparse pattern selection uses shallow transformer layers to analyse input characteristics and assign weights to multiple pre-defined sparse patterns, enabling each deep transformer layer to execute sparse attention tailored to individual inputs — outperforming static sparse masking approaches.

Tsinghua University addresses a key limitation of uniform sparsity: the same sparse pattern cannot optimally serve different input lengths. Its method evaluates how each attention mask affects the model’s final prediction, determines accuracy loss across different input sequence lengths, and uses a Pareto-front search — with sequence length, accuracy loss, and density as objectives — to assign heterogeneous scaling rules, each with head-level hyperparameters, to different sparse attention heads at different input lengths. The result is a flexible sparse attention configuration that adapts to varying context window sizes, directly addressing the deployment challenge of variable-length inputs in production systems.

Shandong Inspur Scientific Research Institute introduces a “broken stick” segmentation process that encodes positional information directly into attention score calculation without separate positional encoding modules. This approach naturally induces a locality bias in the attention distribution — making the model more likely to attend to nearby tokens — while retaining the capacity to model long-range dependencies, reducing both model complexity and attention dispersion in long sequences. The University of Science and Technology of China similarly assigns different sparsity levels to attention heads and neurons based on their importance scores during inference, enabling dynamic layer-level sparsity optimisation at runtime.

Track the latest sparse attention and long-context LLM patent filings with PatSnap Eureka.

Search Sparse Attention Patents in PatSnap Eureka →

Patent Landscape: Who Is Filing and What They Are Building

The patent data reveals a concentration of sparse attention innovation across several distinct clusters of assignees, each pursuing specific angles on the complexity reduction problem. Microsoft Technology Licensing is the most prominent Western-headquartered filer in this dataset, with a coherent portfolio spanning dynamic per-head sparsity pattern search and global-local attention decomposition enhanced by state space models. Microsoft’s approach is notable for its calibration-then-inference pipeline, which avoids runtime pattern search overhead during inference.

Chinese academic institutions collectively constitute the most active group. Tsinghua University addresses heterogeneous sparse rule assignment across variable input lengths. Peking University contributes instance-adaptive sparse pattern selection. Zhejiang Lab introduces anchor-based compression for plug-in deployment. The University of Electronic Science and Technology of China focuses on load-balanced multi-GPU scheduling for heterogeneous sparse attention heads, addressing the load imbalance problem that arises when retrieval heads, sparse heads, and stride heads are deployed across multi-GPU systems — proposing asynchronous weight preloading combined with KV cache management as the solution.

Figure 3 — Key assignees and their primary sparse attention innovation focus areas
Microsoft Technology Licensing: head-level calibrated sparsity and SSM global-local decomposition (2024–2026)
Tsinghua University (清华大学): heterogeneous Pareto-front sparse rule assignment (2024)
Peking University (北京大学): instance-adaptive sparse pattern selection via shallow layers (2023)
Zhejiang Lab (之江实验室): plug-in anchor compression, no retraining required (2026)
Google LLC: structured factorisation for full attention at sparse cost (2022)
Samsung Electronics: trained mask generation for attention window expansion in LLMs (2025)
Applied Brain Research Inc.: implicit sub-quadratic attention via row-column similarities (2024)
Harbin Institute of Technology: multi-stage dynamic sparse optimisation with early mask generation (2025)
Key assignees identified in the sparse attention patent dataset and their primary technical focus areas; Chinese academic institutions and Microsoft Technology Licensing collectively account for the majority of filings.

Google LLC’s filing proposes a structured factorisation of the conditional distribution underlying self-attention, treating each position’s attention as a conditional expectation and enabling both direct and indirect attention — via group representations of local regions — to all other positions, achieving full attention capability with sparse computational cost. Samsung Electronics takes a deployment-first approach: its method applies a trained mask generation model that outputs layer-specific attention masks based on attention logits from recent tokens, effectively extending the practical attention window of an already-trained LLM without retraining.

Applied Brain Research Inc.’s implicit attention method represents the most fundamental departure from the masking paradigm. Rather than masking a quadratic computation, it reconstructs output vectors as matrices and transforms them via pairwise row-column similarities, yielding sub-quadratic complexity through a fundamentally different computation graph — one that entirely bypasses the explicit O(n²) score matrix. This approach is consistent with the direction signalled by WIPO’s global patent trend data, which shows increasing filings in alternative sequence modelling architectures that move beyond the standard attention formulation.

Applied Brain Research Inc.’s implicit attention method achieves sub-quadratic complexity by reconstructing sequential dependencies through pairwise row-column similarities on output vectors, entirely bypassing the explicit O(n²) attention score matrix — a fundamentally different computation graph from all masking-based approaches.


Still have questions? Let PatSnap Eureka answer them for you.

Ask PatSnap Eureka for a Deeper Answer →

References

  1. Dynamic sparsity patterns for attention heads (WO) — Microsoft Technology Licensing, LLC, 2026
  2. Dynamic sparsity patterns for attention heads (US) — Microsoft Technology Licensing, LLC, 2026
  3. Method and device for sparse attention in transformer model using a convolution filter — Sogang University Industry-Academic Cooperation Foundation, 2025
  4. Multi-stage dynamic sparse optimisation method for transformer accelerators — Harbin Institute of Technology, 2025
  5. LLM inference method with attention head importance pruning — Shanghai Tianshu Zhixin Semiconductor, 2025
  6. Anchor compression mechanism for multi-layer attention model inference acceleration — Zhejiang Lab, 2026
  7. Token fusion-based LLM inference optimisation — Nankai University, 2024
  8. Divide and attend long range block attention — CHAOS INDUSTRIES, INC., 2025
  9. Efficient attention computation method for transformer architecture — Zhongshu (Xiamen) Information Technology, 2025
  10. Long sequence modeling via SSM-enhanced transformer (US) — Microsoft Technology Licensing, LLC, 2024
  11. Long sequence modeling using SSM-enhanced transformer (KR) — Microsoft Technology Licensing, LLC, 2025
  12. Sparse attention computation model and method — Peking University, 2023
  13. Heterogeneous scaling rule auto-assignment for sparse attention — Tsinghua University, 2024
  14. Full attention with sparse computational cost — Google LLC, 2022
  15. Inference acceleration for LLM processing of extremely long texts — University of Electronic Science and Technology of China, 2025
  16. Inference acceleration for LLM processing of extremely long texts (multi-GPU scheduling) — University of Electronic Science and Technology of China, 2025
  17. Accelerated inference method for LLMs — University of Science and Technology of China, 2025
  18. Method and system for implicit attention with sub-quadratic complexity — Applied Brain Research Inc., 2024
  19. Memory Efficient Attention Window Expansion For Trained LLMs — Samsung Electronics Co., Ltd., 2025
  20. Attention mechanism improvement method for long sequences — Shandong Inspur Scientific Research Institute, 2025
  21. Dynamic sparsity-based LLM inference acceleration — National University of Defense Technology, 2025
  22. Long sequence modeling via SSM-enhanced transformer (WO) — Microsoft Technology Licensing, LLC, 2024
  23. WIPO — World Intellectual Property Organization: Global Patent Trend Data
  24. IEEE — Institute of Electrical and Electronics Engineers: Research on Efficient Neural Network Inference
  25. ACM — Association for Computing Machinery: Sliding-Window Attention Architectures
  26. Nature — Adaptive Neural Architectures and Input-Conditional Computation
  27. PatSnap Blog — Innovation Intelligence Research
  28. PatSnap Eureka — AI-Powered Patent and R&D Intelligence Platform

All data and statistics in this article are sourced from the references above and from PatSnap’s proprietary innovation intelligence platform.
