

Sparse Attention Mechanism: Reducing Quadratic Complexity — PatSnap Insights
AI & Machine Learning

Standard transformer self-attention scales as O(n²) with sequence length — a bottleneck that makes long-context inference increasingly impractical. Drawing on approximately 60 patent filings from Microsoft, Google, Tsinghua University, Peking University, Samsung, and others, this article maps the four dominant strategies researchers are patenting to break that quadratic barrier.

PatSnap Insights Team · Innovation Intelligence Analysts · 9 min read
Reviewed by the PatSnap Insights editorial team

Why O(n²) Complexity Breaks Long-Context Inference

Standard transformer self-attention computes pairwise interactions between every token and every other token in a sequence, producing a computational cost that scales as O(n²) with sequence length n. As documented across the patent dataset analysed here, this means that doubling a context window quadruples the computation — and as context windows grow to tens or hundreds of thousands of tokens, inference latency rises sharply and memory consumption increases significantly. The University of Electronic Science and Technology of China states this directly in its patent filings: the self-attention mechanism’s computational complexity grows quadratically with context length.

~60 patent records and technical disclosures analysed
4 dominant technical strategy categories identified
4 jurisdictions covered: CN, US, KR, PCT
O(n²) complexity of standard self-attention being replaced

The dataset examined here encompasses approximately 60 patent records and technical disclosures spanning China, the United States, South Korea, and international PCT filings. The dominant assignees include Microsoft Technology Licensing LLC, Chinese academic institutions such as Tsinghua University, Harbin Institute of Technology, Peking University, and the University of Electronic Science and Technology of China, as well as Google LLC and Samsung Electronics. Together, these filings map four broad solution categories: structured and dynamic sparsity masking applied to attention heads; token-level compression and reduction; hierarchical local-global attention decomposition; and anchor-based or block-level attention approximation.

Standard transformer self-attention requires O(n²) computation proportional to the square of sequence length, causing inference latency to rise sharply and memory consumption to increase significantly as context windows grow — a bottleneck documented across approximately 60 patent filings from institutions including Microsoft, Google, Tsinghua University, and Samsung Electronics.

What is quadratic attention complexity?

In standard self-attention, each of the n tokens in a sequence must compute a similarity score against all other n tokens. This produces an n×n attention matrix, meaning both memory and computation scale as n². For a sequence of 10,000 tokens, this is 100 million pairwise operations; at 100,000 tokens, it is 10 billion — making long-context inference computationally prohibitive without algorithmic intervention.
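The arithmetic above can be verified directly. A minimal sketch (the function name `pairwise_ops` is illustrative, not from any filing):

```python
def pairwise_ops(n: int) -> int:
    """Number of query-key similarity scores in dense self-attention."""
    return n * n

assert pairwise_ops(10_000) == 100_000_000       # 100 million at 10k tokens
assert pairwise_ops(100_000) == 10_000_000_000   # 10 billion at 100k tokens
# Doubling the context quadruples the cost:
assert pairwise_ops(2 * 4096) == 4 * pairwise_ops(4096)
```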

The core empirical insight motivating sparse attention research is that attention matrices are highly sparse in practice: only a small fraction of token pairs carry meaningful attention weights. This structural sparsity can be exploited systematically, and the patent filings in this dataset represent the current frontier of how that exploitation is being engineered and protected across four distinct technical paradigms.

Figure 1 — Four dominant sparse attention strategy categories identified across the patent dataset
Approximate filing counts by category: sparsity masking ~15; token compression ~12; global-local decomposition ~18; anchor/block and implicit approximation ~15 patent filings.
Approximate distribution of ~60 patent filings across the four dominant sparse attention strategy categories identified in the dataset; global-local decomposition represents the most active area of filing activity.

Masking, Pruning, and Dynamic Sparsity Patterns

Sparse attention masks reduce quadratic complexity by restricting each token’s attention to a subset of other tokens, so that only non-masked elements of the attention matrix are computed. The key empirical foundation is that attention matrices are highly sparse in practice — only a small fraction of token pairs carry meaningful weights — making systematic mask application both accurate and computationally effective.

Microsoft Technology Licensing’s patent filings introduce a two-stage pipeline for head-level sparsification. A calibration stage conducts a sparsity pattern search across all attention heads in the transformer layers; a subsequent inferencing stage then masks each head with the head-specific pattern discovered during calibration. By pre-filling context using precomputed sparse masks rather than dense attention scores, the system computes only the non-masked elements of the attention matrix — directly reducing the O(n²) computation to a sparser regime without incurring per-inference pattern search overhead. This calibration-then-inference separation is a deliberate engineering choice that makes the approach practical for deployment.
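To illustrate the calibration-then-inference split, the sketch below precomputes a hypothetical per-head mask (a causal sliding window plus a few global columns; these specific shapes are assumptions for illustration, not the patterns the calibration stage would actually discover) and reuses it at inference time:

```python
import numpy as np

def calibrate_head_mask(n: int, window: int, n_global: int) -> np.ndarray:
    """Hypothetical calibration result for one head: a causal sliding
    window plus a few always-attended 'global' columns. The real pipeline
    searches per-head patterns; this fixed shape is only an illustration."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    mask = (j <= i) & (i - j < window)    # causal local band
    mask |= (j < n_global) & (j <= i)     # global columns, still causal
    return mask

def masked_attention(q, k, v, mask):
    """Inference stage: reuse the precomputed mask. A deployed kernel
    computes only the unmasked score entries; dense masking is shown
    here for clarity."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)   # masked pairs contribute zero weight
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

Calibration runs once; every subsequent forward pass reuses `mask`, so no pattern search happens at inference time.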

Microsoft Technology Licensing’s sparse attention pipeline discovers head-specific sparsity patterns during a one-time calibration stage and reuses them at inference time, avoiding per-inference pattern search overhead while directly reducing the O(n²) attention computation to a sparser regime.

An alternative approach avoids a calibration pipeline entirely by deriving structural patterns from the attention matrix itself. A method from Sogang University (서강대학교산학협력단) first performs a full attention operation for an initial number of steps to compute baseline attention scores, then applies convolution patterns to detect diagonal and vertical structural regularities in those scores. The extracted positional information is encoded into a sparse pattern P, which is applied from the next step onward — reducing the total number of attention operations without requiring expensive per-inference pattern search. This convolution-derived approach is particularly well-suited to deployment scenarios where calibration data is unavailable.
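A rough sketch of pattern extraction in this spirit, using diagonal and column means as a simple stand-in for the patent's convolution filters (the `top_diags`/`top_cols` heuristics are assumptions, not the filing's method):

```python
import numpy as np

def extract_sparse_pattern(scores: np.ndarray, top_diags: int = 2,
                           top_cols: int = 2) -> np.ndarray:
    """Detect diagonal and vertical regularities in baseline attention
    scores and encode them as a boolean pattern P. Illustrative stand-in
    for convolution-based structure detection."""
    n = scores.shape[0]
    # Mean score along each causal diagonal offset: diagonal regularities.
    diag_strength = np.array([np.diagonal(scores, -off).mean() for off in range(n)])
    keep_diags = np.argsort(diag_strength)[-top_diags:]
    # Mean score down each column: vertical ('sink token') regularities.
    keep_cols = np.argsort(scores.mean(axis=0))[-top_cols:]
    P = np.zeros((n, n), dtype=bool)
    for off in keep_diags:
        idx = np.arange(off, n)
        P[idx, idx - off] = True
    P[:, keep_cols] = True
    return P
```

From the next decoding step onward, attention would be computed only where `P` is true, avoiding per-step pattern search.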

“By generating masks ahead of the actual computation — early mask generation — the method adjusts the data flow within the transformer, accelerating matrix operations while reducing computational load across independently optimised prefill and decode phases.”

Harbin Institute of Technology’s multi-stage dynamic sparse optimisation method determines an input matrix, then generates mask matrices to guide matrix multiplication independently at each stage of the transformer inference pipeline. Because the prefill and decode phases have fundamentally different computational profiles, this multi-stage design lets each be optimised on its own terms. For attention head importance-based pruning, a method from Shanghai Tianshu Zhixin Semiconductor computes an importance index for each attention head with respect to the current token before the attention module processes it, retains only the target heads, and performs inference solely over the retained set. The same filing applies sparsification to feed-forward neurons as well, creating a two-level sparsity strategy across the attention and MLP sublayers that compounds the reduction in total FLOPs per inference step. According to IEEE research on efficient neural network inference, combining attention and MLP sparsification is among the most effective routes to end-to-end inference acceleration.
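A toy sketch of head-importance pruning, with stated assumptions: the importance index used here (alignment between the current token's query and each head's mean cached key) is a hypothetical proxy computed before attention runs, not the filing's actual formula:

```python
import numpy as np

def select_heads(Q, K, keep: int) -> list:
    """Hypothetical importance index, computed before attention runs.
    Q: (H, d) current-token query per head; K: (H, n, d) cached keys."""
    importance = np.abs(np.einsum('hd,hd->h', Q, K.mean(axis=1)))
    return sorted(np.argsort(importance)[::-1][:keep].tolist())

def pruned_attention(Q, K, V, heads):
    """Run single-token attention only over the retained heads."""
    outs = []
    for h in heads:
        s = K[h] @ Q[h] / np.sqrt(Q.shape[-1])   # (n,) scores for head h
        w = np.exp(s - s.max())
        w /= w.sum()
        outs.append(w @ V[h])                    # (d,) head output
    return np.stack(outs)
```

Pruned heads never touch the score computation, so the per-step cost scales with the retained head count rather than the full head count.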

Explore the full patent landscape for sparse attention mechanisms in PatSnap Eureka.

Explore Patent Data in PatSnap Eureka →

Token Compression and Block-Based Attention: Reducing n Before the Quadratic Hits

Token compression approaches attack the quadratic problem from a different angle: rather than masking the attention matrix, they reduce the effective sequence length presented to the attention mechanism. If n tokens can be compressed to m representative tokens where m is much less than n, the resulting attention computation scales as O(m²) instead of O(n²) — a reduction that compounds across layers.

Zhejiang Lab’s anchor compression method inserts “anchor compression modules” into intermediate transformer layers. These modules perform importance analysis on the full input sequence to dynamically select a set of representative anchor positions. The anchors then serve as query vectors and, together with the original sequence, generate a compressed representation through sparse attention. Because the number of anchor positions is much smaller than the full sequence length, both attention computation complexity and GPU memory usage are reduced. Critically, the anchor compression module uses a plug-in architecture that requires no retraining of the base model, making it deployable on pre-trained models such as LLaMA, GPT, BERT, and Qwen.
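A minimal sketch of the anchor idea, assuming a simple norm-based importance proxy for anchor selection (the real module performs a learned importance analysis):

```python
import numpy as np

def anchor_compress(X, m: int):
    """Pick the top-m positions as anchors and let them attend over the
    full sequence to form a compressed representation. X: (n, d).
    The norm-based importance score is an illustrative assumption."""
    importance = np.linalg.norm(X, axis=1)        # hypothetical importance proxy
    anchors = np.sort(np.argsort(importance)[-m:])
    Q = X[anchors]                                # (m, d) anchor queries
    scores = Q @ X.T / np.sqrt(X.shape[1])        # (m, n): O(m·n), not O(n²)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ X                                  # (m, d) compressed sequence
```

Downstream attention over the compressed sequence then costs O(m²): at n = 64 and m = 8, that is 64 pairwise scores instead of 4,096.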

Zhejiang Lab’s anchor compression approach reduces transformer attention complexity by selecting representative anchor positions — far fewer than the full sequence length — and uses a plug-in architecture requiring no retraining, making it compatible with pre-trained models including LLaMA, GPT, BERT, and Qwen.

Nankai University’s token fusion method exploits the empirical observation that many tokens in each layer’s input and output are highly similar to one another. By grouping tokens according to similarity, generating a grouping mask matrix, and fusing similar tokens into a compressed representation before feeding them to the FFN module, the effective matrix dimensions entering subsequent computation are reduced. The method focuses specifically on the FFN sublayer — where parameters constitute a large share of total model parameters — as the primary beneficiary of this compression strategy.
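The grouping-and-fusing step can be sketched as a greedy cosine-similarity pass; the threshold value and mean-based fusion rule are illustrative choices, not the filing's exact procedure:

```python
import numpy as np

def fuse_similar_tokens(X, threshold: float = 0.95):
    """Assign each token to the first earlier group whose representative
    is cosine-similar above `threshold`, then fuse each group by averaging
    before the FFN sees it. X: (n, d). Returns (fused tokens, groups)."""
    normed = X / np.linalg.norm(X, axis=1, keepdims=True)
    reps, groups = [], []
    for i, x in enumerate(normed):
        for g, r in enumerate(reps):
            if float(x @ r) >= threshold:   # similar enough: join group g
                groups[g].append(i)
                break
        else:                               # no match: start a new group
            reps.append(x)
            groups.append([i])
    fused = np.stack([X[idx].mean(axis=0) for idx in groups])
    return fused, groups
```

The FFN then processes `fused` (fewer rows) instead of `X`, shrinking the matrix dimensions of the most parameter-heavy sublayer.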

Key finding: Precise-plus-fuzzy block attention

A method from Zhongshu (Xiamen) Information Technology partitions the key-value matrix into blocks, computes block-level averages, and uses correlation scores between the query and block averages — via top-k nucleus sampling logic — to determine which regions receive exact (precise) attention versus approximate (fuzzy) attention. Local neighbour regions receive precise computation; distant global regions receive block-averaged approximations. This combined strategy avoids the full quadratic cost while retaining both local and global contextual information.
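A single-query sketch of the precise-plus-fuzzy split, with plain top-k block selection standing in for the patent's sampling logic:

```python
import numpy as np

def precise_fuzzy_attention(q, K, V, block: int, local_blocks: int, topk: int):
    """Attend exactly to the `local_blocks` nearest key blocks, and to the
    `topk` best-matching distant blocks only via their block averages.
    q: (d,); K, V: (n, d) with n divisible by `block`."""
    n, d = K.shape
    nb = n // block
    Kb = K.reshape(nb, block, d)
    Vb = V.reshape(nb, block, d)
    K_avg = Kb.mean(axis=1)                        # (nb, d) block summaries
    corr = K_avg @ q                               # query-block correlation scores
    distant = np.arange(nb - local_blocks)         # all but the nearest blocks
    fuzzy = distant[np.argsort(corr[distant])[-topk:]]
    # Effective keys/values: exact local tokens + averaged distant blocks.
    K_eff = np.vstack([Kb[-local_blocks:].reshape(-1, d), K_avg[fuzzy]])
    V_eff = np.vstack([Vb[-local_blocks:].reshape(-1, d), Vb[fuzzy].mean(axis=1)])
    s = K_eff @ q / np.sqrt(d)
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V_eff
```

The query sees every region of the sequence, but only the local neighbourhood at token granularity; everything else enters at block granularity.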

The block-level analog to token fusion is described in CHAOS INDUSTRIES’ “Divide and attend long range block attention” method, which divides an input sequence of length n into N blocks, each shorter than n. Attention is computed within each block sequentially, with the output of the first block feeding as input for the computation of the next block’s output. This sequential block chaining provides long-range dependency modelling while keeping per-block attention computation sub-quadratic relative to the full sequence length — an approach that aligns with established ACM findings on sliding-window attention architectures for efficient sequence modelling.

Figure 2 — Sparse attention complexity reduction: from O(n²) full attention to sub-quadratic regimes
Relative compute cost by paradigm: full attention O(n²), very high (baseline); sparse masking, reduced; token compression O(m²) with m << n, medium; implicit/block methods, lowest (sub-quadratic).
Illustrative comparison of compute cost across the four sparse attention paradigms: token compression reduces complexity to O(m²) where m is much less than n, while implicit and block-based methods achieve the lowest relative cost by restructuring the computation graph entirely.

Global-Local Decomposition and Hierarchical Architectures

Global-local attention decomposition splits the attention computation into two components: dense local attention within a bounded window, and a compressed global representation computed at sub-quadratic cost. This preserves both fine-grained local precision and long-range contextual awareness without applying full O(n²) attention across the entire sequence.

Microsoft Technology Licensing’s SSM-enhanced transformer implements this with a two-layer encoder architecture. A global layer computes a global self-attention vector for each token in a local input sequence drawn from a larger global input. A local layer then computes local self-attention over each local sequence and combines it with the global vector through addition and normalisation. The encoder representation thus includes both global context — captured at lower cost via the global layer’s aggregation over sub-sequences — and local precision, enabling long-range modelling without O(n²) cost over the full global sequence. This architecture is protected via filings in both the United States and South Korea.
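A toy version of the global-local split, assuming mean-pooled window summaries as the cheap global representation (the actual filing uses an SSM-enhanced global layer, not mean pooling):

```python
import numpy as np

def softmax_rows(s):
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)

def global_local_encode(X, window: int):
    """Global component: every token attends over cheap window summaries.
    Local component: exact attention inside each window. The two are
    combined by addition and row normalisation. X: (n, d), n divisible
    by `window`."""
    n, d = X.shape
    nw = n // window
    summaries = X.reshape(nw, window, d).mean(axis=1)            # (nw, d)
    g = softmax_rows(X @ summaries.T / np.sqrt(d)) @ summaries   # (n, d) global vector
    out = np.empty_like(X)
    for b in range(nw):                                          # exact local attention
        Xl = X[b * window:(b + 1) * window]
        out[b * window:(b + 1) * window] = softmax_rows(Xl @ Xl.T / np.sqrt(d)) @ Xl
    combined = out + g                                           # addition
    mu = combined.mean(axis=1, keepdims=True)
    sd = combined.std(axis=1, keepdims=True)
    return (combined - mu) / (sd + 1e-6)                         # normalisation
```

With window size w, the local part costs O(n·w) and the global part O(n·n/w); choosing w near √n keeps both components sub-quadratic.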

Peking University’s instance-adaptive sparse pattern selection mechanism takes a different hierarchical approach. Shallow transformer layers process all input with standard attention; a mode selector attached to the last shallow layer analyses the output hidden vectors and assigns weights to multiple pre-defined sparse patterns. Each deep transformer layer then executes sparse attention weighted by the pattern importance scores. This instance-level adaptation enables the model to select different sparse patterns for different inputs, outperforming static single-pattern sparsification by tailoring computational allocation to individual input characteristics. Research published by Nature on adaptive neural architectures supports the finding that input-conditional computation allocation consistently outperforms fixed-budget approaches.
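A sketch of the mode-selector idea, with a random linear probe standing in for the trained selector and two toy patterns (both are assumptions for illustration):

```python
import numpy as np

def select_pattern_weights(hidden, patterns):
    """Pool the shallow layers' hidden states, score each pre-defined
    sparse pattern with a linear probe, and return softmax weights plus
    the weighted mask mixture for the deep layers.
    hidden: (tokens, d); patterns: list of (n, n) boolean masks."""
    rng = np.random.default_rng(0)
    d = hidden.shape[-1]
    probe = rng.standard_normal((d, len(patterns)))  # stand-in for the trained selector
    logits = hidden.mean(axis=0) @ probe             # one score per pattern
    w = np.exp(logits - logits.max())
    w /= w.sum()
    mixed = sum(wi * p.astype(float) for wi, p in zip(w, patterns))
    return w, mixed
```

Because the weights depend on the pooled hidden state, different inputs receive different pattern mixtures, which is the instance-level adaptation the filing describes.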

Peking University’s instance-adaptive sparse pattern selection uses shallow transformer layers to analyse input characteristics and assign weights to multiple pre-defined sparse patterns, enabling each deep transformer layer to execute sparse attention tailored to individual inputs — outperforming static sparse masking approaches.

Tsinghua University addresses a key limitation of uniform sparsity: the same sparse pattern cannot optimally serve different input lengths. Its method evaluates how each attention mask affects the model’s final prediction, determines accuracy loss across different input sequence lengths, and uses a Pareto-front search — with sequence length, accuracy loss, and density as objectives — to assign heterogeneous scaling rules, each with head-level hyperparameters, to different sparse attention heads at different input lengths. The result is a flexible sparse attention configuration that adapts to varying context window sizes, directly addressing the deployment challenge of variable-length inputs in production systems.

Shandong Inspur Scientific Research Institute introduces a “broken stick” segmentation process that encodes positional information directly into attention score calculation without separate positional encoding modules. This approach naturally induces a locality bias in the attention distribution — making the model more likely to attend to nearby tokens — while retaining the capacity to model long-range dependencies, reducing both model complexity and attention dispersion in long sequences. The University of Science and Technology of China similarly assigns different sparsity levels to attention heads and neurons based on their importance scores during inference, enabling dynamic layer-level sparsity optimisation at runtime.

Track the latest sparse attention and long-context LLM patent filings with PatSnap Eureka.

Search Sparse Attention Patents in PatSnap Eureka →

Patent Landscape: Who Is Filing and What They Are Building

The patent data reveals a concentration of sparse attention innovation across several distinct clusters of assignees, each pursuing specific angles on the complexity reduction problem. Microsoft Technology Licensing is the most prominent Western-headquartered filer in this dataset, with a coherent portfolio spanning dynamic per-head sparsity pattern search and global-local attention decomposition enhanced by state space models. Microsoft’s approach is notable for its calibration-then-inference pipeline, which avoids runtime pattern search overhead during inference.

Chinese academic institutions collectively constitute the most active group. Tsinghua University addresses heterogeneous sparse rule assignment across variable input lengths. Peking University contributes instance-adaptive sparse pattern selection. Zhejiang Lab introduces anchor-based compression for plug-in deployment. The University of Electronic Science and Technology of China focuses on load-balanced multi-GPU scheduling for heterogeneous sparse attention heads, addressing the load imbalance problem that arises when retrieval heads, sparse heads, and stride heads are deployed across multi-GPU systems — proposing asynchronous weight preloading combined with KV cache management as the solution.

Figure 3 — Key assignees and their primary sparse attention innovation focus areas
Microsoft Technology Licensing: head-level calibrated sparsity and SSM global-local decomposition (2024–2026)
Tsinghua University (清华大学): heterogeneous Pareto-front sparse rule assignment (2024)
Peking University (北京大学): instance-adaptive sparse pattern selection via shallow layers (2023)
Zhejiang Lab (之江实验室): plug-in anchor compression, no retraining required (2026)
Google LLC: structured factorisation for full attention at sparse cost (2022)
Samsung Electronics: trained mask generation for attention window expansion in LLMs (2025)
Applied Brain Research Inc.: implicit sub-quadratic attention via row-column similarities (2024)
Harbin Institute of Technology: multi-stage dynamic sparse optimisation with early mask generation (2025)
Key assignees identified in the sparse attention patent dataset and their primary technical focus areas; Chinese academic institutions and Microsoft Technology Licensing collectively account for the majority of filings.

Google LLC’s filing proposes a structured factorisation of the conditional distribution underlying self-attention, treating each position’s attention as a conditional expectation and enabling both direct and indirect attention — via group representations of local regions — to all other positions, achieving full attention capability with sparse computational cost. Samsung Electronics takes a deployment-first approach: its method applies a trained mask generation model that outputs layer-specific attention masks based on attention logits from recent tokens, effectively extending the practical attention window of an already-trained LLM without retraining.

Applied Brain Research Inc.’s implicit attention method represents the most fundamental departure from the masking paradigm. Rather than masking a quadratic computation, it reconstructs output vectors as matrices and transforms them via pairwise row-column similarities, yielding sub-quadratic complexity through a fundamentally different computation graph — one that entirely bypasses the explicit O(n²) score matrix. This approach is consistent with the direction signalled by WIPO’s global patent trend data, which shows increasing filings in alternative sequence modelling architectures that move beyond the standard attention formulation.

Applied Brain Research Inc.’s implicit attention method achieves sub-quadratic complexity by reconstructing sequential dependencies through pairwise row-column similarities on output vectors, entirely bypassing the explicit O(n²) attention score matrix — a fundamentally different computation graph from all masking-based approaches.


Still have questions? Let PatSnap Eureka answer them for you.

Ask PatSnap Eureka for a Deeper Answer →

References

  1. Dynamic sparsity patterns for attention heads (WO) — Microsoft Technology Licensing, LLC, 2026
  2. Dynamic sparsity patterns for attention heads (US) — Microsoft Technology Licensing, LLC, 2026
  3. Method and device for sparse attention in transformer model using a convolution filter — Sogang University Industry-Academic Cooperation Foundation, 2025
  4. Multi-stage dynamic sparse optimisation method for transformer accelerators — Harbin Institute of Technology, 2025
  5. LLM inference method with attention head importance pruning — Shanghai Tianshu Zhixin Semiconductor, 2025
  6. Anchor compression mechanism for multi-layer attention model inference acceleration — Zhejiang Lab, 2026
  7. Token fusion-based LLM inference optimisation — Nankai University, 2024
  8. Divide and attend long range block attention — CHAOS INDUSTRIES, INC., 2025
  9. Efficient attention computation method for transformer architecture — Zhongshu (Xiamen) Information Technology, 2025
  10. Long sequence modeling via SSM-enhanced transformer (US) — Microsoft Technology Licensing, LLC, 2024
  11. Long sequence modeling using SSM-enhanced transformer (KR) — Microsoft Technology Licensing, LLC, 2025
  12. Sparse attention computation model and method — Peking University, 2023
  13. Heterogeneous scaling rule auto-assignment for sparse attention — Tsinghua University, 2024
  14. Full attention with sparse computational cost — Google LLC, 2022
  15. Inference acceleration for LLM processing of extremely long texts — University of Electronic Science and Technology of China, 2025
  16. Inference acceleration for LLM processing of extremely long texts (multi-GPU scheduling) — University of Electronic Science and Technology of China, 2025
  17. Accelerated inference method for LLMs — University of Science and Technology of China, 2025
  18. Method and system for implicit attention with sub-quadratic complexity — Applied Brain Research Inc., 2024
  19. Memory Efficient Attention Window Expansion For Trained LLMs — Samsung Electronics Co., Ltd., 2025
  20. Attention mechanism improvement method for long sequences — Shandong Inspur Scientific Research Institute, 2025
  21. Dynamic sparsity-based LLM inference acceleration — National University of Defense Technology, 2025
  22. Long sequence modeling via SSM-enhanced transformer (WO) — Microsoft Technology Licensing, LLC, 2024
  23. WIPO — World Intellectual Property Organization: Global Patent Trend Data
  24. IEEE — Institute of Electrical and Electronics Engineers: Research on Efficient Neural Network Inference
  25. ACM — Association for Computing Machinery: Sliding-Window Attention Architectures
  26. Nature — Adaptive Neural Architectures and Input-Conditional Computation
  27. PatSnap Blog — Innovation Intelligence Research
  28. PatSnap Eureka — AI-Powered Patent and R&D Intelligence Platform

All data and statistics in this article are sourced from the references above and from PatSnap’s proprietary innovation intelligence platform.
