LLM Quantization & Compression 2026 — PatSnap Eureka

Reading time: 14 min
Published: Jun 2, 2025
Coverage: 2019–2026
Patent Landscape 2026

LLM Quantization & Compression Technology Landscape 2026

Analysis of 70+ patent records and literature sources spanning 2019–2026, covering post-training quantization, ultra-low bit-width binarization, low-rank decomposition, hybrid pipelines, and the emerging global assignee landscape across cloud, edge, and federated deployment targets.

Fig. 01 — Top Assignees by Patent Records in Dataset
[Figure: horizontal bar chart of patent record counts per assignee in the LLM quantization and compression dataset, 2019–2026. L&T Technology Services 5, Microsoft 4, Google 4, Multiverse Computing 4, Baidu 3, NVIDIA 3, Qualcomm 3, Amazon 2, Samsung 2. Source: PatSnap Eureka patent analysis.]
Published by PatSnap Insights Team · 14 min read · Verified by PatSnap Eureka Data
Technology Overview

Five Sub-Domains Defining the LLM Compression Field

LLM quantization and compression constitutes a cluster of techniques that transform high-precision floating-point model weights—typically FP32 or FP16—into lower-bit representations including integers, low-bit floats, or binary codes, while preserving functional accuracy. The field has become strategically critical as deployment targets expand from cloud data centers to edge devices, smartphones, and federated learning environments.
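
To make the effect of bit-width concrete, the following rough arithmetic estimates weight storage at different precisions. It is a back-of-the-envelope sketch only: the 7B parameter count is a hypothetical example, and the estimate ignores scales, zero-points, activations, and KV cache.

```python
# Bits per stored weight for each format (weights only; quantization
# metadata such as scales and zero-points is ignored in this estimate).
BITS_PER_WEIGHT = {
    "FP32": 32, "FP16": 16, "FP8": 8, "INT8": 8, "INT4": 4, "FP4": 4, "1-bit": 1,
}


def weight_footprint_gb(num_params: int, fmt: str) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


if __name__ == "__main__":
    params = 7_000_000_000  # hypothetical 7B-parameter model
    for fmt in BITS_PER_WEIGHT:
        print(f"{fmt:>5}: {weight_footprint_gb(params, fmt):6.2f} GB")
```

Under these assumptions, moving from FP16 to INT4 cuts weight storage by a factor of four, which is the kind of reduction that makes edge and smartphone targets plausible.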

The 70+ retrieved records spanning 2019–2026 fall into five broad sub-domains: weight quantization targeting INT4, INT8, FP4, FP8, and 1-bit; structural decomposition via low-rank factorization and tensor networks; pruning, including structured channel pruning and dynamic expert pruning in MoE architectures; hybrid pipelines combining quantization, pruning, and knowledge distillation; and hardware-aware deployment targeting edge, FPGA, and distributed federated systems.

Core technical challenges cited across results include quantization error accumulation at ultra-low bit-widths (sub-4-bit), activation outliers that resist standard quantization grids, hardware portability constraints, and accuracy recovery without full retraining. External standards bodies such as IEEE and IETF are increasingly engaged with precision formats relevant to this domain.
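
The activation-outlier problem can be illustrated with a small experiment. The sketch below is not drawn from any patent in the dataset: it uses synthetic Gaussian activations, a single artificial outlier of value 100, and plain absolute-maximum (absmax) INT8 quantization, all of which are assumptions chosen for illustration.

```python
import numpy as np


def quantize_absmax(x: np.ndarray, bits: int = 8) -> np.ndarray:
    """Symmetric per-tensor quantize-dequantize; the range is set by max(|x|)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale


rng = np.random.default_rng(0)
acts = rng.normal(0.0, 1.0, size=4096)  # synthetic "well-behaved" activations
acts_outlier = acts.copy()
acts_outlier[0] = 100.0                 # one synthetic outlier value

# Mean squared quantization error with and without the outlier.
mse_clean = np.mean((acts - quantize_absmax(acts)) ** 2)
mse_outlier = np.mean((acts_outlier - quantize_absmax(acts_outlier)) ** 2)
print(f"MSE without outlier: {mse_clean:.2e}")
print(f"MSE with outlier:    {mse_outlier:.2e}")
```

A single large value stretches the quantization grid, so every other value is rounded more coarsely. This is the motivation for activation-aware scaling and range learning in the filings discussed later.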

Dataset (PatSnap Eureka): 70+ patent and literature records, 2019–2026, across CN, US, WO, IN, EP, GB, and TW jurisdictions.
  • 70+ patent and literature records analyzed
  • 5 core compression sub-domains identified
  • 2019: earliest foundational records in dataset
  • 30+ CN patent records (dominant jurisdiction)
  • 15+ US patent records in dataset
  • 10+ WO (PCT) international filings
Compression Technique Tags
Post-Training Quantization · Ultra-Low Bit / Binarization · Low-Rank Decomposition · Hybrid Pipeline · Hardware-Aware Deployment · Federated Quantization
Innovation Timeline

From Foundational Research to Competitive Product Differentiation

Three distinct phases characterize the 2019–2026 dataset: theoretical groundwork, mid-stage product-readiness, and a pronounced 2025–2026 filing surge from major hardware vendors and Chinese academic institutions.

Jurisdiction Distribution in Dataset

China (CN) dominates with 30+ records; US holds 15+; WO (PCT) accounts for 10+; IN, EP, GB, and TW represent smaller but growing clusters.

[Figure: horizontal bar chart of LLM quantization patent records per jurisdiction in the 2019–2026 dataset. CN 30+, US 15+, WO 10+, IN 4, EP 3, GB 1, TW 1. Source: PatSnap Eureka patent analysis.]

Innovation Timeline by Phase

Three distinct phases: foundational research (2019–2022), mid-stage product readiness (2022–2024), and a 2025–2026 acceleration surge from hardware vendors and academic institutions.

[Figure: process diagram of three innovation phases in LLM quantization patent activity, 2019–2026. Source: PatSnap Eureka patent and literature analysis.]
  • 2019–2022, Foundational: GOBO (3-bit BERT, no fine-tuning); lattice vector quantization with 10.7× compression on MobileNet; Tsinghua, first CN institutional patent
  • 2022–2024, Mid-Stage: Baidu sparsification + quantization patents; Qualcomm joint pruning-quantization; Google LLM quantization filings (US + WO, Dec 2024)
  • 2025–2026, Acceleration: Samsung, Amazon, NVIDIA, Microsoft; MoE compression emerges as a sub-field; federated quantization commercializing
Source note (PatSnap Eureka): timeline derived from patent filing and grant dates across 70+ records in dataset. Early literature records from 2020 include GOBO (3-bit BERT) and Universal Deep Neural Network Compression (10.7× on MobileNet).
Key Technology Approaches

Four Patent Clusters Structuring the Compression Landscape

The dataset organizes into four primary technical clusters, each representing a distinct approach to reducing LLM memory footprint and inference cost.

Cluster 1

Post-Training Quantization (PTQ)

The dominant approach: converting pre-trained LLM weights and activations to INT4, INT8, FP4, or FP8 using calibration datasets without full retraining. Key filers include Google LLC (activation-aware scaling, US/WO 2024), Samsung Electronics (joint weight-equalization + range-parameter learning targeting smartphones, WO/GB 2025), Amazon Technologies (PTQ with non-linear integer approximation covering LLaMA, US/WO 2025), and Intel Corporation (signed gradient descent for INT4 rounding, US 2025).

INT4 / INT8 / FP4 / FP8 targets
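
As a rough illustration of the generic PTQ recipe described in Cluster 1 (quantize weights, pick activation ranges from calibration data, no retraining), a minimal NumPy sketch follows. It does not reproduce any filer's method: the function names, per-channel symmetric weight scaling, and the 99.9th-percentile activation clip are all assumptions made for this example.

```python
import numpy as np


def quantize_weights_per_channel(w: np.ndarray, bits: int = 8):
    """Symmetric per-output-channel weight quantization. Returns (int weights, scales)."""
    qmax = 2 ** (bits - 1) - 1
    scales = np.max(np.abs(w), axis=1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales)  # guard against all-zero rows
    q = np.clip(np.round(w / scales), -qmax, qmax).astype(np.int8)
    return q, scales


def calibrate_activation_scale(calib_batches, bits: int = 8, percentile: float = 99.9) -> float:
    """Choose a per-tensor activation scale from calibration data (percentile clip, an assumption)."""
    qmax = 2 ** (bits - 1) - 1
    all_abs = np.abs(np.concatenate([b.ravel() for b in calib_batches]))
    return float(np.percentile(all_abs, percentile)) / qmax


def quantized_linear(x, q_w, w_scales, x_scale, bits: int = 8):
    """Approximate y = x @ W.T using integer accumulation and a float rescale."""
    qmax = 2 ** (bits - 1) - 1
    q_x = np.clip(np.round(x / x_scale), -qmax, qmax).astype(np.int32)
    acc = q_x @ q_w.astype(np.int32).T   # integer matmul
    return acc * x_scale * w_scales.T    # dequantize per output channel


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(16, 32))            # (out_features, in_features)
    x = rng.normal(size=(8, 32))             # (batch, in_features)
    calib = [rng.normal(size=(64, 32))]      # synthetic calibration batch
    q_w, s_w = quantize_weights_per_channel(w)
    s_x = calibrate_activation_scale(calib)
    y = quantized_linear(x, q_w, s_w, s_x)
    ref = x @ w.T
    print("relative error:", np.linalg.norm(y - ref) / np.linalg.norm(ref))
```

The calibration step is where the filings in this cluster differ most: activation-aware scaling, learned range parameters, and learned rounding are all ways of choosing better scales than the simple percentile used here.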
Cluster 2

Ultra-Low Bit-Width & Binarization

A distinct sub-cluster addressing 1-bit and sub-4-bit quantization where standard PTQ methods degrade severely. Shanghai Jiao Tong University combines post-training 1-bit quantization with backpropagation-based optimization (ARB-LLM method). Sun Yat-sen University applies Haar wavelet decomposition to weight matrices before BiLLM binary residual approximation. Microsoft’s ULP-Linear modules replace standard linear projections in transformer layers, while its distribution encoding patent exploits non-uniform weight distributions to reduce the number of GPUs required.

1-bit / ULP / Wavelet-assisted
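
The basic building block behind 1-bit methods can be sketched as sign-and-scale binarization with an optional residual pass. This is a generic textbook-style illustration, not the ARB-LLM, BiLLM, or ULP-Linear procedure; the per-row scale and the two-pass residual scheme are assumptions chosen for clarity.

```python
import numpy as np


def binarize(w: np.ndarray):
    """Per-row 1-bit approximation w ≈ alpha * sign(w), with alpha = mean(|w|) per row.

    For a fixed sign pattern, mean(|w|) is the least-squares optimal scale.
    """
    alpha = np.mean(np.abs(w), axis=1, keepdims=True)
    b = np.where(w >= 0, 1.0, -1.0)
    return alpha, b


def residual_binarize(w: np.ndarray, passes: int = 2) -> np.ndarray:
    """Sum of `passes` binary terms, each fitted to the residual left by the previous ones."""
    approx = np.zeros_like(w)
    for _ in range(passes):
        alpha, b = binarize(w - approx)
        approx = approx + alpha * b
    return approx


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(8, 64))
    for p in (1, 2):
        err = np.linalg.norm(w - residual_binarize(w, p)) / np.linalg.norm(w)
        print(f"{p} pass(es): relative error {err:.3f}")
```

Each extra pass costs one more bit per weight plus a scale per row, so residual schemes trade a slightly higher bit budget for lower error.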
Cluster 3

Low-Rank Decomposition & Tensor Networks

These approaches decompose weight matrices into lower-rank factor products or tensor network representations with mathematically bounded accuracy loss. Multiverse Computing (Spain) decomposes LLM weight layers into compressed tensor networks (WO/EP/US/TW 2025). Yuanqi Core Semiconductor replaces full-precision weights W with quantized weight Q plus quantized low-rank factors L×R. NVIDIA’s 2026 filing provides training-free eigenspace low-rank error compensation. Hangzhou ByteArk Technology combines GPTQ and Hessian inverse quantization on Hadamard-decomposed matrices.

Tensor networks / GPTQ / Hessian
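
The "quantized weight plus low-rank factors" idea, W ≈ Q + L×R, can be sketched with a truncated SVD of the quantization residual. This is an illustrative assumption about how such a decomposition might be computed, not the procedure claimed in any of the filings above; in particular, the factors here stay in floating point, whereas the Yuanqi Core filing describes quantized low-rank factors.

```python
import numpy as np


def quantize_dequantize(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """Symmetric per-row quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w), axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero rows
    return np.clip(np.round(w / scale), -qmax, qmax) * scale


def low_rank_compensate(w: np.ndarray, bits: int = 4, rank: int = 8):
    """Return (Q, L, R) with W ≈ Q + L @ R, where L @ R is the best rank-`rank`
    approximation (in Frobenius norm) of the quantization residual W - Q."""
    q = quantize_dequantize(w, bits)
    u, s, vt = np.linalg.svd(w - q, full_matrices=False)
    left = u[:, :rank] * s[:rank]   # absorb singular values into the left factor
    right = vt[:rank, :]
    return q, left, right


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(64, 64))
    q, left, right = low_rank_compensate(w)
    print("error, quantized only:   ", np.linalg.norm(w - q))
    print("error, with low-rank fix:", np.linalg.norm(w - (q + left @ right)))
```

On random weights the residual is close to white noise, so the gain is modest; the approach pays off when quantization error has low-rank structure, for example when a few outlier directions dominate.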
Cluster 4

Hybrid Pipeline Compression

Several patents combine multiple compression stages: pruning first, then quantization, then knowledge distillation to recover accuracy. Tsinghua University sequentially applies expert subnetwork construction by usage frequency, structural pruning, low-bit fixed-point quantization, and teacher-student knowledge distillation. The Institute of Automation, Chinese Academy of Sciences targets Mixture-of-Experts (MoE) LLMs with layer-by-layer quantization followed by task-type-aware dynamic expert pruning via frequency-based importance scoring. NVIDIA addresses both structured and unstructured pruning drawbacks through combined sparsification-quantization.

Prune → Quantize → Distill
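
The three stages of a hybrid pipeline can be sketched as small independent functions. The example below is a simplified illustration under stated assumptions: L2-norm channel pruning, per-tensor INT8 quantization, and a temperature-softened KL divergence as the distillation objective. None of these choices is taken from the Tsinghua, CAS, or NVIDIA filings, and the actual training loop that would minimize the distillation loss is omitted.

```python
import numpy as np


def prune_channels(w: np.ndarray, keep_ratio: float = 0.5):
    """Structured pruning: keep the output channels (rows) with the largest L2 norm."""
    n_keep = max(1, int(round(w.shape[0] * keep_ratio)))
    norms = np.linalg.norm(w, axis=1)
    keep = np.sort(np.argsort(norms)[-n_keep:])
    return w[keep], keep


def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor INT8 quantization. Returns (int8 weights, scale)."""
    scale = float(np.max(np.abs(w))) / 127.0
    scale = scale if scale > 0 else 1.0
    return np.clip(np.round(w / scale), -127, 127).astype(np.int8), scale


def softmax(z: np.ndarray, t: float = 1.0) -> np.ndarray:
    z = z / t
    z = z - np.max(z, axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)


def distillation_loss(teacher_logits, student_logits, t: float = 2.0) -> float:
    """Mean KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, t)
    q = softmax(student_logits, t)
    eps = 1e-12  # avoids log(0)
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(8, 4))
    pruned, kept = prune_channels(w, keep_ratio=0.5)   # stage 1: prune
    q, scale = quantize_int8(pruned)                   # stage 2: quantize
    print("kept channels:", kept, "| int8 shape:", q.shape)
    # stage 3: the loss a student would be trained to minimize
    print("KD loss:", distillation_loss(rng.normal(size=(4, 10)), rng.normal(size=(4, 10))))
```

Ordering matters in such pipelines: pruning first shrinks the tensor whose range the quantizer must cover, and distillation last recovers accuracy lost in both earlier stages.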
Source note (PatSnap Eureka): all cluster assignments are derived from patent abstracts and claims in the 70+ record dataset.
Application Domains

Where LLM Compression Is Being Deployed

The dataset reveals four primary deployment contexts, from smartphones to cloud GPU fleets to privacy-preserving federated environments.

Edge & Mobile
Samsung — Smartphones
Joint weight-equalization + range-parameter learning targeting resource-constrained devices (WO/GB 2025)
HCL Technologies — Edge DCV
Device Capability Vectors (DCV) and Knob Vectors (KV) per layer to optimize hybrid compression for specific hardware, validated by Semantic Preservation Score (IN 2025)
Qualcomm — On-Device ML
Holistic layout-vectorization-quantization for on-device weight decomposition and memory linearization optimized for signed integer dot products (US 2025)
Cloud & Enterprise AI
Amazon — LLaMA PTQ
PTQ system explicitly covering large-scale cloud deployment using LLaMA models with non-linear integer approximation (US/WO 2025)
Microsoft — GPU Fleet Reduction
ULP and distribution encoding work targets GPU count reduction, directly reducing cloud operational costs (US/WO 2025–2026)
NVIDIA — GPU Inference
Sparsification-quantization patent targets GPU-accelerated inference (US 2025)
Federated & Specialized Domains
City University of Hong Kong addresses the 80 GB GPU barrier for Llama 2 7B, and Shenzhen Big Data Institute reduces communication overhead in distributed quantization. Related themes in the dataset include federated fine-tuning, distributed deployment, and multimodal compression.
Source note (PatSnap Eureka): application domain assignments are derived from patent claims and described deployment scenarios in the dataset.
Assignee Landscape

Global IP Holders in LLM Quantization & Compression

Large US technology companies dominate in jurisdictional breadth (PCT, US, EP), while Chinese academic institutions generate high filing volumes domestically. L&T Technology Services (India) and Multiverse Computing (Spain) represent notable non-US, non-Chinese IP positions.

Assignee | Records in Dataset | Key Jurisdiction(s) | Primary Approach
L&T Technology Services Limited | 5 | US, IN, EP | Compression & tuning tooling, pruning
Microsoft Technology Licensing, LLC | 4 | US, WO | ULP quantization, distribution encoding, LoRA compression
Google LLC | 4 | US, WO, IN | Activation-aware PTQ, federated decentralized LLM
Multiverse Computing S.L. | 4 | EP, US, WO, TW | Tensor network compression
Baidu (Beijing Baidunetsun) | 3 | CN | Sparsification + quantization
NVIDIA Corporation | 3 | US, CN | Sparsification-quantization, eigenspace error compensation
Qualcomm Incorporated | 3 | US, WO | Joint pruning-quantization, vector quantization, holistic layout
Source note (PatSnap Eureka): filing counts represent records retrieved in this targeted dataset only. Innovation is moderately concentrated: US companies lead in jurisdictional breadth; Chinese academic institutions lead in domestic volume.
Emerging Directions

Six Signals from 2025–2026 Filings

The most recently filed patents in this dataset reveal where the compression frontier is moving — from MoE-specific compression to custom silicon for compressed inference.

MoE-Specific Compression

The Institute of Automation, Chinese Academy of Sciences’ 2025–2026 filings explicitly target Mixture-of-Experts architectures, applying quantization followed by task-type-driven dynamic expert pruning. Only 2 records in this dataset address MoE models despite their increasing commercial prevalence in frontier models (GPT-4, Mistral, DeepSeek), making this a high-priority IP whitespace.
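
Frequency-based expert importance scoring can be sketched in a few lines. The routing traces, task names, and the rule "keep the k most-used experts per task type" below are hypothetical and chosen only to illustrate the idea; the filings' actual scoring and pruning criteria are not described in enough detail here to reproduce.

```python
from collections import Counter


def expert_usage_frequency(routing_trace, num_experts):
    """Fraction of routed tokens sent to each expert, given a list of expert ids."""
    counts = Counter(routing_trace)
    total = len(routing_trace)
    return [counts.get(e, 0) / total if total else 0.0 for e in range(num_experts)]


def select_experts(freqs, keep):
    """Keep the `keep` most frequently used experts (ties broken by lower expert id)."""
    ranked = sorted(range(len(freqs)), key=lambda e: (-freqs[e], e))
    return sorted(ranked[:keep])


# Hypothetical routing traces: which expert each token was routed to, per task type.
traces = {
    "code": [0, 0, 1, 3, 0, 3, 3, 0],
    "chat": [2, 2, 1, 2, 1, 1, 2, 2],
}

# Task-type-aware pruning: a different expert subset is retained for each task.
kept = {
    task: select_experts(expert_usage_frequency(trace, num_experts=4), keep=2)
    for task, trace in traces.items()
}
print(kept)  # {'code': [0, 3], 'chat': [1, 2]}
```

Because the retained subset depends on the task type, the pruning is dynamic: the same quantized checkpoint can serve different workloads with different active experts.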

Training-Free Eigenspace Error Compensation

NVIDIA’s 2026 US filing proposes accuracy recovery without any retraining, based on eigenspace low-rank representations (priority filed September 2024). This decouples compression from training infrastructure, enabling faster deployment cycles and addressing the flexibility limitation of hardware-constrained sparse and quantized formats.

Polar Coordinate Vector Quantization

Harbin Institute of Technology (Shenzhen)’s 2025 filings introduce quantization in polar coordinate space rather than Cartesian, constructing separate direction codebooks and magnitude codebooks to address the sensitivity of vector direction in LLM weight distributions — a novel mathematical reformulation of the quantization grid distinct from all prior scalar and Cartesian vector approaches.
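
The split into direction and magnitude codebooks can be illustrated with a toy vector quantizer. This is a sketch under loud assumptions, not the patented method: directions are sampled from the data rather than learned (a real system would more likely use clustering), magnitudes sit at evenly spaced quantiles, and all function names are hypothetical.

```python
import numpy as np


def build_codebooks(vectors: np.ndarray, n_dirs: int = 16, n_mags: int = 8, seed: int = 0):
    """Hypothetical codebooks: unit directions sampled from the data, magnitudes at quantiles."""
    rng = np.random.default_rng(seed)
    norms = np.linalg.norm(vectors, axis=1)
    units = vectors / np.maximum(norms, 1e-12)[:, None]
    dir_codebook = units[rng.choice(len(units), size=n_dirs, replace=False)]
    mag_codebook = np.quantile(norms, (np.arange(n_mags) + 0.5) / n_mags)
    return dir_codebook, mag_codebook


def encode(vectors: np.ndarray, dir_codebook: np.ndarray, mag_codebook: np.ndarray):
    """Encode each vector as (direction index, magnitude index)."""
    norms = np.linalg.norm(vectors, axis=1)
    units = vectors / np.maximum(norms, 1e-12)[:, None]
    dir_idx = np.argmax(units @ dir_codebook.T, axis=1)  # highest cosine similarity
    mag_idx = np.argmin(np.abs(norms[:, None] - mag_codebook[None, :]), axis=1)
    return dir_idx, mag_idx


def decode(dir_idx, mag_idx, dir_codebook, mag_codebook):
    """Reconstruct vectors as magnitude * unit direction."""
    return mag_codebook[mag_idx][:, None] * dir_codebook[dir_idx]


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    weights = rng.normal(size=(256, 4))  # synthetic weight vectors of dimension 4
    d_cb, m_cb = build_codebooks(weights)
    d_idx, m_idx = encode(weights, d_cb, m_cb)
    recon = decode(d_idx, m_idx, d_cb, m_cb)
    print("relative error:", np.linalg.norm(weights - recon) / np.linalg.norm(weights))
```

With 16 directions and 8 magnitudes, each 4-dimensional vector costs 4 + 3 = 7 bits, and the two codebooks can be sized independently, which is what lets a design spend more of its bit budget on direction where that is the more sensitive quantity.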

Low-Rank + GPTQ/Hessian Becoming Standard

Multiple 2025–2026 filings from Yuanqi Core Semiconductor and Hangzhou ByteArk Technology combine low-rank decomposition with second-order Hessian-based quantization methods (GPTQ), indicating that second-order calibration is becoming standard practice in commercial-grade quantization pipelines targeting cloud inference, finance, healthcare, and government sectors.

Federated Quantization & Custom Silicon

The remaining two signals are quantization-aware federated fine-tuning and Tsinghua’s 2026 dedicated low-bit quantization processor patent, an early indicator of custom silicon for LLM inference.
Source note (PatSnap Eureka): emerging directions are derived from the most recently filed patents (2025–2026) in this dataset.
Strategic Implications

IP Strategy Signals for R&D and Patent Teams

Post-training quantization has become table stakes. Every major technology vendor in this dataset (Google, Amazon, Microsoft, NVIDIA, Intel, Qualcomm, and Samsung) has active PTQ filings. R&D teams entering without a PTQ position face a crowded patent landscape and should focus differentiation on calibration efficiency, hardware specificity, or extreme bit-width regimes.

The accuracy-compression frontier is shifting below 4 bits. Multiple 2025–2026 filings target INT4 or lower—including 1-bit binarization, FP4, and ULP—where standard integer quantization fails. Patent opportunities remain in error compensation, residual correction, and non-uniform codebook design at these bit-widths. Activation quantization remains harder than weight quantization and is a less-crowded filing space.

China is the highest-volume filing jurisdiction with strong academic-to-startup translation. Chinese universities including Tsinghua, Shanghai Jiao Tong, Sun Yat-sen, Harbin Institute of Technology (Shenzhen), Xiamen, Nanjing, and Beihang appear alongside startups such as Yuanqi Core Semiconductor and Hangzhou ByteArk Technology, indicating a rapid commercialization pathway. Non-Chinese companies should monitor CN publications for early disclosure of novel techniques before PCT or US filing. Relevant monitoring tools are available via PatSnap’s API for automated CN publication tracking.

Federated and distributed quantization is a structurally distinct IP cluster requiring simultaneous optimization of model size for communication cost, local accuracy, and aggregation efficiency. Google, Shenzhen Big Data Institute, and City University of Hong Kong are the identifiable leaders in this dataset; the space remains relatively open compared to standalone PTQ. The WIPO PCT system is the primary vehicle for international filing in this sub-cluster. Competitive intelligence resources are available at PatSnap Customers.

Source note (PatSnap Eureka): strategic implications are derived from patent assignee analysis and filing pattern observations in this dataset only. Not a comprehensive industry view.
IP Whitespace Signals
  • MoE-specific compression: only 2 records in dataset despite frontier model prevalence
  • Sub-4-bit activation quantization: harder than weight quantization, less crowded
  • Non-uniform codebook design at 1-bit / ULP bit-widths
  • Federated quantization: relatively open vs. standalone PTQ cluster
  • Custom silicon for compressed inference: Tsinghua 2026 CN filing is early signal
  • Polar coordinate quantization: novel reformulation with limited prior art
Key Risk Factors
  • Crowded PTQ landscape: all major US vendors have active filings
  • CN publication lag: novel techniques may appear in CN before PCT/US
  • Hardware portability constraints on GPU INT kernels and sparse tensor units
  • Quantization error accumulation at sub-4-bit remains unsolved at scale