LLM Quantization & Compression 2026 — PatSnap Eureka
LLM Quantization & Compression Technology Landscape 2026
Analysis of 70+ patent records and literature sources spanning 2019–2026, covering post-training quantization, ultra-low bit-width binarization, low-rank decomposition, hybrid pipelines, and the emerging global assignee landscape across cloud, edge, and federated deployment targets.
Five Sub-Domains Defining the LLM Compression Field
LLM quantization and compression constitutes a cluster of techniques that transform high-precision floating-point model weights—typically FP32 or FP16—into lower-bit representations including integers, low-bit floats, or binary codes, while preserving functional accuracy. The field has become strategically critical as deployment targets expand from cloud data centers to edge devices, smartphones, and federated learning environments.
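As a rough illustration of the transformation described above, the sketch below maps a handful of floating-point weights to signed integer codes with a single shared scale (symmetric, per-tensor quantization) and then reconstructs them. The function names and sample weights are hypothetical, and plain Python lists stand in for the tensors a real framework would use.

```python
def quantize_symmetric(weights, bits=8):
    """Map floats to signed integer codes using one shared scale."""
    qmax = 2 ** (bits - 1) - 1              # 127 for INT8, 7 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Reconstruct approximate floats from integer codes."""
    return [c * scale for c in codes]


def max_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))


# Hypothetical weights, chosen only for illustration.
weights = [0.42, -1.27, 0.05, 0.88, -0.31]

codes8, scale8 = quantize_symmetric(weights, bits=8)
codes4, scale4 = quantize_symmetric(weights, bits=4)

err8 = max_error(weights, dequantize(codes8, scale8))
err4 = max_error(weights, dequantize(codes4, scale4))

print("INT8 codes:", codes8, "max error:", err8)
print("INT4 codes:", codes4, "max error:", err4)
```

With round-to-nearest and no clipping, the reconstruction error of each weight is bounded by half the scale, which is why the INT4 error here is much larger than the INT8 error.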
Among 70+ retrieved records spanning 2019–2026, the field spans five broad sub-domains: weight quantization targeting INT4, INT8, FP4, FP8, and 1-bit; structural decomposition via low-rank factorization and tensor networks; pruning including structured channel pruning and dynamic expert pruning in MoE architectures; hybrid pipelines combining quantization, pruning, and knowledge distillation; and hardware-aware deployment targeting edge, FPGA, and distributed federated systems.
Core technical challenges cited across results include quantization error accumulation at ultra-low bit-widths (sub-4-bit), activation outliers that resist standard quantization grids, hardware portability constraints, and accuracy recovery without full retraining. External standards bodies such as IEEE and IETF are increasingly engaged with precision formats relevant to this domain.
From Foundational Research to Competitive Product Differentiation
Three distinct phases characterize the 2019–2026 dataset: theoretical groundwork, mid-stage product-readiness, and a pronounced 2025–2026 filing surge from major hardware vendors and Chinese academic institutions.
Jurisdiction Distribution in Dataset
China (CN) dominates with 30+ records; US holds 15+; WO (PCT) accounts for 10+; IN, EP, GB, and TW represent smaller but growing clusters.
Innovation Timeline by Phase
Three distinct phases: foundational research (2019–2022), mid-stage product readiness (2022–2024), and a 2025–2026 acceleration surge from hardware vendors and academic institutions.
Four Patent Clusters Structuring the Compression Landscape
The dataset organizes into four primary technical clusters, each representing a distinct approach to reducing LLM memory footprint and inference cost.
Post-Training Quantization (PTQ)
The dominant approach: converting pre-trained LLM weights and activations to INT4, INT8, FP4, or FP8 using calibration datasets without full retraining. Key filers include Google LLC (activation-aware scaling, US/WO 2024), Samsung Electronics (joint weight-equalization + range-parameter learning targeting smartphones, WO/GB 2025), Amazon Technologies (PTQ with non-linear integer approximation covering LLaMA, US/WO 2025), and Intel Corporation (signed gradient descent for INT4 rounding, US 2025).
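To make the role of the calibration dataset concrete, here is a minimal sketch of how a quantization range for activations might be chosen from calibration samples. It is a generic illustration, not any filer's method: it compares an absolute-maximum range with a percentile-clipped range on hypothetical data containing one large outlier. All names and values are assumptions.

```python
QMAX = 127  # largest magnitude in a symmetric signed INT8 range


def quantize(x, scale):
    """Round to the nearest integer code and clamp to the INT8 range."""
    return max(-QMAX, min(QMAX, round(x / scale)))


def clip_absmax(samples):
    """Range covers every calibration sample, including outliers."""
    return max(abs(x) for x in samples)


def clip_percentile(samples, pct=99.0):
    """Range covers roughly pct percent of samples; the rest are clipped."""
    ordered = sorted(abs(x) for x in samples)
    idx = min(len(ordered) - 1, int(len(ordered) * pct / 100.0))
    return ordered[idx]


def mean_abs_error(samples, scale):
    return sum(abs(x - quantize(x, scale) * scale) for x in samples) / len(samples)


# Hypothetical calibration activations: 200 values in [-1, 1) plus one outlier.
bulk = [((i * 37) % 200 - 100) / 100.0 for i in range(200)]
calib = bulk + [60.0]

scale_absmax = clip_absmax(calib) / QMAX
scale_pct = clip_percentile(calib) / QMAX

# Error measured on the typical values only; the clipped outlier is sacrificed.
err_absmax = mean_abs_error(bulk, scale_absmax)
err_pct = mean_abs_error(bulk, scale_pct)

print("abs-max scale:", scale_absmax, "bulk error:", err_absmax)
print("percentile scale:", scale_pct, "bulk error:", err_pct)
```

The trade-off is deliberate: clipping gives the bulk of the values a much finer grid at the cost of a large error on the outlier itself, which is one reason activation outliers are treated as a distinct problem in this field.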
INT4 / INT8 / FP4 / FP8 targets
Ultra-Low Bit-Width & Binarization
A distinct sub-cluster addressing 1-bit and sub-4-bit quantization where standard PTQ methods degrade severely. Shanghai Jiao Tong University combines post-training 1-bit quantization with backpropagation-based optimization (ARB-LLM method). Sun Yat-sen University applies Haar wavelet decomposition to weight matrices before BiLLM binary residual approximation. Microsoft’s ULP-Linear modules replace standard linear projections in transformer layers, while its distribution encoding patent exploits non-uniform weight distributions for improved GPU reduction.
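The idea of a binary residual approximation can be sketched generically: approximate the weights by a scale times their signs, then binarize the leftover error a second time. The code below is an assumption-laden toy, not the ARB-LLM, BiLLM, or ULP-Linear procedure; the weights and function names are hypothetical.

```python
def sign(x):
    return 1.0 if x >= 0 else -1.0


def binarize(values):
    """Return (alpha, signs); alpha = mean |v| minimizes the squared error
    of approximating values by alpha * sign(values)."""
    alpha = sum(abs(v) for v in values) / len(values)
    return alpha, [sign(v) for v in values]


def sq_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical weight group, for illustration only.
weights = [0.9, -0.2, 0.4, -1.1, 0.1, 0.7, -0.6, 0.3]

# First 1-bit pass: one scale plus one sign bit per weight.
alpha1, s1 = binarize(weights)
approx1 = [alpha1 * s for s in s1]

# Second 1-bit pass on the residual left by the first pass.
residual = [w - a for w, a in zip(weights, approx1)]
alpha2, s2 = binarize(residual)
approx2 = [a + alpha2 * s for a, s in zip(approx1, s2)]

err1 = sq_error(weights, approx1)
err2 = sq_error(weights, approx2)
print("1-bit error:", err1, "with residual term:", err2)
```

The second pass costs one more bit per weight, so schemes of this kind typically reserve it for the weights that matter most rather than applying it everywhere.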
1-bit / ULP / Wavelet-assisted
Low-Rank Decomposition & Tensor Networks
These approaches decompose weight matrices into lower-rank factor products or tensor network representations with mathematically bounded accuracy loss. Multiverse Computing (Spain) decomposes LLM weight layers into compressed tensor networks (WO/EP/US/TW 2025). Yuanqi Core Semiconductor replaces full-precision weights W with quantized weight Q plus quantized low-rank factors L×R. NVIDIA’s 2026 filing provides training-free eigenspace low-rank error compensation. Hangzhou ByteArk Technology combines GPTQ and Hessian inverse quantization on Hadamard-decomposed matrices.
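The W ≈ Q + L×R structure mentioned above can be illustrated with a small sketch: quantize W coarsely to get Q, then fit a low-rank term to the quantization error W − Q. The version below makes several simplifying assumptions that are not from the source: it uses rank 1, finds the factors by power iteration, keeps L and R in floating point rather than quantizing them, and works on a hypothetical 4×4 matrix.

```python
import math


def quantize_matrix(W, bits=3):
    """Symmetric per-tensor quantization, returned in dequantized form."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(x) for row in W for x in row) / qmax
    return [[max(-qmax, min(qmax, round(x / scale))) * scale for x in row]
            for row in W]


def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def frob_sq(A):
    return sum(x * x for row in A for x in row)


def rank1_factors(E, iters=50):
    """Power iteration on E^T E. Returns column L and unit row R with L*R ~ E."""
    rows, cols = len(E), len(E[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iters):
        u = [sum(e * vj for e, vj in zip(row, v)) for row in E]            # u = E v
        w = [sum(E[i][j] * u[i] for i in range(rows)) for j in range(cols)]  # w = E^T u
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    L = [sum(e * vj for e, vj in zip(row, v)) for row in E]                # L = E v
    return L, v


# Hypothetical weight matrix, for illustration only.
W = [[0.8, -0.3, 0.5, 0.1],
     [-0.6, 0.9, -0.2, 0.4],
     [0.3, 0.2, -0.7, -0.5],
     [0.1, -0.8, 0.6, 0.35]]

Q = quantize_matrix(W, bits=3)
E = sub(W, Q)                      # quantization error to be compensated
L, R = rank1_factors(E)
compensated = [[q + l * r for q, r in zip(q_row, R)] for q_row, l in zip(Q, L)]

err_q = frob_sq(E)
err_comp = frob_sq(sub(W, compensated))
print("error with Q only:", err_q, "with Q + L*R:", err_comp)
```

Because R has unit norm and L = E·R, the compensated error equals the original error minus the squared norm of L, so the low-rank term can only help; how much it helps depends on how concentrated the quantization error is in a few directions.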
Tensor networks / GPTQ / Hessian
Hybrid Pipeline Compression
Several patents combine multiple compression stages: pruning first, then quantization, then knowledge distillation to recover accuracy. Tsinghua University sequentially applies expert subnetwork construction by usage frequency, structural pruning, low-bit fixed-point quantization, and teacher-student knowledge distillation. The Institute of Automation, Chinese Academy of Sciences targets Mixture-of-Experts (MoE) LLMs with layer-by-layer quantization followed by task-type-aware dynamic expert pruning via frequency-based importance scoring. NVIDIA addresses both structured and unstructured pruning drawbacks through combined sparsification-quantization.
Prune → Quantize → Distill
Where LLM Compression Is Being Deployed
The dataset reveals four primary deployment contexts, from smartphones to cloud GPU fleets to privacy-preserving federated environments.
Global IP Holders in LLM Quantization & Compression
Large US technology companies dominate in jurisdictional breadth (PCT, US, EP), while Chinese academic institutions generate high filing volumes domestically. L&T Technology Services (India) and Multiverse Computing (Spain) represent notable non-US, non-Chinese IP positions.
| Assignee | Records in Dataset | Key Jurisdiction(s) | Primary Approach |
|---|---|---|---|
| L&T Technology Services Limited | 5 | US, IN, EP | Compression & tuning tooling, pruning |
| Microsoft Technology Licensing, LLC | 4 | US, WO | ULP quantization, distribution encoding, LoRA compression |
| Google LLC | 4 | US, WO, IN | Activation-aware PTQ, federated decentralized LLM |
| Multiverse Computing S.L. | 4 | EP, US, WO, TW | Tensor network compression |
| Baidu (Beijing Baidu Netcom) | 3 | CN | Sparsification + quantization |
| NVIDIA Corporation | 3 | US, CN | Sparsification-quantization, eigenspace error compensation |
| Qualcomm Incorporated | 3 | US, WO | Joint pruning-quantization, vector quantization, holistic layout |
Six Signals from 2025–2026 Filings
The most recently filed patents in this dataset reveal where the compression frontier is moving — from MoE-specific compression to custom silicon for compressed inference.
MoE-Specific Compression
The Institute of Automation, Chinese Academy of Sciences’ 2025–2026 filings explicitly target Mixture-of-Experts architectures, applying quantization followed by task-type-driven dynamic expert pruning. Only 2 records in this dataset address MoE models despite their increasing commercial prevalence in frontier models (GPT-4, Mistral, DeepSeek), making this a high-priority IP whitespace.
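Frequency-based importance scoring for expert pruning can be sketched as counting how often the router selects each expert for a given task type and keeping only the most-used experts. The routing traces, task names, and function names below are hypothetical, and the sketch is a generic reading of the idea rather than the method claimed in the filings.

```python
from collections import Counter


def expert_usage(routing_log, num_experts):
    """Fraction of routing decisions that selected each expert."""
    counts = Counter(e for token_experts in routing_log for e in token_experts)
    total = sum(counts.values()) or 1   # avoid division by zero on empty logs
    return [counts.get(e, 0) / total for e in range(num_experts)]


def select_experts(usage, keep):
    """Keep the `keep` most frequently used experts; the rest are pruned."""
    ranked = sorted(range(len(usage)), key=lambda e: usage[e], reverse=True)
    return sorted(ranked[:keep])


# Hypothetical top-2 routing traces over 4 experts, one list per task type.
routing_by_task = {
    "code": [(0, 2), (0, 2), (0, 1), (2, 0), (0, 2)],
    "chat": [(1, 3), (3, 1), (1, 3), (1, 0), (3, 1)],
}

kept = {
    task: select_experts(expert_usage(log, num_experts=4), keep=2)
    for task, log in routing_by_task.items()
}
print(kept)
```

Scoring per task type is what makes the pruning dynamic in this sketch: the same model keeps a different expert subset depending on the workload it is serving.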
Training-Free Eigenspace Error Compensation
NVIDIA’s 2026 US filing proposes accuracy recovery without any retraining, based on eigenspace low-rank representations (priority filed September 2024). This decouples compression from training infrastructure, enabling faster deployment cycles and addressing the flexibility limitation of hardware-constrained sparse and quantized formats.
Polar Coordinate Vector Quantization
Harbin Institute of Technology (Shenzhen)’s 2025 filings introduce quantization in polar coordinate space rather than Cartesian, constructing separate direction codebooks and magnitude codebooks to address the sensitivity of vector direction in LLM weight distributions — a novel mathematical reformulation of the quantization grid distinct from all prior scalar and Cartesian vector approaches.
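A toy two-dimensional version of the separate direction and magnitude codebooks might look like the sketch below. It assumes evenly spaced angles for the direction codebook and a hand-picked list of radii for the magnitude codebook; the filings presumably construct their codebooks differently, and every name and value here is hypothetical.

```python
import math

TWO_PI = 2 * math.pi


def make_direction_codebook(k):
    """k evenly spaced angles on the unit circle (an assumed, untrained codebook)."""
    return [TWO_PI * i / k for i in range(k)]


def angle_dist(a, b):
    """Shortest angular distance, accounting for wrap-around at 2*pi."""
    d = abs(a - b) % TWO_PI
    return min(d, TWO_PI - d)


def nearest(value, codebook, dist):
    return min(range(len(codebook)), key=lambda i: dist(value, codebook[i]))


def encode(vec, directions, magnitudes):
    """Quantize direction and magnitude independently; return the two indices."""
    x, y = vec
    theta = math.atan2(y, x) % TWO_PI
    radius = math.hypot(x, y)
    return (nearest(theta, directions, angle_dist),
            nearest(radius, magnitudes, lambda a, b: abs(a - b)))


def decode(code, directions, magnitudes):
    d, m = code
    return (magnitudes[m] * math.cos(directions[d]),
            magnitudes[m] * math.sin(directions[d]))


directions = make_direction_codebook(16)   # 4 bits per direction index
magnitudes = [0.25, 0.5, 1.0, 2.0]         # 2 bits per magnitude index

# Hypothetical weight pairs, treated as 2-D vectors.
pairs = [(0.9, 0.1), (-0.4, 0.3), (0.0, -1.9), (0.2, 0.2)]
codes = [encode(p, directions, magnitudes) for p in pairs]
recon = [decode(c, directions, magnitudes) for c in codes]

for p, c, r in zip(pairs, codes, recon):
    print(p, "->", c, "->", r)
```

Splitting the code this way lets the bit budget be divided unevenly between the two codebooks, for example spending more bits on direction when direction errors are the more damaging ones.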
Low-Rank + GPTQ/Hessian Becoming Standard
Multiple 2025–2026 filings from Yuanqi Core Semiconductor and Hangzhou ByteArk Technology combine low-rank decomposition with second-order Hessian-based quantization methods (GPTQ), indicating that second-order calibration is becoming standard practice in commercial-grade quantization pipelines targeting cloud inference, finance, healthcare, and government sectors.
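For readers unfamiliar with the second-order calibration referred to here, the block below restates the layer-wise objective and weight update from the published GPTQ method, in that method's usual notation. It is background only; the individual filings may formulate their Hessian-based steps differently. Here W is a layer's weight matrix, X the calibration inputs to that layer, F the set of weights not yet quantized, and quant(·) rounding to the quantization grid.

```latex
% Layer-wise objective: keep the layer output on calibration inputs X close
\hat{W} = \arg\min_{\hat{W}} \; \lVert W X - \hat{W} X \rVert_2^2,
\qquad H = 2 X X^{\top}

% After quantizing weight w_q, update the remaining full-precision weights F
\delta_F = -\,\frac{w_q - \operatorname{quant}(w_q)}{[H_F^{-1}]_{qq}}
           \cdot \left(H_F^{-1}\right)_{:,q}
```

The update spreads each weight's rounding error over the weights that are still in full precision, weighted by the inverse Hessian, which is what distinguishes this family from plain round-to-nearest.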
IP Strategy Signals for R&D and Patent Teams
Post-training quantization has become table stakes. Every major technology company in this dataset (Google, Amazon, Microsoft, NVIDIA, Intel, Qualcomm, and Samsung) has active PTQ filings. R&D teams entering without a PTQ position face a crowded patent landscape and should focus differentiation on calibration efficiency, hardware specificity, or extreme bit-width regimes.
The accuracy-compression frontier is shifting below 4 bits. Multiple 2025–2026 filings target INT4 or lower—including 1-bit binarization, FP4, and ULP—where standard integer quantization fails. Patent opportunities remain in error compensation, residual correction, and non-uniform codebook design at these bit-widths. Activation quantization remains harder than weight quantization and is a less-crowded filing space.
China is the highest-volume filing jurisdiction with strong academic-to-startup translation. Chinese universities including Tsinghua, Shanghai Jiao Tong, Sun Yat-sen, Harbin Institute of Technology (Shenzhen), Xiamen, Nanjing, and Beihang appear alongside startups such as Yuanqi Core Semiconductor and Hangzhou ByteArk Technology, indicating a rapid commercialization pathway. Non-Chinese companies should monitor CN publications for early disclosure of novel techniques before PCT or US filing. Relevant monitoring tools are available via PatSnap’s API for automated CN publication tracking.
Federated and distributed quantization is a structurally distinct IP cluster requiring simultaneous optimization of model size for communication cost, local accuracy, and aggregation efficiency. Google, Shenzhen Big Data Institute, and City University of Hong Kong are the identifiable leaders in this dataset; the space remains relatively open compared to standalone PTQ. The WIPO PCT system is the primary vehicle for international filing in this sub-cluster. Competitive intelligence resources are available at PatSnap Customers.
- MoE-specific compression: only 2 records in dataset despite frontier model prevalence
- Sub-4-bit activation quantization: harder than weight quantization, less crowded
- Non-uniform codebook design at 1-bit / ULP bit-widths
- Federated quantization: relatively open vs. standalone PTQ cluster
- Custom silicon for compressed inference: Tsinghua 2026 CN filing is early signal
- Polar coordinate quantization: novel reformulation with limited prior art
- Crowded PTQ landscape: all major US vendors have active filings
- CN publication lag: novel techniques may appear in CN before PCT/US
- Hardware portability constraints on GPU INT kernels and sparse tensor units
- Quantization error accumulation at sub-4-bit remains unsolved at scale
LLM Quantization & Compression — key questions answered
LLM quantization transforms high-precision floating-point model weights (typically FP32 or FP16) into lower-bit representations—integers, low-bit floats, or binary codes—while preserving functional accuracy, reducing memory footprint, computational cost, and inference latency.
In this dataset, L&T Technology Services Limited leads with 5 records, followed by Microsoft Technology Licensing, LLC, Google LLC, and Multiverse Computing S.L. with 4 records each. Baidu, NVIDIA, and Qualcomm each hold 3 records.
Post-training quantization converts a pre-trained LLM’s weights and/or activations to low-bit formats (INT4, INT8, FP4, FP8) using calibration datasets, without full retraining. Techniques include scaling factor optimization, range parameter learning, and activation-aware weight equalization.
Multiple 2025–2026 filings target INT4 or lower (1-bit binarization, FP4, ULP), where standard integer quantization fails. Patent opportunities remain in error compensation, residual correction, and non-uniform codebook design at these bit-widths, particularly for activation quantization which remains harder than weight quantization.
Only 2 records in this dataset specifically address Mixture-of-Experts (MoE) models, despite their increasing commercial prevalence. IP strategists should treat task-aware expert pruning and layer-selective quantization for MoE as a high-priority filing opportunity with limited prior art in this dataset.
China (CN) dominates with 30+ records in this dataset, reflecting strong domestic filing activity and active academic-to-patent translation from institutions including Tsinghua University, Shanghai Jiao Tong University, and the Chinese Academy of Sciences. US filings (15+ records) are concentrated among large technology companies.