Quantized AI Models on Edge Chips — PatSnap Eureka
Deploying Quantized AI Models on Industrial Edge Chips
A synthesis of 35+ patents mapping the full deployment pipeline — from float-to-fixed-point conversion and operator adaptation through heterogeneous hardware scheduling to adaptive model switching at runtime.
From Floating-Point Training to Fixed-Point Deployment
The foundational step in deploying AI models on industrial edge chips is converting full-precision floating-point models into lower-precision integer representations that match the arithmetic capabilities of edge silicon. The canonical pipeline trains a floating-point network on a PC host, converts it to a fixed-point embedded model using per-layer quantization formulas, preprocesses the quantization data, and executes all accelerated operators in hardware mode on an embedded AI accelerator. The result is a smaller model storage footprint, faster inference, higher compute density, and lower operating power.
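The per-layer float-to-fixed conversion can be sketched as follows. This is a minimal illustration, assuming symmetric signed quantization with one scale per layer; the patents do not disclose their exact per-layer formulas, so `quantize_layer`, `dequantize_layer`, `quantize_model`, and the toy weights are hypothetical names and values.

```python
from typing import Dict, List, Tuple


def quantize_layer(weights: List[float], bits: int = 8) -> Tuple[List[int], float]:
    """Symmetric per-layer quantization (assumed scheme): one scale per layer."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # real value per integer step
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize_layer(q: List[int], scale: float) -> List[float]:
    """Map integers back to approximate real values."""
    return [v * scale for v in q]


def quantize_model(model: Dict[str, List[float]], bits: int = 8) -> Dict[str, Tuple[List[int], float]]:
    """Quantize every layer independently, keeping each layer's own scale."""
    return {name: quantize_layer(w, bits) for name, w in model.items()}


# Toy float model: layer name -> flat list of weights (illustrative values).
float_model = {"conv1": [0.5, -1.0, 0.25], "fc": [0.02, -0.01, 0.04]}
fixed_model = quantize_model(float_model)
```

Because each layer keeps its own scale, a layer with small weights (such as `fc` above) still uses the full integer range instead of being crushed by the range of a larger layer.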
A more sophisticated variant is quantization-aware training (QAT), where quantization error is injected into the training graph before deployment. Hikvision's approach replaces each network layer with hardware-compatible target operator equivalents and conducts QAT using the edge device's underlying operator library — leveraging TorchScript as an intermediate representation and a lightweight engine such as PaddleSlim, without dependency on the full PyTorch training framework.
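To make "injecting quantization error into the training graph" concrete, here is a minimal sketch of fake quantization with a straight-through estimator on a single scalar weight. It is not Hikvision's method: the fixed `scale`, the squared-error loss, and the names `fake_quantize`, `ste_grad`, and `qat_step` are all assumptions chosen for illustration.

```python
def fake_quantize(x: float, scale: float, bits: int = 8) -> float:
    """Quantize then dequantize, so the forward pass sees quantization error."""
    qmax = 2 ** (bits - 1) - 1
    q = max(-qmax, min(qmax, round(x / scale)))
    return q * scale


def ste_grad(x: float, scale: float, bits: int = 8) -> float:
    """Straight-through estimator: treat rounding as identity inside the clip range."""
    qmax = 2 ** (bits - 1) - 1
    return 1.0 if abs(x) <= qmax * scale else 0.0


def qat_step(w: float, x: float, y: float, scale: float, lr: float = 0.1, bits: int = 8) -> float:
    """One gradient step on (fake_quantize(w) * x - y)^2 using the STE."""
    w_q = fake_quantize(w, scale, bits)
    err = w_q * x - y
    grad = 2.0 * err * x * ste_grad(w, scale, bits)
    return w - lr * grad


# Train the latent float weight so that its 4-bit quantized value fits y = 0.5 * x.
w = 0.0
for _ in range(200):
    w = qat_step(w, x=1.0, y=0.5, scale=0.1, bits=4)
```

The latent weight `w` stays in floating point during training; only its quantized image enters the loss, which is why the deployed integer model matches what training optimized.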
For ASIC-class chips where standard quantization schemes fail due to non-standard bitwidths, a two-stage format conversion approach applies a first-format conversion to the pretrained model, performs full-integer quantization at a configurable bit-width, then applies a second-format conversion to produce the final deployable model. This explicitly addresses the mismatch between academic quantization schemes and the constrained bitwidths of production AI silicon.
Post-training static quantization (PTQ) with sensitivity-guided mixed precision evaluates both pruning sensitivity and quantization sensitivity layer by layer, applies MinMax calibration to determine activation and weight ranges, and assigns lower bit-widths to insensitive blocks while preserving higher precision on sensitive modules. Weights are mapped with the PTQ formula quantized weight = round(scale × weight + zero_point), clipped to the target bit range.
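A minimal sketch of MinMax calibration feeding the formula quoted above. Note the convention: in round(scale × weight + zero_point) the scale is a multiplier (integer levels per unit of real value), whereas many toolchains define scale as the reciprocal and divide by it. The unsigned target range and the function names `minmax_calibrate` and `quantize` are assumptions for illustration.

```python
from typing import List, Tuple


def minmax_calibrate(values: List[float], bits: int = 8) -> Tuple[float, float]:
    """Return (scale, zero_point) such that q = round(scale * x + zero_point)."""
    qmin, qmax = 0, 2 ** bits - 1          # assumed unsigned target range
    lo, hi = min(values), max(values)
    if hi == lo:                           # degenerate range: avoid dividing by zero
        return 1.0, qmin - lo
    scale = (qmax - qmin) / (hi - lo)      # integer levels per unit of real value
    zero_point = qmin - scale * lo         # maps the observed minimum onto qmin
    return scale, zero_point


def quantize(x: float, scale: float, zero_point: float, bits: int = 8) -> int:
    """Apply the formula, then clip to the target bit range."""
    qmin, qmax = 0, 2 ** bits - 1
    return max(qmin, min(qmax, round(scale * x + zero_point)))


weights = [-1.0, -0.5, 0.0, 1.0]           # illustrative calibration data
scale, zp = minmax_calibrate(weights)
q = [quantize(w, scale, zp) for w in weights]
```

MinMax calibration maps the smallest observed value to the bottom of the integer range and the largest to the top, so a single outlier stretches the range and coarsens every other value; values outside the calibrated range are clipped.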
Technical Theme Distribution Across 35+ Edge AI Patents
Four dominant clusters identified in the 2020–2026 corpus: quantization pipelines, operator conversion, heterogeneous scheduling, and adaptive compression.
[Figure: Core Technical Themes in Edge AI Quantization Patents. Patent count by dominant technical theme across the 35+ document corpus (2020–2026).]
[Figure: Patent Assignees by Category. Industrial AI companies lead filings, followed by academic institutions and state-owned enterprises.]
Cross-Platform Format Conversion and Operator Substitution
Operator incompatibility is the most common deployment failure mode. These four patent-backed strategies systematically eliminate it.
Ethos-N NPU Operator Co-Design at Training Time
Shenzhen Unilumin's approach trains models using NPU-compatible operators in PyTorch, then converts parameters and reconstructs operator combinations to produce a model natively compatible with the NPU's TensorFlow-centric toolchain. Naive one-to-one operator mapping introduces redundant intermediate computation nodes, degrades quantization accuracy, and can produce incorrect inference results — problems eliminated by co-design at training time.
Arm Ethos-N NPU Ecosystem
Hikvision Private Operator Substitution Workflow
A three-stage workflow: detect non-standard operators absent from the target intermediate format's base operator set, replace each with a functionally equivalent composition of base operators (termed "private operators"), convert to an intermediate format (QIR), then convert again to the target platform's native format, followed by per-operator quantization. This layered substitution eliminates unsupported operator errors while preserving quantization fidelity.
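The detect-and-replace step can be sketched as a lookup over the model's operators. This is a simplified illustration: the model is reduced to a flat list of operator names (a real graph also carries tensor edges), and the base operator set and substitution table below are hypothetical, not taken from the patent or from any real toolchain.

```python
from typing import Dict, List

# Hypothetical base operator set of the target intermediate format.
BASE_OPS = {"conv2d", "add", "mul", "relu", "sigmoid", "matmul"}

# Hypothetical substitution table: unsupported operator -> chain of base operators.
SUBSTITUTIONS: Dict[str, List[str]] = {
    "swish": ["sigmoid", "mul"],    # swish(x) = x * sigmoid(x)
    "linear": ["matmul", "add"],    # linear(x) = x @ W + b
}


def find_unsupported(graph: List[str]) -> List[str]:
    """Detect operators that are absent from the base operator set."""
    return [op for op in graph if op not in BASE_OPS]


def substitute_operators(graph: List[str]) -> List[str]:
    """Replace each unsupported operator with its base-operator composition."""
    out: List[str] = []
    for op in graph:
        if op in BASE_OPS:
            out.append(op)
        elif op in SUBSTITUTIONS:
            out.extend(SUBSTITUTIONS[op])
        else:
            raise ValueError(f"no substitution known for operator {op!r}")
    return out


converted = substitute_operators(["conv2d", "swish", "linear"])
```

Failing loudly on an operator with no known substitution is deliberate: silently passing it through would only defer the error to the target platform's converter.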
QIR Intermediate Format
Automatic Matching with Dynamic Verilog/HLS Synthesis
Xi'an Tengkun's low-code platform identifies the target hardware profile, selects quantization strategy based on model parameter magnitude, and converts models using standard tooling (e.g., TFLite Micro Interpreter). When no existing hardware profile matches, it dynamically generates Verilog/HLS code templates to synthesize compatible logic, establishes a global virtual address space, and implements RDMA-based zero-copy data transfer between NPU and GPU.
RDMA Zero-Copy Transfer
Changan Automotive Operator-to-Compute-Unit Assignment
After quantization, operators in the quantized model are compiled against the edge chip's instruction set, assigned to specific compute units (e.g., CPU cores, NPU tiles), and scheduled with per-unit priority policies to maximize throughput given resource contention. This reflects the need to treat edge chip deployment not merely as a format conversion problem but as a real-time resource allocation problem.
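A minimal sketch of assigning operators to compute units and ordering them by priority within each unit. The operator-to-unit table, the priority values, and the `schedule` function are hypothetical; data dependencies between operators are ignored here, which a real scheduler could not do.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical mapping from operator type to the compute unit that runs it.
UNIT_FOR_OP = {"conv2d": "npu", "matmul": "npu", "softmax": "cpu", "nms": "cpu"}


def schedule(ops: List[Tuple[str, str, int]]) -> Dict[str, List[str]]:
    """ops holds (name, op_type, priority); a lower priority value runs first."""
    queues: Dict[str, list] = {}
    for order, (name, op_type, priority) in enumerate(ops):
        unit = UNIT_FOR_OP.get(op_type, "cpu")   # assumed fallback: CPU runs anything
        # 'order' breaks ties so equal-priority operators keep submission order.
        heapq.heappush(queues.setdefault(unit, []), (priority, order, name))
    plan: Dict[str, List[str]] = {}
    for unit, queue in queues.items():
        plan[unit] = [heapq.heappop(queue)[2] for _ in range(len(queue))]
    return plan


plan = schedule([
    ("backbone_conv", "conv2d", 2),
    ("score_softmax", "softmax", 1),
    ("head_matmul", "matmul", 1),
    ("box_nms", "nms", 1),
])
```

One priority queue per compute unit keeps the policy per-unit, matching the idea that contention is resolved locally on each unit rather than through one global order.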
Real-Time Resource Allocation
Scheduling Quantized Models Across NPU, CPU, DSP, and FPGA
Industrial edge SoCs integrate multiple compute units with distinct instruction sets and latency profiles. These patents codify how to map quantized models across them efficiently.
Three-Layer Heterogeneous Scheduling Architecture
Xi'an Xingxun's framework assigns feature extraction operators to the NPU, offloads classification operators to the multi-core CPU, manages on-chip SRAM through a virtual memory paging mechanism, and uses a DMA controller to implement zero-copy data transfers — eliminating redundant memory copies between processing stages. The same framework applies structural pruning based on attention-head importance scores and dynamic feed-forward network sparsification driven by input tensor entropy values.
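Entropy-driven sparsification can be illustrated with a histogram entropy estimate that sets how much of the feed-forward network stays active. The patent summary does not say how entropy maps to sparsity, so the linear policy in `ffn_keep_ratio` (keep more units for higher-entropy inputs), the bin count, and the `min_keep` floor are all assumptions.

```python
import math
from typing import List


def tensor_entropy(values: List[float], bins: int = 8) -> float:
    """Shannon entropy in bits of a histogram over the tensor's values."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0                        # a constant tensor carries no information
    counts = [0] * bins
    for v in values:
        idx = min(bins - 1, int((v - lo) / (hi - lo) * bins))
        counts[idx] += 1
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts if c)


def ffn_keep_ratio(entropy: float, bins: int = 8, min_keep: float = 0.25) -> float:
    """Assumed policy: keep more FFN units when the input carries more information."""
    max_entropy = math.log2(bins)
    return min_keep + (1.0 - min_keep) * (entropy / max_entropy)


flat_input = [1.0] * 10                       # constant input: lowest entropy
varied_input = [i + 0.5 for i in range(8)]    # one value per bin: highest entropy
ratios = [ffn_keep_ratio(tensor_entropy(t)) for t in (flat_input, varied_input)]
```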
Qualcomm Hybrid Fixed/Floating-Point Inference-Training Pipeline
The ANN model runs forward inference entirely in fixed-point format on a DSP, while backward gradient computation is selectively routed to either the GPU (floating-point) or the DSP (fixed-point) depending on the measured loss magnitude. This enables on-chip continual learning without dedicated floating-point hardware paths for routine inference — critical for industrial edge devices that must adapt to distribution shift without cloud connectivity.
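The routing decision can be sketched as a threshold on the measured loss. The source does not state which loss range goes to which unit; the direction below (small losses, whose gradients are more likely to underflow in fixed point, go to the floating-point path) and the threshold value are assumptions, and the backend labels are hypothetical.

```python
from typing import List, Tuple


def route_backward(loss: float, threshold: float = 0.05) -> str:
    """ASSUMED policy: small losses use the float GPU path, others the fixed-point DSP path."""
    return "gpu_float" if abs(loss) < threshold else "dsp_fixed"


def plan_training(losses: List[float], threshold: float = 0.05) -> List[Tuple[str, str]]:
    """Forward always runs fixed-point on the DSP; only the backward pass is routed."""
    return [("dsp_fixed", route_backward(loss, threshold)) for loss in losses]


# Losses shrinking over training shift the backward pass toward the float path.
steps = plan_training([0.9, 0.3, 0.04, 0.01])
```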
Who Is Driving Edge AI Quantization Innovation?
Hikvision (Hangzhou Hikvision Digital Technology) is the most prolific assignee in the corpus, contributing at least three distinct patents covering quantization-aware training pipelines, operator substitution for cross-platform model deployment, and sub-block quantization for inference acceleration on edge devices. Its innovations consistently target the full model lifecycle from cloud training through edge deployment.
Xi'an Xingxun Intelligent Communication Technology has filed two substantially identical patents (active, 2025) covering the combined structural pruning + dynamic sparsification + mixed-precision quantization + heterogeneous scheduling deployment pipeline, signaling a focus on end-to-end compression-to-deployment automation for complex industrial scenes.
Qualcomm contributes the only PCT-family patent in this corpus, covering on-device hybrid precision inference-training pipelines using heterogeneous GPU/DSP hardware — reflecting a global standardization interest in fixed-point/floating-point co-execution architectures for mobile and industrial edge SoCs. This aligns with WIPO's growing body of edge AI PCT filings.
China Mobile Research Institute addresses the storage-compute integration dimension through model weight deployment on processing-in-memory chips, automatically mapping neural network weight matrices to idle crossbar arrays in compute-in-memory (CiM) chips — a hardware approach that avoids the von Neumann memory bottleneck entirely.
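Mapping a weight matrix onto idle crossbar arrays can be sketched as tiling plus assignment. This is a minimal illustration, assuming square crossbars of a fixed size and a first-fit assignment; the function name, the crossbar size, and the array identifiers are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def map_to_crossbars(rows: int, cols: int, xbar: int, idle: List[int]) -> Dict[int, Tuple[int, int]]:
    """Tile a rows x cols matrix into xbar x xbar blocks, one block per idle array.

    Returns {array_id: (tile_row, tile_col)}.
    """
    tiles = [
        (r, c)
        for r in range(math.ceil(rows / xbar))
        for c in range(math.ceil(cols / xbar))
    ]
    if len(tiles) > len(idle):
        raise RuntimeError(f"need {len(tiles)} crossbars, only {len(idle)} idle")
    return dict(zip(idle, tiles))          # first-fit: idle arrays used in listed order


# A 300 x 128 layer on 128 x 128 crossbars needs three tiles; array 11 stays idle.
mapping = map_to_crossbars(rows=300, cols=128, xbar=128, idle=[4, 7, 9, 11])
```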
Academic contributors, including Wuhan University, Tongji University, Chongqing University, and Chongqing University of Posts and Telecommunications, contribute foundational compression and deployment methodology that industrial assignees translate into product-level implementations.
China Railway High-Tech Industry and China Southern Power Grid represent industrial adopters independently developing domain-specific edge deployment methods for fault diagnosis and power infrastructure use cases, reflecting sectoral urgency around industrial AI edge deployment. The IEEE has published extensively on the reliability requirements driving these safety-critical deployments.
Seven Deployment Principles from 35+ Patents
Distilled from the full patent corpus — actionable guidance for R&D engineers and embedded AI architects.
Quantization Is a Multi-Stage Workflow, Not a Single Step
From float-to-fixed conversion through hardware-mode operator acceleration, each stage must be co-designed with the target chip's arithmetic capabilities. The canonical pipeline involves training, per-layer conversion, data preprocessing, and hardware-mode execution on an embedded AI accelerator.
Shanghai Qigan · Zhuhai Yizhi
Operator Incompatibility Is the Most Common Deployment Failure
Detecting unsupported operators, substituting them with hardware-compatible equivalents, and ensuring format conversion does not introduce redundant compute nodes is essential. Hikvision's private operator substitution and Unilumin's framework-aware training approach both address this systematically.
Hikvision · Shenzhen Unilumin
Mixed-Precision Outperforms Uniform Quantization
Assigning lower bit-widths to insensitive layers and higher bit-widths to sensitive ones — guided by structured sensitivity analysis — preserves accuracy while maximizing compression. Tongji University's end-to-end sensitivity-guided PTQ method and Xi'an Xingxun's dynamic mixed-precision pipeline both demonstrate this principle.
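A minimal sketch of sensitivity-guided bit-width assignment. Real sensitivity analysis measures the effect on model output or accuracy; here the weight-space quantization error at the low bit-width stands in as a proxy, which is an assumption, as are the threshold, the layer names, and the weights.

```python
from typing import Dict, List


def quant_error(weights: List[float], bits: int) -> float:
    """Mean squared error introduced by symmetric quantization at this bit-width."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights) or 1.0   # guard against an all-zero layer
    scale = max_abs / qmax
    return sum((w - round(w / scale) * scale) ** 2 for w in weights) / len(weights)


def assign_bit_widths(
    layers: Dict[str, List[float]],
    threshold: float,
    low_bits: int = 4,
    high_bits: int = 8,
) -> Dict[str, int]:
    """Give a layer high_bits only if low_bits would cost more error than threshold."""
    plan: Dict[str, int] = {}
    for name, weights in layers.items():
        sensitive = quant_error(weights, low_bits) > threshold
        plan[name] = high_bits if sensitive else low_bits
    return plan


# Illustrative layers: 'head' has one large weight that crushes its small ones at 4 bits.
layers = {"head": [1.0, 0.05, -0.03, 0.02], "backbone": [0.7, -0.7, 0.35, 0.0]}
plan = assign_bit_widths(layers, threshold=0.0008)
```

In this toy case `head` is the sensitive layer because its outlier weight sets the scale, so its small weights all round to zero at 4 bits.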
Tongji University · Xi'an Xingxun
Heterogeneous Scheduling Is Essential for Full Chip Utilization
Feature extraction operators should be routed to NPU/FPGA fabric, while classification and control operators are better suited to multi-core CPUs. This hardware-aware partitioning also enables zero-copy DMA data transfers, as codified in Xi'an Xingxun's scheduling framework; Qualcomm's hybrid fixed/floating-point inference-training pipeline applies the same partitioning idea across DSP and GPU.
Xi'an Xingxun · Qualcomm
Deploying Quantized AI on Edge Chips — Key Questions Answered
The canonical pipeline involves training a floating-point network on a PC host, converting it to a fixed-point embedded model using per-layer quantization formulas, preprocessing quantization data, and then executing all accelerated operators of each network layer in hardware mode on an embedded AI accelerator. The result is reduced model storage footprint, accelerated inference, improved embedded device compute density, and lower operational power consumption.
Quantization-aware training (QAT) injects quantization error into the training graph before deployment, yielding a model that reduces floating-point operation counts and lowers cost on edge hardware. Post-training static quantization (PTQ) with sensitivity-guided mixed precision evaluates both pruning sensitivity and quantization sensitivity layer-by-layer, applies MinMax calibration to determine activation and weight ranges, and assigns lower bit-widths to insensitive blocks while preserving higher precision on sensitive modules — using the PTQ formula: quantized weight = round(scale × weight + zero_point), clipped to the target bit range.
Models trained in mainstream frameworks (PyTorch, TensorFlow) cannot be directly executed on vendor-specific NPUs or ASICs because the chip toolchains support only a subset of operators, and their computational semantics often diverge from training-time equivalents. Naive one-to-one operator mapping introduces redundant intermediate computation nodes, degrades quantization accuracy, and can produce incorrect inference results.
Feature extraction operators are assigned to the NPU, classification operators are offloaded to the multi-core CPU, on-chip SRAM is managed through a virtual memory paging mechanism, and a Direct Memory Access (DMA) controller implements zero-copy data transfers — eliminating redundant memory copies between processing stages. This hardware-aware partitioning maximizes edge chip utilization.
Quantized models can also be switched adaptively at runtime. A graph neural network-based decision engine can monitor multi-dimensional resource state vectors and trigger model version switches (e.g., from 4-bit to 8-bit quantization) when resource conditions change. Federated learning-trained quantization compensation models, integrated into FPGA acceleration circuits with dynamic voltage-frequency scaling (DVFS), enable automatic model version switching in response to resource pressure.
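The switching logic can be sketched with a plain threshold rule standing in for the graph neural network decision engine; this is a deliberate simplification, not the patented mechanism. The resource fields, thresholds, and variant names (`int4`, `int8`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ResourceState:
    cpu_load: float         # 0.0 to 1.0
    free_memory_mb: float
    temperature_c: float


def choose_variant(state: ResourceState, current: str) -> str:
    """Threshold policy (assumed): drop to int4 under pressure, return to int8 when relaxed."""
    under_pressure = (
        state.cpu_load > 0.85
        or state.free_memory_mb < 64
        or state.temperature_c > 80
    )
    relaxed = (
        state.cpu_load < 0.50
        and state.free_memory_mb > 256
        and state.temperature_c < 65
    )
    if under_pressure:
        return "int4"
    if relaxed:
        return "int8"
    return current          # hysteresis band: keep the current variant to avoid flapping


variant = choose_variant(ResourceState(0.95, 512, 60), current="int8")
```

The gap between the two sets of thresholds is the design choice worth keeping in any real policy: without it, a device hovering near one threshold would reload models on every check.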
Profiling edge hardware into a device fingerprint vector (compute score, storage, memory, processor type) enables automatic matching to appropriate model compression strategies. For platforms without existing hardware profiles, custom Verilog/HLS logic can be synthesized on-demand via dynamic code generation, establishing a global virtual address space and implementing RDMA-based zero-copy data transfer between NPU and GPU.
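Fingerprint matching can be sketched as a nearest-profile lookup. This is an illustration under stated assumptions: the numeric fields are taken to be pre-normalized to 0..1, a processor-type mismatch adds a fixed penalty, and the profile table, strategy names, and `max_distance` cutoff are hypothetical.

```python
from typing import Dict, Optional, Tuple

# (compute score, storage, memory, processor type); numeric fields normalized to 0..1.
Fingerprint = Tuple[float, float, float, str]

# Hypothetical known profiles and the compression strategy attached to each.
PROFILES: Dict[str, Tuple[Fingerprint, str]] = {
    "small_mcu": ((0.1, 0.1, 0.1, "mcu"), "int4_pruned"),
    "mid_npu": ((0.5, 0.4, 0.4, "npu"), "int8_mixed"),
    "large_gpu": ((0.9, 0.9, 0.9, "gpu"), "fp16"),
}


def distance(a: Fingerprint, b: Fingerprint) -> float:
    """Euclidean distance on numeric fields plus a penalty for a different processor type."""
    numeric = sum((x - y) ** 2 for x, y in zip(a[:3], b[:3])) ** 0.5
    return numeric + (0.0 if a[3] == b[3] else 1.0)


def match_strategy(device: Fingerprint, max_distance: float = 0.5) -> Optional[str]:
    """Return the nearest profile's strategy, or None if no profile is close enough."""
    profile, strategy = min(PROFILES.values(), key=lambda p: distance(device, p[0]))
    if distance(device, profile) > max_distance:
        return None         # no existing hardware profile matches
    return strategy


strategy = match_strategy((0.45, 0.4, 0.5, "npu"))
```

A `None` result corresponds to the case described above where no existing hardware profile fits and compatible logic has to be generated instead.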
References
- Neural Network Model Real-Time Automatic Quantization Method and System — Shanghai Qigan Electronic Information Technology, 2021
- Neural Network Model Real-Time Automatic Quantization Method and System (Updated) — Shanghai Qigan Electronic Information Technology, 2024
- Model Training Method and Apparatus (QAT for Edge Image Processors) — Hangzhou Hikvision Digital Technology, 2024
- Model Deployment Method, Device, Apparatus, Chip, and Storage Medium (ASIC Full-Integer Quantization) — Zhuhai Yizhi Electronics, 2020
- A Single-Chip Computational Imaging Edge Reconstruction Method Based on End-to-End Sensitivity Analysis — Tongji University, 2025
- Model Deployment Method, Apparatus, Device, and Program Product (Ethos-N NPU Operator Co-Design) — Shenzhen Unilumin Technology, 2025
- A Model Deployment Method and Apparatus (Private Operator Substitution) — Hangzhou Hikvision Digital Technology, 2024
- Cross-Hardware Platform AI Algorithm Model Automatic Matching and Migration Method and System — Xi'an Tengkun Electronics, 2025
- AI Large Model Lightweight Deployment Method for Complex Scenes (I) — Xi'an Xingxun Intelligent Communication Technology, 2025
- AI Large Model Lightweight Deployment Method for Complex Scenes (II) — Xi'an Xingxun Intelligent Communication Technology, 2025
- On-Device Unified Inference-Training Pipeline of Hybrid Precision Forward-Backward Propagation — Qualcomm Incorporated, 2025
- Dynamic Model Switching Framework for AI Inference Optimization on Edge Devices — Beijing Kejie Technology, 2025
- AI Model Deployment and AI Computing Method, System (SoC MCU+FPGA Operator Dispatch) — Guangdong Gowin Semiconductor, 2024
- Inference Acceleration Method and Apparatus for Edge Devices (Sub-Block Quantization) — Hangzhou Hikvision Digital Technology, 2025
- Model Deployment Method, Apparatus, Device, Storage Medium, and Program Product — Chongqing Changan Automobile, 2025
- Model Weight Deployment Method on Processing-in-Memory Chips — China Mobile Research Institute, 2024
- An AI Model Adaptive Deployment Method, Apparatus, Device, and Medium (Device Fingerprint Vector) — State Grid Henan Information Communication, 2025
- Intelligent Edge Computing Platform with Machine Learning Capability — Fog Horn Systems, 2021
- Robot AI Model Dynamic Compression Method and Control System — Shanghai Sazhi Intelligent Technology, 2025
- Lightweight Deployment Method and System for Industrial Scene Large Models — China Railway High-Tech Industry, 2025
- Meat Quality Detection Model Compression Method and System Based on Edge Computing — Shandong Ruicheng Data Technology, 2025
- WIPO — World Intellectual Property Organization: PCT Filing Data and Edge AI Patent Trends
- IEEE — Institute of Electrical and Electronics Engineers: Edge AI and Embedded Systems Publications
- PyTorch — TorchScript Intermediate Representation Documentation
All patent data and technical claims on this page are sourced from the references above and from PatSnap's proprietary innovation intelligence platform. Patent analysis conducted via PatSnap Eureka.