MoE inference cost: 30+ patents analyzed

Mixture-of-Experts Architecture Inference Cost — PatSnap Insights

Mixture-of-Experts (MoE) architectures cut per-token inference cost by activating only a fraction of model parameters on each forward pass. Analysis of more than 30 patents from Intel, Microsoft, Google, Qualcomm, and leading research universities reveals how sparse gating, predictive prefetching, and asynchronous parallelism together make hundred-billion-parameter models economically deployable.

PatSnap Insights Team Innovation Intelligence Analysts 12 min read
Reviewed by the PatSnap Insights editorial team

The sparse gating bargain: more parameters, fewer FLOPs per token

MoE architecture reduces inference cost through conditional computation: instead of routing every input token through every weight in the network, a lightweight gating function selects only a small subset of specialist sub-networks, called experts, to process each token. As a result, total model capacity can scale into the hundreds of billions of parameters while per-token compute remains bounded. A 2026 WO filing from HE, SHA describes the trade-off in these terms: “sparse MoE models can vastly expand their number of parameters and improve performance, while keeping the computation costs” fixed.

30+ patent filings analysed across 5 jurisdictions
3.6× reduction in active compute vs. total parameters in Mixtral-8x7B
5.9× inference speedup over naive offloading (Shanghai Jiao Tong University)
25% average reduction in expert activations without accuracy loss (Peking University)

A concrete illustration of this arithmetic appears in a 2024 CN filing from Peking University: in the Mixtral-8x7B model, each token can access 47 billion parameters in total, yet only 13 billion participate in computation during any single inference iteration. That roughly 3.6× reduction in active compute versus total parameters is the essential promise of top-K routing — controlling FLOPs without sacrificing model capacity.
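
The ratio quoted above follows directly from the two parameter counts. The snippet below is a minimal sketch of that arithmetic using only the figures cited from the Peking University filing; the variable names are illustrative and not taken from any source.

```python
# Parameter counts quoted in the text for Mixtral-8x7B, in billions.
total_params_b = 47.0   # parameters a token can reach across all experts
active_params_b = 13.0  # parameters that participate in one inference iteration

# How many times larger the full model is than the per-token active slice.
capacity_ratio = total_params_b / active_params_b
# Share of the model that is actually exercised for any single token.
active_fraction = active_params_b / total_params_b

print(f"capacity ratio: {capacity_ratio:.1f}x")    # capacity ratio: 3.6x
print(f"active fraction: {active_fraction:.0%}")   # active fraction: 28%
```

Note that this compares parameter counts; actual FLOP and latency savings also depend on attention layers, memory traffic, and batching, which the sketch does not model.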

What is top-K routing in MoE?

Top-K routing is the gating mechanism that selects exactly K expert sub-networks to process each input token. The router produces a sparse encoding identifying destination experts; tokens are dispatched only to those experts; outputs are computed; and a sparse decoding step recombines results — all while bypassing computation at non-selected experts. Microsoft’s 2024 US patent “Sparse Encoding and Decoding at Mixture-of-Experts Layer” formalises this dispatch-and-collect process.
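
The dispatch-and-collect flow described above can be sketched in a few lines. This is an illustrative toy, not the method claimed in the Microsoft patent: the function names (`top_k_route`, `moe_layer`), the softmax router, and the renormalisation of weights over the selected experts are all assumptions chosen for clarity.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)  # assumption: renormalise over the top-k
    return [(i, probs[i] / norm) for i in chosen]

def moe_layer(token, router_logits, experts, k=2):
    """Dispatch one token to its top-k experts and combine their outputs."""
    routes = top_k_route(router_logits, k)
    output = [0.0] * len(token)
    for idx, weight in routes:
        expert_out = experts[idx](token)  # non-selected experts are never called
        for d, value in enumerate(expert_out):
            output[d] += weight * value
    return output, [idx for idx, _ in routes]

# Hypothetical toy experts: each one scales the token by a different factor.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
token = [1.0, -1.0]
router_logits = [0.1, 2.0, 0.3, 2.0]  # made-up router scores

output, used = moe_layer(token, router_logits, experts, k=2)
print(used)    # indices of the two experts that ran
print(output)  # weighted combination of their outputs
```

With four experts and K = 2, only half of the expert functions are ever invoked for this token, which is where the compute saving comes from.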

The gating mechanism need not be static. Microsoft’s 2024 US patent “Mixture-of-Experts Layer with Dynamic Gating” describes a system in which the number K of expert sub-models selected per token is allowed to differ across iterations, enabling the model to adaptively calibrate compute expenditure based on input complexity rather than using a fixed top-K throughout inference. Qualcomm’s 2025 JP and 2024 CN filings extend this further by combining a shared base model with a routing model and an ensemble integration step, enabling per-sample rather than per-class routing. Early sample-to-expert routing can bypass full base model execution entirely, and confident samples can exit classification before any expert involvement — compounding the per-token savings.
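
One way to picture a per-token K is sketched below. The selection rule is an assumption made for illustration only: experts are added in order of router probability until their cumulative mass reaches a threshold. The cited filings do not state that they use this rule, and `dynamic_k_route`, `threshold`, and `k_max` are hypothetical names.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def dynamic_k_route(router_logits, threshold=0.7, k_max=4):
    """Pick the fewest experts whose combined router probability reaches threshold."""
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, mass = [], 0.0
    for i in ranked[:k_max]:
        chosen.append(i)
        mass += probs[i]
        if mass >= threshold:
            break  # enough confidence accumulated; stop adding experts
    return chosen

confident = dynamic_k_route([6.0, 0.0, 0.0, 0.0])  # router strongly prefers one expert
uncertain = dynamic_k_route([1.0, 0.9, 0.8, 0.7])  # router is nearly indifferent
print(len(confident), len(uncertain))  # 1 3
```

Under this rule an "easy" token consumes one expert while an ambiguous token consumes three, so compute tracks input difficulty rather than a fixed budget.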

Figure 1 — MoE sparse activation: active vs. total parameters in Mixtral-8x7B
[Bar chart: 47B total parameters accessible per token versus 13B parameters activated per token.]
Mixtral-8x7B activates only 13B of its 47B total parameters per token — a 3.6× compute reduction that exemplifies the MoE inference cost bargain, as documented in Peking University’s 2024 CN patent filing.

Hardware-level optimisation: prefetching, caching, and quantization for memory-constrained deployment

Sparse activation creates a secondary engineering problem: although only a fraction of experts are active per token, the full set of expert weights must reside somewhere accessible, and loading them on-demand introduces significant I/O latency. The patent literature addresses this bottleneck through three complementary strategies — tiered buffer architectures, predictive prefetching, and mixed-precision quantization.

Tiered hot/cold expert buffering

Intel Corporation’s 2025 US patent “Methods and Apparatus for MoE Inference with Full and Partial Hot Expert Buffers” introduces a two-tier buffer architecture. A “full hot expert buffer” stores complete weights of frequently used experts for direct computation, while a “partial hot expert buffer” stores partial weights of moderately used experts for partially direct computation. Less frequently used experts remain in lower-cost storage and are loaded only when selected. This frequency-aware tiering avoids unnecessary high-bandwidth memory consumption while ensuring the most latency-sensitive experts are always ready. IBM’s 2025 CN filing “Function-Based Memory Hierarchy Activation” takes this further with a 3D processing-in-memory accelerator that maps expert sub-models to layers of in-memory compute units and uses hash-based routing functions to select which memory layers to activate — eliminating the traditional DRAM bandwidth bottleneck by performing computation directly within memory.
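
The frequency-aware tiering idea can be sketched as a simple ranking policy. This is an assumption-laden illustration, not Intel's claimed apparatus: the tier names mirror the text, but the slot counts, the activation counts, and the `assign_tiers` helper are hypothetical.

```python
def assign_tiers(activation_counts, full_hot_slots, partial_hot_slots):
    """Map each expert to a tier based on how often it has been selected."""
    ranked = sorted(activation_counts, key=activation_counts.get, reverse=True)
    tiers = {}
    for rank, expert in enumerate(ranked):
        if rank < full_hot_slots:
            tiers[expert] = "full_hot"     # complete weights kept in fast memory
        elif rank < full_hot_slots + partial_hot_slots:
            tiers[expert] = "partial_hot"  # only part of the weights kept resident
        else:
            tiers[expert] = "cold"         # loaded from cheaper storage on demand
    return tiers

# Made-up activation counts gathered over some profiling window.
counts = {"e0": 900, "e1": 40, "e2": 310, "e3": 5, "e4": 620, "e5": 120}
tiers = assign_tiers(counts, full_hot_slots=2, partial_hot_slots=2)
for expert in sorted(tiers):
    print(expert, tiers[expert])
```

A real system would presumably refresh the counts as traffic shifts and size the tiers from available memory; both are left out of this sketch.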

Predictive prefetching to hide I/O latency

UESTC’s 2026 CN patent proposes training a dedicated expert activation prediction model that learns the mapping from prompt data to expert activation patterns layer by layer. During inference, this predictor forecasts which experts will be needed across all layers for an incoming request, allowing all necessary experts to be preloaded in a single batch I/O operation rather than sequentially — eliminating the serialization between expert selection and expert execution that otherwise becomes a dominant latency bottleneck. Harbin Institute of Technology’s 2025 CN patent pairs a layer-level predictor with a token-level predictor and an LRU-managed cache, mirroring the ProMoE approach of predicting layer i+2 experts while computing layer i to overlap PCIe transfer with computation.
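
The interaction between a lookahead predictor and an LRU-managed cache can be sketched as follows. Everything here is a simplified assumption: the `ExpertCache` class, the `count_misses` helper, and the activation trace are hypothetical, the predictor is replaced by a perfect oracle, and prefetches run sequentially rather than overlapping with computation as they would on real hardware.

```python
from collections import OrderedDict

class ExpertCache:
    """LRU cache of expert weights in fast memory; capacity is counted in experts."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = OrderedDict()

    def load(self, key):
        """Bring an expert into the cache, evicting the least recently used one."""
        if key in self.slots:
            self.slots.move_to_end(key)
            return
        self.slots[key] = f"weights-for-{key}"  # placeholder for a real transfer
        if len(self.slots) > self.capacity:
            self.slots.popitem(last=False)

    def fetch(self, key):
        """Return True on a cache hit; on a miss, load the expert on demand."""
        hit = key in self.slots
        self.load(key)
        return hit

def count_misses(actual, predicted, cache, lookahead=2):
    """Walk the layers in order, prefetching `lookahead` layers ahead."""
    misses = 0
    for layer, needed in enumerate(actual):
        target = layer + lookahead
        if target < len(actual):
            for expert in predicted[target]:
                cache.load((target, expert))  # would overlap with compute in practice
        for expert in needed:
            if not cache.fetch((layer, expert)):
                misses += 1  # a miss means this layer stalls on I/O
    return misses

# Made-up activation trace: experts used at each of four layers.
actual = [[0, 1], [2, 3], [0, 2], [1, 3]]
no_prediction = [[] for _ in actual]

baseline = count_misses(actual, no_prediction, ExpertCache(capacity=8))
prefetched = count_misses(actual, actual, ExpertCache(capacity=8))  # perfect predictor
print(baseline, prefetched)  # 8 4
```

Even with a perfect predictor, the first `lookahead` layers still miss because nothing was prefetched for them; predicting from the prompt for all layers at once, as the UESTC filing describes, targets exactly that residual cost.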

Shanghai Jiao Tong University’s 2026 CN patent takes a speculative approach: a lightweight quantized draft model predicts target model expert activations, and corresponding full-precision expert weights are asynchronously prefetched. This system achieves up to 5.9× inference speedup over naive offloading and 1.8× improvement over fine-grained offloading frameworks on models such as Phi-MoE. Peking University’s 2024 CN patent adds a sensitivity-based adaptive expert gating mechanism that dynamically adjusts how many experts are activated per layer per input, achieving an average 25% reduction in expert activations across inference runs without accuracy degradation.

Mixed-precision quantization tuned to activation frequency

Quantization provides an orthogonal compression axis. Xi’an Jiaotong University’s 2025 CN patent proposes mixed-precision quantization that assigns high-precision formats (BF16) to frequently activated experts and lower-precision formats (INT4 or INT8) to rarely activated ones — acknowledging that uniform quantization strategies fail to exploit the non-uniform activation frequency distribution of MoE expert pools. Muxi Integrated Circuit’s 2026 CN patent similarly implements nested weight quantization followed by dynamic bit-width selection at inference time, adding bit-width-aware reordering of active sub-models to improve execution efficiency. According to IEEE research on quantization-aware inference, non-uniform precision assignment consistently outperforms uniform quantization when activation distributions are skewed — precisely the condition that MoE’s top-K routing creates.
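
To make the memory effect of frequency-aware precision concrete, the sketch below assigns a format to each expert from its activation frequency and totals the weight footprint. The thresholds, the expert size, and the frequency list are hypothetical values chosen for illustration; only the BF16/INT8/INT4 formats come from the text.

```python
BITS = {"bf16": 16, "int8": 8, "int4": 4}

def choose_precision(freq, hot=0.20, warm=0.05):
    """Pick a weight format from an expert's activation frequency (assumed cutoffs)."""
    if freq >= hot:
        return "bf16"
    if freq >= warm:
        return "int8"
    return "int4"

def footprint_gb(freqs, params_per_expert):
    """Total weight storage in GB when each expert uses its chosen precision."""
    total_bits = sum(BITS[choose_precision(f)] * params_per_expert for f in freqs)
    return total_bits / 8 / 1e9

# Made-up activation frequencies for eight experts, and a made-up expert size.
freqs = [0.40, 0.25, 0.15, 0.10, 0.04, 0.03, 0.02, 0.01]
params_per_expert = 1_000_000_000

mixed = footprint_gb(freqs, params_per_expert)
uniform_bf16 = len(freqs) * params_per_expert * BITS["bf16"] / 8 / 1e9
print(mixed, uniform_bf16)  # 8.0 16.0
```

In this toy configuration the mixed scheme halves the footprint relative to uniform BF16 while keeping the two most frequently used experts at full precision. The accuracy impact is not modelled here.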

Figure 2 — MoE inference optimisation strategies and their primary cost targets
Optimisation strategy: reported gain
Sparse top-K routing vs. dense activation: 3.6× fewer active params
Speculative offload prefetch vs. naive offloading: 5.9× speedup
Adaptive expert gating vs. fixed top-K: 25% fewer activations
Mixed-precision quantization (BF16 hot / INT4 cold experts): better accuracy-per-bit
Four patent-documented strategies for reducing MoE inference cost and their reported gains. Speculative offload prefetching delivers the largest reported speedup (5.9×), while adaptive gating cuts expert activations by an average of 25% without accuracy loss.

Distributed execution and parallelism at frontier scale: masking the all-to-all overhead

At frontier scale — models such as DeepSeek-R1 with 671 billion parameters — no single accelerator can hold all expert weights in GPU memory, making distributed expert parallelism essential. The MoE routing mechanism introduces a distinctive communication pattern: tokens must be dispatched across GPU boundaries to reach the GPU holding the selected expert, and results must be gathered back. This all-to-all communication is a major source of overhead in distributed MoE inference, and patents from Chinese universities and Microsoft address it from multiple angles.

The University of Science and Technology of China’s 2025 CN patent “Asynchronous Parallel Inference Method for MoE Models” addresses this by decoupling GPU computation from the all-to-all collective communication inherent in expert parallelism. Token-data communication proceeds asynchronously and in parallel with model computation, masking communication latency and eliminating synchronization wait overheads. Additionally, “hot” experts are co-located on GPU memory while “cold” experts are offloaded to CPU, enabling larger batch sizes and improved GPU utilization.
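
The decoupling of communication from computation can be sketched with a background worker that dispatches the next micro-batch while the current one is being computed. This is a single-process analogy under stated assumptions: `all_to_all` and `expert_compute` are hypothetical stand-ins that merely sleep, and a thread pool takes the place of real GPU streams or collective libraries.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def all_to_all(batch):
    """Stand-in for token dispatch across devices (hypothetical; just sleeps)."""
    time.sleep(0.01)
    return list(batch)

def expert_compute(batch):
    """Stand-in for expert computation on the received tokens."""
    time.sleep(0.01)
    return [x * 2 for x in batch]

def run_overlapped(micro_batches):
    """Compute micro-batch i while micro-batch i+1 is still being communicated."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as comm:
        pending = comm.submit(all_to_all, micro_batches[0])
        for nxt in micro_batches[1:] + [None]:
            dispatched = pending.result()  # wait only for the data needed right now
            if nxt is not None:
                pending = comm.submit(all_to_all, nxt)  # next transfer starts here
            results.append(expert_compute(dispatched))  # and overlaps this compute
    return results

outputs = run_overlapped([[1, 2], [3, 4], [5, 6]])
print(outputs)  # [[2, 4], [6, 8], [10, 12]]
```

The sketch assumes at least one micro-batch and preserves output order; the results are identical to a fully synchronous run, only the waiting is rearranged.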

Hangzhou Dianzi University’s 2025 CN patent decomposes the global AllGather communication used in sequence-parallel MoE training into per-expert AllGather operations, then pipelines each expert’s AllGather with the preceding expert’s computation. Since token computations for different experts are independent, this overlap is mathematically valid and significantly reduces idle compute time. A companion patent from the same institution uses a load model built from sampled token features, expert parameters, and memory/compute resource data to drive expert scheduling that minimises all-to-all communication delay and dynamically adjusts expert capacity values.
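
A simple timing model shows why splitting one global AllGather into per-expert operations helps. The model is an assumption for illustration: a single communication link processes gathers back to back, a single compute unit processes experts in order, and the durations are made-up numbers in arbitrary units.

```python
def serial_time(comm, comp):
    """All communication finishes first, then every expert computes."""
    return sum(comm) + sum(comp)

def pipelined_time(comm, comp):
    """Expert i's gather overlaps with expert i-1's computation."""
    comm_done = 0.0
    comp_done = 0.0
    for c, p in zip(comm, comp):
        comm_done += c                             # gathers run back to back
        comp_done = max(comp_done, comm_done) + p  # compute waits for its own data
    return comp_done

comm = [2.0, 2.0, 2.0, 2.0]  # per-expert gather time (made up)
comp = [3.0, 3.0, 3.0, 3.0]  # per-expert compute time (made up)

print(serial_time(comm, comp))     # 20.0
print(pipelined_time(comm, comp))  # 14.0
```

In this model only the first gather is exposed; every later gather hides behind the previous expert's computation, which is valid because, as the text notes, token computations for different experts are independent.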

Microsoft’s 2024 US patent “Mixture-of-Experts Layer with Switchable Parallel Modes” solves the parallelism-mode switching problem: the MoE layer can toggle between data parallel and expert-data-model parallel modes without transferring expert sub-model weights between processing devices, reducing the overhead of adapting to varying batch size and throughput conditions at runtime. Intel’s 2025 US patent “Edge Deployment of a Mixture of Experts Architecture” extends distributed MoE to heterogeneous edge environments, launching selected expert models across multiple edge nodes with dynamic instance scaling based on service level agreements or trends detected in input data. As noted by WIPO in its analysis of AI patent trends, distributed inference architectures have become one of the fastest-growing sub-categories in AI hardware and systems filings.

Key finding: async parallelism masks the all-to-all bottleneck

The University of Science and Technology of China’s asynchronous parallel inference patent demonstrates that decoupling token dispatch communication from GPU computation is the critical design move for maintaining high GPU utilisation in distributed MoE deployments — hot experts remain on GPU memory while cold experts are offloaded to CPU, enabling larger effective batch sizes without stalling compute pipelines.

Converting and fine-tuning existing models for MoE cost savings without training from scratch

MoE inference cost reduction does not require building a model from scratch — a growing body of patents addresses how to retrofit MoE savings onto existing dense model assets, or how to expand deployed MoE models incrementally.

Samsung SDS’s 2025 US and KR patents describe converting a dense pre-trained language model into an MoE architecture by extracting the feed-forward network (FFN) from each transformer layer, replacing it with an MoE block, and then optimizing the MoE block weights to match the original FFN output on training data. This approach recovers the inference cost savings of MoE without requiring training from scratch — a significant advantage for organisations with large existing dense model investments. According to PatSnap’s innovation intelligence research, dense-to-sparse model conversion is an emerging category in the broader AI efficiency patent landscape.
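
The output-matching objective can be illustrated with a deliberately tiny example. This is not the Samsung SDS procedure: the "dense FFN" is a single ReLU unit, the router is a fixed hand-written rule, each expert is one scalar weight, and plain stochastic gradient descent on squared error is an assumed optimiser. The sketch only shows the idea of fitting MoE weights so the block reproduces the original FFN output on training data.

```python
def dense_ffn(x):
    """Stand-in for a trained FFN: a single ReLU unit (hypothetical)."""
    return max(0.0, x)

def route(x):
    """Fixed toy router: expert 0 handles negatives, expert 1 the rest."""
    return 0 if x < 0 else 1

def fit_moe(samples, lr=0.1, epochs=50):
    """Fit one scalar weight per expert to match the dense FFN output."""
    w = [0.5, 0.5]
    for _ in range(epochs):
        for x in samples:
            e = route(x)
            err = w[e] * x - dense_ffn(x)  # MoE output minus FFN target
            w[e] -= lr * 2 * err * x       # gradient of squared error w.r.t. w[e]
    return w

samples = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]  # made-up training inputs
w = fit_moe(samples)
print(w)
```

After fitting, expert 0 converges towards a weight of 0 and expert 1 towards a weight of 1, so the two linear experts together reproduce the ReLU while only one of them runs per input.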

Baidu’s 2025 CN patent “MoE Fine-Tuning Method” applies Low-Rank Adaptation (LoRA) consistently to all linear layers of both expert modules and shared non-MoE modules, using differentiated low-bit quantization for each module type to reduce memory footprint during fine-tuning. The distributed training framework supports expert parallelism, allowing fine-tuning of very large MoE models within feasible memory budgets while maintaining model performance.
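
The memory argument for LoRA comes from simple parameter counting: a rank-r update to a d_out × d_in weight matrix trains two small factors instead of the full matrix. The sketch below shows that arithmetic; the layer shape and rank are hypothetical and are not taken from the Baidu filing.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)

d_in, d_out, rank = 4096, 14336, 16  # hypothetical linear layer and LoRA rank

full = d_in * d_out                  # parameters updated by full fine-tuning
lora = lora_trainable_params(d_in, d_out, rank)

print(full)                   # 58720256
print(lora)                   # 294912
print(f"{lora / full:.2%}")   # 0.50%
```

Because an MoE model repeats this layer shape once per expert, the saving applies to every expert's linear layers, and quantizing the frozen base weights reduces the remaining footprint further.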

Google LLC’s 2025 CN patent “Lifelong Pre-Training of MoE Neural Networks” describes a continual training framework in which new expert sub-networks are added to an existing MoE model while previously trained expert parameters are frozen. This partial expansion avoids full model retraining when incorporating new data distributions, saves compute resources relative to training from scratch, and allows the model to be deployed on edge devices with only small incremental updates when new capabilities are required.
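
The freeze-then-expand pattern can be sketched with plain data classes. The `Expert` and `MoEModel` types below are hypothetical and hold placeholder weights; a real framework would also need to extend the router to cover the new experts, which this sketch leaves out.

```python
from dataclasses import dataclass, field

@dataclass
class Expert:
    name: str
    weights: list
    trainable: bool = True

@dataclass
class MoEModel:
    experts: list = field(default_factory=list)

    def expand(self, new_experts):
        """Freeze every existing expert, then append the new trainable ones."""
        for expert in self.experts:
            expert.trainable = False
        self.experts.extend(new_experts)

    def trainable_params(self):
        """Count parameters that would still receive gradient updates."""
        return sum(len(e.weights) for e in self.experts if e.trainable)

model = MoEModel([Expert("e0", [0.0] * 4), Expert("e1", [0.0] * 4)])
before = model.trainable_params()
model.expand([Expert("e2", [0.0] * 4)])
after = model.trainable_params()
print(before, after)  # 8 4
```

Only the newly added expert remains trainable after expansion, so the cost of each update scales with the size of the increment rather than the size of the whole model.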

Robert Bosch’s 2026 US patent provides a systematic methodology for determining which layers of an existing neural network are best replaced with MoE blocks: candidate architectures with surrogate MoE layers at different positions are trained and validated against ground truth, and the configuration yielding the best accuracy is selected as the optimal insertion point. This layer-selection procedure maximises the accuracy-per-FLOP benefit of sparse MoE gating. Guidance from NIST on AI model evaluation frameworks increasingly emphasises this kind of systematic architecture search as a best practice for responsible AI deployment.
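
The selection loop described above reduces to evaluating each candidate position and keeping the best. In the sketch below, `select_insertion_layer` is a hypothetical helper, and the accuracy values are made-up placeholders standing in for a real train-and-validate run at each candidate layer.

```python
def select_insertion_layer(candidate_layers, train_and_validate):
    """Score each candidate MoE position and return the best one with all scores."""
    scores = {layer: train_and_validate(layer) for layer in candidate_layers}
    best = max(scores, key=scores.get)
    return best, scores

# Made-up validation accuracies keyed by the layer index that gets the MoE block.
made_up_accuracy = {2: 0.81, 6: 0.86, 10: 0.84}

best, scores = select_insertion_layer([2, 6, 10], made_up_accuracy.get)
print(best)  # 6
```

In practice each call to `train_and_validate` is expensive, so the candidate set would likely be kept small or pruned with cheaper proxies.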

Who holds the MoE patent landscape: leading assignees and innovation trends

The MoE patent dataset of more than 30 filings spans the United States, China, Japan, South Korea, and the European Union, with a clear concentration of activity in inference-time optimisation rather than training-time efficiency — reflecting the maturation of MoE from a training-efficiency tool into a deployment-efficiency infrastructure.

Intel Corporation holds two detailed US filings covering distributed edge deployment and hardware-level buffer management for expert weight access. Microsoft Technology Licensing LLC holds three US filings addressing gating dynamics, sparse dispatch/collect, and parallelism mode switching. Qualcomm holds Chinese and Japanese family members covering per-sample routing, early exit, and ensemble integration. Google LLC focuses on continual learning and incremental model expansion via a CN filing. Samsung SDS holds both Korean and US family filings on dense-to-MoE model conversion.

Chinese research universities collectively represent the most active national cluster in MoE inference optimisation. Shanghai Jiao Tong University holds two filings on speculative offloading and expert access prediction. Hangzhou Dianzi University holds two filings on load balancing and communication optimisation. Peking University, Harbin Institute of Technology, UESTC, and the University of Science and Technology of China each contribute filings with strong focus on memory-constrained and distributed deployment scenarios. This concentration aligns with broader trends documented by WIPO, which has tracked a significant rise in Chinese AI infrastructure patent filings since 2022.

Figure 3 — MoE patent filings by assignee category (dataset of 30+ filings)
[Bar chart, number of filings by assignee category: Chinese universities 12; Microsoft 3; Intel 2; Qualcomm 2; Others (Google, Samsung SDS, IBM, Bosch, Baidu) 5.]
Chinese research universities collectively account for the largest share of MoE inference optimisation patents in the dataset, with a strong focus on memory-constrained and distributed deployment scenarios. US corporations lead in hardware-level and edge deployment filings.

A clear trend across the dataset is the increasing focus on inference-time optimisation (prefetching, caching, quantization, and async parallelism) rather than solely training-time efficiency. This shift has significant implications for IP strategy at organisations building or deploying frontier AI systems. The PatSnap innovation intelligence platform tracks these filing trends in real time across all major jurisdictions.

References

  1. Operating Mixture-of-Experts (MoE) Architectures Using Neural Processing Units — HE, SHA, 2026 (WO)
  2. A Mixed Expert Model Inference Method — Peking University, 2024 (CN)
  3. Mixture-of-Experts Layer with Dynamic Gating — Microsoft Technology Licensing LLC, 2024 (US)
  4. Sparse Encoding and Decoding at Mixture-of-Experts Layer — Microsoft Technology Licensing LLC, 2024 (US)
  5. Mixture-of-Experts Layer with Switchable Parallel Modes — Microsoft Technology Licensing LLC, 2024 (US)
  6. Machine Learning Model Architecture Combining Mixture of Experts and Model Ensembling — Qualcomm, 2025 (JP)
  7. Machine Learning Model Architecture Combining Mixture of Experts and Model Ensembling — Qualcomm, 2024 (CN)
  8. Methods and Apparatus for MoE Inference with Full and Partial Hot Expert Buffers — Intel Corporation, 2025 (US)
  9. Edge Deployment of a Mixture of Experts (MoE) Architecture — Intel Corporation, 2025 (US)
  10. An MoE Large Language Model Inference Acceleration Method for Resource-Constrained Devices — UESTC, 2026 (CN)
  11. MoE Large Model Inference Optimization System and Method for Memory-Constrained Devices Based on Dual Prediction — Harbin Institute of Technology, 2025 (CN)
  12. A Speculative and Offload-Based MoE Large Model Inference System — Shanghai Jiao Tong University, 2026 (CN)
  13. Expert Access Prediction Method and System for MoE Large Language Models — Shanghai Jiao Tong University, 2025 (CN)
  14. An Optimization Method for Deploying MoE Models in Edge Computing Environments — Xi’an Jiaotong University, 2025 (CN)
  15. A Dynamic Scheduling Inference Method Based on MoE Models — Muxi Integrated Circuit (Shanghai) Co. Ltd., 2026 (CN)
  16. Function-Based Memory Hierarchy Activation — IBM, 2025 (CN)
  17. Asynchronous Parallel Inference Method for MoE Models — University of Science and Technology of China, 2025 (CN)
  18. A Communication Optimization Method for MoE Models Based on Expert Load Prediction and Sequence Parallelism — Hangzhou Dianzi University, 2025 (CN)
  19. A Dynamic Load Balancing Method for Distributed Training of MoE Models — Hangzhou Dianzi University, 2024 (CN)
  20. Lifelong Pre-Training of MoE Neural Networks — Google LLC, 2025 (CN)
  21. Method for Converting Trained Language Model into Language Model Having MoE Architecture — Samsung SDS Co. Ltd., 2025 (US)
  22. Method for Converting a Trained Language Model into a Language Model Having a Structure of Mixture of Experts — Samsung SDS Co. Ltd., 2025 (KR)
  23. MoE Fine-Tuning Method — Baidu, 2025 (CN)
  24. Optimizing MoE Integration into Neural Network Architectures — Robert Bosch, 2026 (US)
  25. WIPO — World Intellectual Property Organization (AI patent trend data)
  26. IEEE — Institute of Electrical and Electronics Engineers (quantization-aware inference research)
  27. NIST — National Institute of Standards and Technology (AI model evaluation frameworks)

All data and statistics in this article are sourced from the references above and from PatSnap’s proprietary innovation intelligence platform.
