
MoE inference cost cuts: 30+ patents analyzed


Mixture-of-Experts (MoE) architecture cuts frontier AI inference cost by activating only a fraction of model parameters per token. A patent analysis of 30+ filings from Intel, Microsoft, Google, Qualcomm, and leading Chinese research universities reveals how sparse routing, expert prefetching, and asynchronous parallelism are making hundred-billion-parameter models economically deployable.

PatSnap Insights Team Innovation Intelligence Analysts 11 min read
Reviewed by the PatSnap Insights editorial team

The inference cost bargain: sparse gating and conditional computation

Mixture-of-Experts architecture reduces AI inference cost through a mechanism called conditional computation: rather than passing every input token through every parameter in the network — as dense models do — a lightweight router selects only a small subset of specialist sub-networks, called experts, to process each token. The arithmetic is striking. As documented in patent filings analysed for this piece, in the Mixtral-8x7B model each token can access 47 billion parameters in total, yet only 13 billion participate in computation during any single inference iteration — a roughly 3.6× reduction in active compute versus total parameters, achieved without sacrificing model capacity.
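The arithmetic behind the 3.6× figure can be checked in a few lines. The sketch below uses only the two parameter counts quoted above; the function and variable names are illustrative and do not come from any filing.

```python
# Parameter counts quoted in the article for Mixtral-8x7B, in billions.
TOTAL_PARAMS_B = 47.0
ACTIVE_PARAMS_B = 13.0


def sparsity_summary(total_b: float, active_b: float) -> dict:
    """Return the share of parameters used per token and the total-to-active ratio."""
    return {
        "active_fraction": active_b / total_b,
        "reduction_ratio": total_b / active_b,
    }


summary = sparsity_summary(TOTAL_PARAMS_B, ACTIVE_PARAMS_B)
print(f"active fraction: {summary['active_fraction']:.1%}")
print(f"reduction ratio: {summary['reduction_ratio']:.1f}x")
```

Under these figures roughly 28% of the parameters take part in each forward pass, which is where the "roughly 3.6×" ratio comes from.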

30+: MoE patent filings analysed across 5 jurisdictions
3.6×: reduction in active compute vs total parameters (Mixtral-8x7B)
5.9×: inference speedup over naive offloading (Shanghai Jiao Tong University)
25%: reduction in expert activations without accuracy loss (Peking University)

This is the essential MoE cost bargain: more total parameters for higher model capacity, but far fewer activated per forward pass, keeping per-token FLOPs bounded. According to PatSnap’s patent intelligence research, this principle now underpins more than 30 filings across the United States, China, Japan, South Korea, and the European Union — a body of innovation that spans gating algorithms, memory management, distributed execution, and model conversion.

What is top-K routing in MoE?

Top-K routing is the gating mechanism by which a lightweight function selects exactly K expert sub-networks to process each input token, dispatching the token only to those K experts and ignoring all others. By keeping K small relative to the total number of experts, top-K routing ensures that per-token compute scales with K rather than with total expert count, enabling sublinear inference cost relative to total model parameter size.
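A minimal sketch of top-K routing in plain Python may make the mechanism concrete. It assumes a softmax over the selected router scores and toy scalar "experts"; the names `top_k_route` and `moe_forward` and the example logits are hypothetical, not taken from any of the filings.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and renormalise their gate weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_forward(token, router_logits, experts, k):
    """Run only the selected experts and mix their outputs by gate weight."""
    output = 0.0
    for idx, weight in top_k_route(router_logits, k):
        output += weight * experts[idx](token)
    return output


# Hypothetical toy experts: each one just scales a scalar input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 1.0]

routes = top_k_route(logits, k=2)
y = moe_forward(1.5, logits, experts, k=2)
print(routes, y)
```

With four experts and K = 2, only experts 1 and 3 are evaluated for this token; adding more experts to the pool would not change the number of expert calls per token.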

The gating mechanism itself can be implemented with dynamic flexibility. Microsoft Technology Licensing’s 2024 US patent on dynamic gating describes a system in which the number K of expert sub-models selected per token is allowed to differ across iterations, so the model can calibrate compute expenditure to input complexity: it avoids over-computation on simpler inputs while preserving full capacity for complex ones. A complementary Microsoft filing formalises the dispatch-and-collect process. The gating function produces a sparse encoding identifying destination expert sub-models, tokens are dispatched to those experts only, outputs are computed, and a sparse decoding step recombines the results, all while avoiding computation at non-selected experts.
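The dispatch-and-collect flow with a per-token expert count can be sketched as follows. The selection rule used here (take experts in descending probability until a cumulative threshold is reached) is an assumption chosen for illustration; the article does not state which criterion the Microsoft filings use, and all function names are hypothetical.

```python
def select_dynamic_k(probs, threshold=0.7, max_k=4):
    """Assumed rule: add experts in descending probability until mass >= threshold."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, mass = [], 0.0
    for i in ranked[:max_k]:
        chosen.append(i)
        mass += probs[i]
        if mass >= threshold:
            break
    return chosen


def dispatch(batch_probs, threshold=0.7):
    """Sparse encoding: map expert id -> list of (token index, gate weight)."""
    buckets = {}
    for t, probs in enumerate(batch_probs):
        chosen = select_dynamic_k(probs, threshold)
        norm = sum(probs[i] for i in chosen)
        for i in chosen:
            buckets.setdefault(i, []).append((t, probs[i] / norm))
    return buckets


def collect(tokens, buckets, experts):
    """Sparse decoding: run only experts that received tokens, then recombine."""
    outputs = [0.0] * len(tokens)
    for expert_id, assignments in buckets.items():
        for t, weight in assignments:
            outputs[t] += weight * experts[expert_id](tokens[t])
    return outputs


# Hypothetical toy experts and router probabilities for two tokens.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
tokens = [1.0, 1.0]
batch_probs = [
    [0.90, 0.05, 0.03, 0.02],  # confident router: one expert suffices
    [0.40, 0.35, 0.15, 0.10],  # less confident router: two experts are used
]

buckets = dispatch(batch_probs)
outputs = collect(tokens, buckets, experts)
print(buckets, outputs)
```

In this toy run the first token is sent to one expert and the second to two, while experts 2 and 3 receive no tokens and are never called.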


Qualcomm’s filings (both Chinese and Japanese family members) extend this further by combining a shared base model with a routing model and ensemble integration, enabling per-sample rather than per-class routing. Their architecture also supports early-exit classification for samples that do not require expert involvement at all, further reducing computational cost for straightforward inputs. According to WIPO, patent filings in AI hardware and inference optimization have grown substantially year-on-year, reflecting the commercial urgency of reducing the cost of deploying large language models.

Figure 1 — MoE sparse activation: active vs total parameters per inference token
[Bar chart: 47B total parameters versus 13B parameters active per token (top-K routing) in Mixtral-8x7B, a 3.6× reduction.]
Mixtral-8x7B holds 47B total parameters but activates only 13B per token via top-K routing — a 3.6× reduction in active compute that keeps inference cost sublinear relative to model size, as documented in Peking University’s 2024 patent filing.

“Sparse MoE models can vastly expand their number of parameters and improve performance, while keeping the computation costs bounded — the essential inference cost bargain of the MoE architecture.”

Hardware-level optimization: prefetching, caching, offloading, and quantization

Sparse activation creates a secondary engineering challenge: although only a fraction of experts are active per token, the full set of expert weights must reside somewhere accessible, and loading them on-demand introduces significant I/O latency. Multiple patents address this bottleneck through tiered buffer architectures, predictive preloading, and mixed-precision quantization.

Intel Corporation’s 2025 US patent introduces a two-tier buffer architecture for MoE inference: a “full hot expert buffer” stores complete weights of frequently used experts for direct computation, while a “partial hot expert buffer” stores partial weights of moderately used experts for partially direct computation. Less frequently used experts remain in lower-cost storage and are loaded only when selected. This frequency-aware tiering avoids unnecessary high-bandwidth memory consumption while ensuring that the most latency-sensitive experts are always ready for immediate access.

Intel Corporation’s MoE inference patent describes a two-tier hot expert buffer architecture in which frequently used experts are stored in full in high-bandwidth memory for direct computation, while moderately used experts are stored partially, and rarely used experts are loaded from lower-cost storage only when selected by the router.
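Frequency-aware tiering of this kind can be illustrated with a short sketch. The slot counts, the 50% partial-residency fraction, and the names `assign_tiers` and `bytes_to_fetch` are assumptions made for the example; the filing as summarised here does not specify them.

```python
from collections import Counter


def assign_tiers(activation_counts, full_slots, partial_slots):
    """Rank experts by activation count and assign each one a buffer tier."""
    ranked = [expert for expert, _ in Counter(activation_counts).most_common()]
    tiers = {}
    for rank, expert in enumerate(ranked):
        if rank < full_slots:
            tiers[expert] = "full_hot"
        elif rank < full_slots + partial_slots:
            tiers[expert] = "partial_hot"
        else:
            tiers[expert] = "cold"
    return tiers


def bytes_to_fetch(tier, expert_bytes, partial_fraction=0.5):
    """Bytes still to be loaded when the router selects an expert in this tier.

    partial_fraction is an assumed share of weights kept resident for
    partially hot experts.
    """
    if tier == "full_hot":
        return 0
    if tier == "partial_hot":
        return int(expert_bytes * (1 - partial_fraction))
    return expert_bytes


# Hypothetical activation counts gathered from profiling.
counts = {"e0": 120, "e1": 15, "e2": 480, "e3": 60, "e4": 3}
tiers = assign_tiers(counts, full_slots=1, partial_slots=2)
for expert in sorted(tiers):
    print(expert, tiers[expert], bytes_to_fetch(tiers[expert], expert_bytes=1_000_000))
```

The most frequently selected expert needs no transfer at all, moderately used experts need only the missing part of their weights, and rarely used experts pay the full load cost only on the occasions they are selected.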

Predictive preloading takes this further. UESTC’s 2026 CN filing proposes training a dedicated expert activation prediction model that learns the mapping from prompt data to expert activation patterns layer by layer. During inference, this predictor forecasts which experts will be needed across all layers for an incoming request, allowing all necessary experts to be preloaded in a single batch I/O operation rather than sequentially — eliminating the serialization between expert selection and expert execution that otherwise becomes a dominant latency bottleneck.

Harbin Institute of Technology’s dual-prediction system pairs a layer-level predictor with a token-level predictor, driving an I/O scheduler that prefetches experts predictively while maintaining an LRU-based cache. The two-stage architecture mirrors the ProMoE approach of predicting layer i+2 experts while computing layer i, overlapping PCIe transfer with computation. Shanghai Jiao Tong University’s speculative offload-based system uses a lightweight quantized draft model to predict target model expert activations, then asynchronously prefetches the corresponding full-precision expert weights — achieving up to 5.9× inference speedup over naive offloading and 1.8× improvement over fine-grained offloading frameworks on models such as Phi-MoE. Research from IEEE on memory-bound neural network inference corroborates that I/O latency, not raw compute, is the dominant bottleneck in sparse model deployment on resource-constrained hardware.
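The interaction between a predictor-driven prefetcher and an LRU expert cache can be sketched as below. This is a simplified, synchronous model: a real system would issue the transfers asynchronously so that they overlap with computation. The `ExpertCache` class, the `(layer, expert)` keys and the loader function are hypothetical stand-ins, not structures described in the filings.

```python
from collections import OrderedDict


class ExpertCache:
    """LRU cache of expert weights held in fast memory."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn      # hypothetical slow loader, e.g. a host-to-device copy
        self.store = OrderedDict()  # oldest entry first, most recently used last
        self.hits = 0
        self.misses = 0

    def _insert(self, key):
        self.store[key] = self.load_fn(key)
        while len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict the least recently used entry

    def prefetch(self, keys):
        """Load predicted experts ahead of use; not counted as hit or miss."""
        for key in keys:
            if key in self.store:
                self.store.move_to_end(key)
            else:
                self._insert(key)

    def get(self, key):
        """Return expert weights, loading them on demand if the prediction missed."""
        if key in self.store:
            self.hits += 1
            self.store.move_to_end(key)
        else:
            self.misses += 1
            self._insert(key)
        return self.store[key]


loads = []


def load(key):
    loads.append(key)
    return f"weights{key}"


cache = ExpertCache(capacity=2, load_fn=load)
# Assumed predictor output for layer 2, issued while an earlier layer is computing.
cache.prefetch([(2, 5), (2, 7)])
cache.get((2, 5))  # hit: already prefetched
cache.get((2, 1))  # miss: the prediction was wrong, so (2, 7) is evicted
print(cache.hits, cache.misses, list(cache.store))
```

Each correct prediction turns what would have been an on-demand load on the critical path into a cache hit, which is the effect the prefetching filings aim for.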


Peking University’s 2024 CN filing adds a sensitivity-based adaptive expert gating mechanism that dynamically adjusts how many experts are activated per layer per input, combined with adaptive expert prefetching and caching. The result is an average 25% reduction in expert activations across inference runs without accuracy degradation — a meaningful compute saving particularly in memory-limited edge deployments.

Quantization provides an orthogonal compression axis. Xi’an Jiaotong University’s optimization method for edge computing environments proposes mixed-precision quantization that assigns high-precision formats (BF16) to frequently activated experts and lower-precision formats (INT4 or INT8) to rarely activated ones. This exploits the non-uniform activation frequency distribution of MoE expert pools — a structural property that uniform quantization strategies fail to leverage. Muxi Integrated Circuit’s 2026 CN filing similarly implements nested weight quantization followed by dynamic bit-width selection at inference time, adding bit-width-aware reordering of active sub-models to further improve execution efficiency.
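A small sketch shows how frequency-based precision assignment changes the weight footprint. The activation-share thresholds, the expert size and the activation counts are assumptions chosen for illustration; they are not values from the Xi’an Jiaotong or Muxi filings.

```python
BITS = {"bf16": 16, "int8": 8, "int4": 4}


def assign_precision(freqs, hot_threshold=0.20, warm_threshold=0.05):
    """Map each expert to a format by its share of activations (assumed thresholds)."""
    total = sum(freqs.values())
    plan = {}
    for expert, count in freqs.items():
        share = count / total
        if share >= hot_threshold:
            plan[expert] = "bf16"
        elif share >= warm_threshold:
            plan[expert] = "int8"
        else:
            plan[expert] = "int4"
    return plan


def footprint_bytes(plan, params_per_expert):
    """Total weight storage for a precision plan, ignoring quantization metadata."""
    return sum(params_per_expert * BITS[fmt] // 8 for fmt in plan.values())


# Hypothetical activation counts with the skew typical of MoE expert pools.
freqs = {"e0": 500, "e1": 300, "e2": 120, "e3": 60, "e4": 15, "e5": 5}
plan = assign_precision(freqs)
mixed = footprint_bytes(plan, params_per_expert=1_000_000)
uniform_bf16 = footprint_bytes({e: "bf16" for e in freqs}, params_per_expert=1_000_000)
print(plan)
print(mixed, uniform_bf16)
```

In this toy pool the mixed plan stores 7 MB of weights against 12 MB for uniform BF16, while the two experts that handle 80% of activations keep full BF16 precision.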

Key finding: mixed-precision quantization exploits MoE’s activation asymmetry

MoE expert pools have non-uniform activation frequency distributions — some experts are selected far more often than others. Assigning BF16 precision to frequently activated experts and INT4/INT8 to rarely activated ones provides better accuracy-per-bit than uniform quantization, as established in Xi’an Jiaotong University’s 2025 patent filing on MoE deployment in edge computing environments.

IBM’s 2025 CN filing describes a 3D processing-in-memory accelerator for MoE inference in which expert sub-models are mapped to layers of in-memory compute units, and hash-based routing functions select which memory layers to activate. This architecture eliminates the traditional DRAM bandwidth bottleneck by performing computation directly within memory, enabling fast and energy-efficient inference on models with billions of parameters — a direction aligned with broader semiconductor research trends documented by Nature on in-memory computing for AI workloads.

Figure 2 — MoE inference optimization techniques and their primary cost-reduction targets
Technique (assignees) | Cost target | Key metric
Sparse top-K routing (Microsoft, Qualcomm, HE SHA) | Per-token FLOPs | 3.6× fewer active params
Expert prefetching & caching (UESTC, Harbin IT, Shanghai Jiao Tong) | I/O latency | Up to 5.9× speedup
Tiered hot/cold buffering (Intel Corporation) | Memory bandwidth | Frequency-aware tiering
Mixed-precision quantization (Xi’an Jiaotong, Muxi IC) | Model weight size | BF16 / INT4 per frequency
Adaptive gating (Peking Univ.) | Expert activation count | 25% fewer activations
Five optimization techniques target different cost bottlenecks in MoE inference, from per-token FLOPs (sparse routing) to I/O latency (prefetching) to memory bandwidth (tiered buffering), as documented across patent filings from Intel, Microsoft, Qualcomm, and Chinese research universities including UESTC.

Distributed execution and parallelism at frontier scale

At frontier scale — models such as DeepSeek-R1 with 671 billion parameters — no single accelerator can hold all expert weights in GPU memory, making distributed expert parallelism essential. The MoE routing mechanism introduces a distinctive communication pattern: tokens must be dispatched across GPU boundaries to reach the GPU holding the selected expert, and results must be gathered back. This all-to-all communication is a major source of overhead in distributed MoE inference, and multiple patent filings from 2024–2025 address it directly.


The University of Science and Technology of China’s 2025 CN patent addresses all-to-all overhead by decoupling GPU computation from collective communication inherent in expert parallelism. Token-data communication proceeds asynchronously and in parallel with model computation, masking communication latency and eliminating synchronization wait overheads. Additionally, “hot” experts (frequently selected) are co-located on GPU memory while “cold” experts (rarely selected) are offloaded to CPU, enabling larger batch sizes and improved GPU utilization.

Hangzhou Dianzi University’s communication optimization patent decomposes the global AllGather communication used in sequence-parallel MoE training into per-expert AllGather operations, then pipelines each expert’s AllGather with the preceding expert’s computation. Since token computations for different experts are independent, this overlap is mathematically valid and significantly reduces idle compute time. A related patent from the same institution uses a load model built from sampled token features, expert parameters, and memory/compute resource data to drive expert scheduling that minimises all-to-all communication delay and dynamically adjusts expert capacity values.
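The benefit of pipelining per-expert communication with the preceding expert's computation can be estimated with a simple timing model. The sketch assumes one communication channel and one compute stream, with each expert's computation starting only after its own transfer has finished; the durations are arbitrary illustrative units, not measurements from any filing.

```python
def serial_time(comm, comp):
    """No overlap: every transfer and every computation runs back to back."""
    return sum(comm) + sum(comp)


def pipelined_time(comm, comp):
    """Overlap expert i+1's transfer with expert i's computation."""
    comm_done = 0.0  # time at which the latest transfer finishes
    comp_done = 0.0  # time at which the latest computation finishes
    for c, p in zip(comm, comp):
        comm_done += c
        # Compute waits for both the previous computation and this expert's data.
        comp_done = max(comp_done, comm_done) + p
    return comp_done


# Hypothetical per-expert durations for four experts.
comm = [2.0, 2.0, 2.0, 2.0]
comp = [3.0, 3.0, 3.0, 3.0]

serial = serial_time(comm, comp)
pipelined = pipelined_time(comm, comp)
print(serial, pipelined)
```

With these numbers only the first transfer remains exposed: the pipelined schedule takes 14 units against 20 for the serial one. When transfers are slower than computation, the model shows communication becoming the limiting stage instead.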

Intel’s edge deployment patent targets distributed edge scenarios, launching selected expert models across multiple edge nodes with dynamic instance scaling based on service level agreements or trends detected in the input data. Pre- and post-processing are also distributed, with different versions of input data adapted to each expert model’s input format — extending MoE cost-reduction benefits from cloud data centres to heterogeneous edge environments. The importance of such edge-to-cloud deployment flexibility is highlighted in OECD analysis on AI infrastructure costs and the economics of model deployment at scale.

Microsoft Technology Licensing’s switchable parallel modes patent solves the parallelism-mode switching problem: the MoE layer can toggle between data parallel and expert-data-model parallel modes without transferring expert sub-model weights between processing devices, reducing the overhead of adapting to varying batch size and throughput conditions at runtime.


Continual learning, model conversion, and fine-tuning under MoE

Beyond inference optimization, several disclosures target the cost of expanding or adapting existing MoE models — reflecting a maturing understanding that inference cost reduction must also encompass the lifecycle costs of updating and fine-tuning deployed models.

Google LLC’s 2025 CN filing on lifelong pre-training of MoE neural networks describes a continual training framework in which new expert sub-networks are added to an existing MoE model while previously trained expert parameters are frozen. This partial expansion avoids full model retraining when incorporating new data distributions, saves compute resources relative to training from scratch, and allows the model to be deployed on edge devices with only small incremental updates when new capabilities are required.

Samsung SDS holds both Korean and US family filings on dense-to-MoE model conversion. The method extracts the feed-forward network (FFN) from each transformer layer of a dense pre-trained model and replaces it with an MoE block, then optimises the MoE block weights to match the original FFN output on training data. This approach recovers the inference cost savings of MoE without requiring training from scratch, making adoption viable for organisations with large existing dense model investments. It reflects an industry trend toward retrofitting MoE cost savings onto existing dense model assets, a pattern also noted in research published on arXiv on post-hoc MoE conversion techniques.


Baidu’s 2025 CN MoE fine-tuning patent applies Low-Rank Adaptation (LoRA) consistently to all linear layers of both expert modules and shared non-MoE modules, using differentiated low-bit quantization for each module type to reduce memory footprint during fine-tuning. The distributed training framework supports expert parallelism, allowing fine-tuning of very large MoE models within feasible memory budgets while maintaining model performance.
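For readers unfamiliar with LoRA, the sketch below shows the standard low-rank update applied to a single linear layer: the frozen weight W is left untouched and only two small matrices A and B are trained. It illustrates the general LoRA formulation only; the differentiated quantization and expert parallelism described in the Baidu filing are not modelled, and the matrices are toy values.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """Compute y = W x + (alpha / r) * B (A x), with W frozen and A, B trainable."""
    r = len(A)  # LoRA rank = number of rows in A
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]


def trainable_params(d_out, d_in, r):
    """Parameters in the adapter: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


# Toy 2x2 layer with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B = [[0.5], [0.0]]
y = lora_forward(W, A, B, x=[2.0, 3.0], alpha=2.0)
print(y)
print(trainable_params(4096, 4096, 8), 4096 * 4096)
```

For a 4096 × 4096 layer, a rank-8 adapter trains 65,536 parameters instead of roughly 16.8 million, which is why applying LoRA across many expert layers remains feasible within a fixed memory budget.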

Robert Bosch’s 2026 US filing provides a systematic methodology for determining which layers of an existing neural network are best replaced with MoE blocks: candidate architectures with surrogate MoE layers at different positions are trained and validated against ground truth, and the configuration yielding the best accuracy is selected as the optimal insertion point. This layer-selection procedure maximises the accuracy-per-FLOP benefit of sparse MoE gating, making it a practical tool for engineers integrating MoE into existing production architectures. PatSnap’s Eureka platform tracks the full family of such conversion and fine-tuning patents across all major jurisdictions.

Key players and innovation trends across 30+ patents

Based on the frequency and technical depth of relevant disclosures across the 30+ patent filings analysed, the leading innovators in MoE inference cost reduction fall into three clusters: US technology companies, Chinese research universities, and Korean industrial groups.

US technology companies

Intel Corporation holds two highly detailed US filings covering distributed edge deployment and hardware-level buffer management for expert weight access. Microsoft Technology Licensing, LLC holds three US filings addressing gating dynamics, sparse dispatch/collect, and parallelism mode switching — the most comprehensive single-assignee portfolio in the dataset for gating and parallelism. Google LLC (via a CN filing) focuses on continual learning and incremental model expansion, reflecting interest in long-term maintenance cost reduction. Qualcomm holds both Chinese and Japanese family members covering per-sample routing, early exit, and ensemble integration.

Chinese research universities

Chinese research universities collectively represent the most active national cluster in MoE inference optimization in this dataset. Shanghai Jiao Tong University holds two filings on speculative offloading and expert access prediction. Hangzhou Dianzi University holds two filings on load balancing and communication optimization. Peking University, Harbin Institute of Technology, UESTC, and the University of Science and Technology of China each contribute filings with strong focus on memory-constrained and distributed deployment scenarios.

Korean industrial groups and emerging Chinese industry

Samsung SDS holds both Korean and US family filings on dense-to-MoE model conversion. Baidu and Muxi Integrated Circuit contribute filings on quantized LoRA fine-tuning and dynamic scheduling respectively, reflecting the growing presence of Chinese AI companies in the MoE inference space. IBM and Robert Bosch represent enterprise and industrial perspectives on in-memory computing and systematic MoE layer insertion.

A clear trend across the dataset is the increasing focus on inference-time optimization — prefetching, caching, quantization, async parallelism — rather than solely training-time efficiency. This reflects the maturation of MoE from a training-efficiency tool into a deployment-efficiency infrastructure, a transition that mirrors the broader shift in AI research priorities documented by organisations such as WIPO in their annual Global Innovation Index reports on AI patent activity.

Figure 3 — MoE patent filings by assignee type and technical focus area
[Bar chart: number of filings per assignee. Microsoft 3; Intel 2; Qualcomm 2; Samsung SDS 2; Hangzhou Dianzi University 2; Shanghai Jiao Tong University 2; Google LLC 1.]
Microsoft Technology Licensing leads with 3 US filings on gating, sparse dispatch, and parallelism modes; Intel, Qualcomm, Samsung SDS, Hangzhou Dianzi University, and Shanghai Jiao Tong University each hold 2 filings — reflecting both US corporate and Chinese academic leadership in MoE inference patent activity.

“A clear trend across the dataset is the increasing focus on inference-time optimization rather than solely training-time efficiency — reflecting the maturation of MoE from a training-efficiency tool into a deployment-efficiency infrastructure.”


References

  1. Operating Mixture-of-Experts (MoE) Architectures Using Neural Processing Units — HE, SHA (WO, 2026)
  2. A Mixed Expert Model Inference Method — Peking University (CN, 2024)
  3. Mixture-of-Experts Layer with Dynamic Gating — Microsoft Technology Licensing, LLC (US, 2024)
  4. Sparse Encoding and Decoding at Mixture-of-Experts Layer — Microsoft Technology Licensing, LLC (US, 2024)
  5. Mixture-of-Experts Layer with Switchable Parallel Modes — Microsoft Technology Licensing, LLC (US, 2024)
  6. Machine Learning Model Architecture Combining Mixture of Experts and Model Ensembling — Qualcomm (JP, 2025)
  7. Machine Learning Model Architecture Combining Mixture of Experts and Model Ensembling — Qualcomm (CN, 2024)
  8. Methods and Apparatus for MoE Inference with Full and Partial Hot Expert Buffers — Intel Corporation (US, 2025)
  9. Edge Deployment of a Mixture of Experts (MoE) Architecture — Intel Corporation (US, 2025)
  10. An MoE Large Language Model Inference Acceleration Method for Resource-Constrained Devices — UESTC (CN, 2026)
  11. MoE Large Model Inference Optimization System and Method for Memory-Constrained Devices Based on Dual Prediction — Harbin Institute of Technology (CN, 2025)
  12. A Speculative and Offload-Based MoE Large Model Inference System — Shanghai Jiao Tong University (CN, 2026)
  13. Expert Access Prediction Method and System for MoE Large Language Models — Shanghai Jiao Tong University (CN, 2025)
  14. An Optimization Method for Deploying MoE Models in Edge Computing Environments — Xi’an Jiaotong University (CN, 2025)
  15. A Dynamic Scheduling Inference Method Based on MoE Models — Muxi Integrated Circuit (Shanghai) Co., Ltd. (CN, 2026)
  16. Function-Based Memory Hierarchy Activation — IBM (CN, 2025)
  17. Asynchronous Parallel Inference Method for MoE Models — University of Science and Technology of China (CN, 2025)
  18. A Communication Optimization Method for MoE Models Based on Expert Load Prediction and Sequence Parallelism — Hangzhou Dianzi University (CN, 2025)
  19. A Dynamic Load Balancing Method for Distributed Training of MoE Models — Hangzhou Dianzi University (CN, 2024)
  20. Lifelong Pre-Training of MoE Neural Networks — Google LLC (CN, 2025)
  21. Method for Converting Trained Language Model into Language Model Having MoE Architecture — Samsung SDS Co., Ltd. (US, 2025)
  22. Method for Converting a Trained Language Model into a Language Model Having a Structure of Mixture of Experts — Samsung SDS Co., Ltd. (KR, 2025)
  23. WIPO — World Intellectual Property Organization: Global Innovation Index and AI Patent Trends
  24. IEEE — Institute of Electrical and Electronics Engineers: Research on memory-bound neural network inference
  25. Nature — In-memory computing for AI workloads
  26. OECD — AI infrastructure costs and the economics of model deployment at scale
  27. arXiv — Post-hoc MoE conversion techniques for dense language models

All data and statistics in this article are sourced from the references above and from PatSnap’s proprietary innovation intelligence platform.
