The sparse gating bargain: more parameters, fewer FLOPs per token
A Mixture-of-Experts (MoE) architecture reduces inference cost through conditional computation: instead of routing every input token through every weight in the network, a lightweight gating function selects only a small subset of specialist sub-networks, called experts, to process each token. As a result, total model capacity can scale into the hundreds of billions of parameters while per-token compute remains bounded. A 2026 WO filing from HE, SHA describes the trade-off this way: “sparse MoE models can vastly expand their number of parameters and improve performance, while keeping the computation costs” fixed.
A concrete illustration of this arithmetic appears in a 2024 CN filing from Peking University: in the Mixtral-8x7B model, each token can access 47 billion parameters in total, yet only 13 billion participate in computation during any single inference iteration. That roughly 3.6× gap between total and active parameters (47 / 13 ≈ 3.6) is the essential promise of top-K routing: controlling FLOPs without sacrificing model capacity.
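The arithmetic behind the Mixtral-8x7B figures can be checked in a few lines. This is only a sketch: the function names are hypothetical, and parameter counts are treated as a rough proxy for per-token FLOPs, which is an assumption rather than something the filing states.

```python
# Figures quoted in the text for Mixtral-8x7B, in billions of parameters.
TOTAL_PARAMS_B = 47.0
ACTIVE_PARAMS_B = 13.0


def active_fraction(total: float, active: float) -> float:
    """Share of all parameters that take part in processing one token."""
    return active / total


def capacity_to_compute_ratio(total: float, active: float) -> float:
    """How many times larger the full model is than the part that runs per token."""
    return total / active


if __name__ == "__main__":
    frac = active_fraction(TOTAL_PARAMS_B, ACTIVE_PARAMS_B)
    ratio = capacity_to_compute_ratio(TOTAL_PARAMS_B, ACTIVE_PARAMS_B)
    print(f"active fraction: {frac:.1%}, total/active ratio: {ratio:.2f}x")
```

Roughly 28% of the parameters are active for any one token, which is where the 3.6× figure comes from.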
Top-K routing is the gating mechanism that selects exactly K expert sub-networks to process each input token. The router produces a sparse encoding identifying destination experts; tokens are dispatched only to those experts; outputs are computed; and a sparse decoding step recombines results — all while bypassing computation at non-selected experts. Microsoft’s 2024 US patent “Sparse Encoding and Decoding at Mixture-of-Experts Layer” formalises this dispatch-and-collect process.
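The dispatch-and-collect flow described above can be sketched in plain Python. This is a minimal illustration, not the patented method: `top_k_route` and `moe_layer` are hypothetical names, experts are modelled as ordinary functions that preserve vector length, and renormalising the gate weights over the selected experts is an assumption.

```python
import math
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Expert = Callable[[Vector], Vector]


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(logits: List[float], k: int) -> List[Tuple[int, float]]:
    """Sparse encoding: (expert index, gate weight) for the k highest-scoring experts."""
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)  # assumption: renormalise over the top k
    return [(i, probs[i] / norm) for i in chosen]


def moe_layer(tokens: List[Vector], router_logits: List[List[float]],
              experts: List[Expert], k: int) -> List[Vector]:
    # Dispatch: group (token position, gate weight) pairs by destination expert.
    dispatch: Dict[int, List[Tuple[int, float]]] = {}
    for t, logits in enumerate(router_logits):
        for e, w in top_k_route(logits, k):
            dispatch.setdefault(e, []).append((t, w))

    # Compute and collect: only experts that received tokens are ever called.
    outputs = [[0.0] * len(tok) for tok in tokens]
    for e, assignments in dispatch.items():
        for t, w in assignments:
            y = experts[e](tokens[t])
            for d, value in enumerate(y):
                outputs[t][d] += w * value  # weighted recombination (sparse decoding)
    return outputs
```

Experts that no token selects never appear in `dispatch`, so their weights are not touched, which is the source of the per-token savings.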
The gating mechanism need not be static. Microsoft’s 2024 US patent “Mixture-of-Experts Layer with Dynamic Gating” describes a system in which the number K of expert sub-models selected per token is allowed to differ across iterations, enabling the model to adaptively calibrate compute expenditure based on input complexity rather than using a fixed top-K throughout inference. Qualcomm’s 2025 JP and 2024 CN filings extend this further by combining a shared base model with a routing model and an ensemble integration step, enabling per-sample rather than per-class routing. Early sample-to-expert routing can bypass full base model execution entirely, and confident samples can exit classification before any expert involvement — compounding the per-token savings.
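One way to picture a variable K is a rule that keeps adding experts until the router's cumulative probability reaches a target. The sketch below uses that rule purely as an illustrative assumption; the filings are not described here in enough detail to say how they choose K, and `dynamic_k`, `mass`, `k_min`, and `k_max` are hypothetical names.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_k(logits: List[float], mass: float = 0.8,
              k_min: int = 1, k_max: int = 4) -> List[int]:
    """Pick the fewest experts whose combined router probability reaches `mass`.

    Confident tokens (one dominant expert) get a small K; ambiguous tokens
    get more experts, up to k_max.
    """
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen: List[int] = []
    covered = 0.0
    for i in order[:k_max]:
        chosen.append(i)
        covered += probs[i]
        if len(chosen) >= k_min and covered >= mass:
            break
    return chosen
```

A token whose router output is sharply peaked would use a single expert, while a token with a flat distribution would use up to `k_max`, so compute tracks how uncertain the router is.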
Hardware-level optimisation: prefetching, caching, and quantization for memory-constrained deployment
Sparse activation creates a secondary engineering problem: although only a fraction of experts are active per token, the full set of expert weights must reside somewhere accessible, and loading them on-demand introduces significant I/O latency. The patent literature addresses this bottleneck through three complementary strategies — tiered buffer architectures, predictive prefetching, and mixed-precision quantization.
Tiered hot/cold expert buffering
Intel Corporation’s 2025 US patent “Methods and Apparatus for MoE Inference with Full and Partial Hot Expert Buffers” introduces a two-tier buffer architecture. A “full hot expert buffer” stores complete weights of frequently used experts for direct computation, while a “partial hot expert buffer” stores partial weights of moderately used experts for partially direct computation. Less frequently used experts remain in lower-cost storage and are loaded only when selected. This frequency-aware tiering avoids unnecessary high-bandwidth memory consumption while ensuring the most latency-sensitive experts are always ready. IBM’s 2025 CN filing “Function-Based Memory Hierarchy Activation” takes this further with a 3D processing-in-memory accelerator that maps expert sub-models to layers of in-memory compute units and uses hash-based routing functions to select which memory layers to activate — eliminating the traditional DRAM bandwidth bottleneck by performing computation directly within memory.
Intel Corporation’s MoE hot expert buffer architecture uses a two-tier system: a full hot expert buffer for frequently used experts and a partial hot expert buffer for moderately used experts, with rarely used experts stored in lower-cost memory — reducing effective memory bandwidth requirements per MoE inference call.
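The frequency-aware tiering can be sketched as a ranking problem. The code below is an assumption-laden illustration: it assigns tiers by rank with a fixed number of slots per tier, whereas the patent's actual criteria are not given in this article, and `assign_tiers`, `full_slots`, and `partial_slots` are hypothetical names.

```python
from collections import Counter
from typing import Dict, Iterable


def assign_tiers(activation_log: Iterable[int], num_experts: int,
                 full_slots: int, partial_slots: int) -> Dict[int, str]:
    """Map each expert id to 'full_hot', 'partial_hot', or 'cold' by usage rank."""
    counts = Counter(activation_log)
    # Most frequently activated first; ties broken by expert id for determinism.
    ranked = sorted(range(num_experts), key=lambda e: (-counts[e], e))
    tiers: Dict[int, str] = {}
    for rank, expert in enumerate(ranked):
        if rank < full_slots:
            tiers[expert] = "full_hot"      # complete weights kept in fast memory
        elif rank < full_slots + partial_slots:
            tiers[expert] = "partial_hot"   # only part of the weights kept in fast memory
        else:
            tiers[expert] = "cold"          # loaded from lower-cost storage when selected
    return tiers
```

In practice the activation log would come from profiling real traffic, and the tier assignment would be refreshed as usage patterns drift.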
Predictive prefetching to hide I/O latency
UESTC’s 2026 CN patent proposes training a dedicated expert activation prediction model that learns the mapping from prompt data to expert activation patterns layer by layer. During inference, this predictor forecasts which experts will be needed across all layers for an incoming request, allowing all necessary experts to be preloaded in a single batch I/O operation rather than sequentially — eliminating the serialization between expert selection and expert execution that otherwise becomes a dominant latency bottleneck. Harbin Institute of Technology’s 2025 CN patent pairs a layer-level predictor with a token-level predictor and an LRU-managed cache, mirroring the ProMoE approach of predicting layer i+2 experts while computing layer i to overlap PCIe transfer with computation.
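The lookahead idea (fetch experts for layer i+2 while layer i is computing) combined with an LRU-managed cache can be sketched as follows. This is a simplified, synchronous model under stated assumptions: a real system would issue the prefetch asynchronously so the transfer overlaps with compute, the predictor is replaced by a precomputed `predicted` table, and all class and function names are hypothetical.

```python
from collections import OrderedDict
from typing import Callable, Dict, List, Tuple

ExpertKey = Tuple[int, int]  # (layer index, expert id)


class LRUExpertCache:
    def __init__(self, capacity: int, load_fn: Callable[[ExpertKey], object]):
        self.capacity = capacity
        self.load_fn = load_fn  # stands in for a slow host-to-device copy
        self._store: "OrderedDict[ExpertKey, object]" = OrderedDict()
        self.misses = 0  # loads that land on the critical path

    def _insert(self, key: ExpertKey) -> None:
        self._store[key] = self.load_fn(key)
        while len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict the least recently used entry

    def prefetch(self, key: ExpertKey) -> None:
        """Load ahead of use; does not count as a miss."""
        if key in self._store:
            self._store.move_to_end(key)
        else:
            self._insert(key)

    def get(self, key: ExpertKey) -> object:
        if key not in self._store:
            self.misses += 1
            self._insert(key)
        self._store.move_to_end(key)
        return self._store[key]


def run_with_lookahead(num_layers: int, actual: Dict[int, List[int]],
                       predicted: Dict[int, List[int]],
                       cache: LRUExpertCache, lookahead: int = 2) -> int:
    """Return how many expert loads stalled compute across one forward pass."""
    for layer in range(num_layers):
        target = layer + lookahead
        if target < num_layers:
            for expert in predicted.get(target, []):
                cache.prefetch((target, expert))
        for expert in actual[layer]:
            cache.get((layer, expert))
    return cache.misses
```

With a perfect predictor only the first `lookahead` layers stall on loads; every misprediction turns into a miss on the critical path, which is why predictor accuracy dominates the benefit.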
Shanghai Jiao Tong University’s 2026 CN patent takes a speculative approach: a lightweight quantized draft model predicts target model expert activations, and corresponding full-precision expert weights are asynchronously prefetched. This system achieves up to 5.9× inference speedup over naive offloading and 1.8× improvement over fine-grained offloading frameworks on models such as Phi-MoE. Peking University’s 2024 CN patent adds a sensitivity-based adaptive expert gating mechanism that dynamically adjusts how many experts are activated per layer per input, achieving an average 25% reduction in expert activations across inference runs without accuracy degradation.
Mixed-precision quantization tuned to activation frequency
Quantization provides an orthogonal compression axis. Xi’an Jiaotong University’s 2025 CN patent proposes mixed-precision quantization that assigns high-precision formats (BF16) to frequently activated experts and lower-precision formats (INT4 or INT8) to rarely activated ones — acknowledging that uniform quantization strategies fail to exploit the non-uniform activation frequency distribution of MoE expert pools. Muxi Integrated Circuit’s 2026 CN patent similarly implements nested weight quantization followed by dynamic bit-width selection at inference time, adding bit-width-aware reordering of active sub-models to improve execution efficiency. According to IEEE research on quantization-aware inference, non-uniform precision assignment consistently outperforms uniform quantization when activation distributions are skewed — precisely the condition that MoE’s top-K routing creates.
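A frequency-driven precision plan and its memory effect can be sketched as below. The thresholds (`hot`, `warm`) and function names are hypothetical choices made for illustration; the filings are not quoted here as specifying particular cut-offs.

```python
from typing import Dict

BITS_PER_WEIGHT = {"BF16": 16, "INT8": 8, "INT4": 4}


def assign_precision(activation_freq: Dict[int, float],
                     hot: float = 0.20, warm: float = 0.05) -> Dict[int, str]:
    """Give frequently activated experts more bits and rarely activated ones fewer."""
    plan: Dict[int, str] = {}
    for expert, freq in activation_freq.items():
        if freq >= hot:
            plan[expert] = "BF16"
        elif freq >= warm:
            plan[expert] = "INT8"
        else:
            plan[expert] = "INT4"
    return plan


def footprint_gib(plan: Dict[int, str], params_per_expert: int) -> float:
    """Total expert weight storage in GiB, ignoring quantization metadata such as scales."""
    total_bits = sum(BITS_PER_WEIGHT[fmt] * params_per_expert for fmt in plan.values())
    return total_bits / 8 / 2**30
```

For a skewed activation distribution, most of the pool ends up in the low-precision tiers, so the footprint falls well below a uniform BF16 layout while the busiest experts keep full precision.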
Distributed execution and parallelism at frontier scale: masking the all-to-all overhead
At frontier scale — models such as DeepSeek-R1 with 671 billion parameters — no single accelerator can hold all expert weights in GPU memory, making distributed expert parallelism essential. The MoE routing mechanism introduces a distinctive communication pattern: tokens must be dispatched across GPU boundaries to reach the GPU holding the selected expert, and results must be gathered back. This all-to-all communication is a major source of overhead in distributed MoE inference, and patents from Chinese universities and Microsoft address it from multiple angles.
The University of Science and Technology of China’s 2025 CN patent “Asynchronous Parallel Inference Method for MoE Models” addresses this by decoupling GPU computation from the all-to-all collective communication inherent in expert parallelism. Token-data communication proceeds asynchronously and in parallel with model computation, masking communication latency and eliminating synchronization wait overheads. Additionally, “hot” experts are co-located on GPU memory while “cold” experts are offloaded to CPU, enabling larger batch sizes and improved GPU utilization.
Hangzhou Dianzi University’s 2025 CN patent decomposes the global AllGather communication used in sequence-parallel MoE training into per-expert AllGather operations, then pipelines each expert’s AllGather with the preceding expert’s computation. Since token computations for different experts are independent, this overlap is mathematically valid and significantly reduces idle compute time. A companion patent from the same institution uses a load model built from sampled token features, expert parameters, and memory/compute resource data to drive expert scheduling that minimises all-to-all communication delay and dynamically adjusts expert capacity values.
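The benefit of pipelining each expert's AllGather with the preceding expert's computation can be estimated with a simple timing model. This is a back-of-the-envelope sketch, assuming one communication channel and one compute stream that each process experts in order; the function names and the time units are hypothetical.

```python
from typing import List


def sequential_time(comm: List[float], comp: List[float]) -> float:
    """Global gather first, then all expert compute: nothing overlaps."""
    return sum(comm) + sum(comp)


def pipelined_time(comm: List[float], comp: List[float]) -> float:
    """Per-expert gathers run back to back while earlier experts compute.

    comm[i] is the gather time for expert i, comp[i] its compute time.
    """
    comm_done = 0.0  # when the current expert's tokens have arrived
    comp_done = 0.0  # when the compute stream becomes free
    for gather, work in zip(comm, comp):
        comm_done += gather
        # Compute starts once the data is in and the previous expert has finished.
        comp_done = max(comp_done, comm_done) + work
    return comp_done
```

When compute per expert exceeds gather time, only the first gather is exposed; when communication dominates, the pipeline hides compute instead, and the total approaches the sum of the transfers.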
Microsoft’s 2024 US patent “Mixture-of-Experts Layer with Switchable Parallel Modes” solves the parallelism-mode switching problem: the MoE layer can toggle between data parallel and expert-data-model parallel modes without transferring expert sub-model weights between processing devices, reducing the overhead of adapting to varying batch size and throughput conditions at runtime. Intel’s 2025 US patent “Edge Deployment of a Mixture of Experts Architecture” extends distributed MoE to heterogeneous edge environments, launching selected expert models across multiple edge nodes with dynamic instance scaling based on service level agreements or trends detected in input data. As noted by WIPO in its analysis of AI patent trends, distributed inference architectures have become one of the fastest-growing sub-categories in AI hardware and systems filings.
The University of Science and Technology of China’s asynchronous parallel inference patent demonstrates that decoupling token dispatch communication from GPU computation is the critical design move for maintaining high GPU utilisation in distributed MoE deployments — hot experts remain on GPU memory while cold experts are offloaded to CPU, enabling larger effective batch sizes without stalling compute pipelines.
Converting and fine-tuning existing models for MoE cost savings without training from scratch
MoE inference cost reduction does not require building a model from scratch — a growing body of patents addresses how to retrofit MoE savings onto existing dense model assets, or how to expand deployed MoE models incrementally.
Samsung SDS’s 2025 US and KR patents describe converting a dense pre-trained language model into an MoE architecture by extracting the feed-forward network (FFN) from each transformer layer, replacing it with an MoE block, and then optimising the MoE block weights to match the original FFN output on training data. This approach recovers the inference cost savings of MoE without requiring training from scratch, a significant advantage for organisations with large existing dense model investments. According to PatSnap’s innovation intelligence research, dense-to-sparse model conversion is an emerging category in the broader AI efficiency patent landscape.
Baidu’s 2025 CN patent “MoE Fine-Tuning Method” applies Low-Rank Adaptation (LoRA) consistently to all linear layers of both expert modules and shared non-MoE modules, using differentiated low-bit quantization for each module type to reduce memory footprint during fine-tuning. The distributed training framework supports expert parallelism, allowing fine-tuning of very large MoE models within feasible memory budgets while maintaining model performance.
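The memory argument for LoRA follows from its standard formulation: a frozen weight matrix W is augmented with a low-rank update scaled by alpha / r, so only the two small factors are trained. The sketch below shows that formulation in plain Python; it does not model the patent's quantization scheme or expert parallelism, and the helper names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: Vector, alpha: float) -> Vector:
    """y = W x + (alpha / r) * B (A x), with W frozen and only A, B trainable.

    Shapes: W is d_out x d_in, A is r x d_in, B is d_out x r.
    """
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in the two low-rank factors for one linear layer."""
    return r * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    return d_in * d_out
```

For a linear layer with a few thousand inputs and outputs and a rank in the tens, the trainable factors amount to well under 1% of the layer's weights, which is what makes applying the adapter to every linear layer affordable.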
Google LLC’s 2025 CN patent “Lifelong Pre-Training of MoE Neural Networks” describes a continual training framework in which new expert sub-networks are added to an existing MoE model while previously trained expert parameters are frozen. This partial expansion avoids full model retraining when incorporating new data distributions, saves compute resources relative to training from scratch, and allows the model to be deployed on edge devices with only small incremental updates when new capabilities are required.
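The freeze-and-expand pattern can be sketched with a toy model in which each expert is just a list of weights. This is an illustration under simplifying assumptions: the class and method names are hypothetical, the update rule is plain gradient descent, and the router, which would also need to grow to address the new experts, is left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Expert:
    name: str
    weights: List[float]
    trainable: bool = True


@dataclass
class MoEModel:
    experts: List[Expert] = field(default_factory=list)

    def expand(self, new_experts: List[Expert]) -> None:
        """Freeze every existing expert, then append the new trainable ones."""
        for expert in self.experts:
            expert.trainable = False
        self.experts.extend(new_experts)

    def sgd_step(self, grads: List[List[float]], lr: float) -> None:
        """Apply one gradient step, skipping frozen experts."""
        for expert, grad in zip(self.experts, grads):
            if not expert.trainable:
                continue
            expert.weights = [w - lr * g for w, g in zip(expert.weights, grad)]

    def trainable_fraction(self) -> float:
        total = sum(len(e.weights) for e in self.experts)
        live = sum(len(e.weights) for e in self.experts if e.trainable)
        return live / total
```

Because frozen experts never change, an update shipped to a deployed model only needs to carry the new experts' weights, which matches the small incremental updates described above.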
Robert Bosch’s 2026 US patent provides a systematic methodology for determining which layers of an existing neural network are best replaced with MoE blocks: candidate architectures with surrogate MoE layers at different positions are trained and validated against ground truth, and the configuration yielding the best accuracy is selected as the optimal insertion point. This layer-selection procedure maximises the accuracy-per-FLOP benefit of sparse MoE gating. Guidance from NIST on AI model evaluation frameworks increasingly emphasises this kind of systematic architecture search as a best practice for responsible AI deployment.
Who holds the MoE patent landscape: leading assignees and innovation trends
The MoE patent dataset of more than 30 filings spans the United States, China, Japan, South Korea, and the European Union, with a clear concentration of activity in inference-time optimisation rather than training-time efficiency — reflecting the maturation of MoE from a training-efficiency tool into a deployment-efficiency infrastructure.
Intel Corporation holds two detailed US filings covering distributed edge deployment and hardware-level buffer management for expert weight access. Microsoft Technology Licensing LLC holds three US filings addressing gating dynamics, sparse dispatch/collect, and parallelism mode switching. Qualcomm holds Chinese and Japanese family members covering per-sample routing, early exit, and ensemble integration. Google LLC focuses on continual learning and incremental model expansion via a CN filing. Samsung SDS holds both Korean and US family filings on dense-to-MoE model conversion.
Chinese research universities collectively represent the most active national cluster in MoE inference optimisation. Shanghai Jiao Tong University holds two filings on speculative offloading and expert access prediction. Hangzhou Dianzi University holds two filings on load balancing and communication optimisation. Peking University, Harbin Institute of Technology, UESTC, and the University of Science and Technology of China each contribute filings with strong focus on memory-constrained and distributed deployment scenarios. This concentration aligns with broader trends documented by WIPO, which has tracked a significant rise in Chinese AI infrastructure patent filings since 2022.
Across the dataset, the emphasis falls on inference-time techniques (prefetching, caching, quantization, and asynchronous parallelism) rather than on training-time efficiency alone. That shift has significant implications for IP strategy at organisations building or deploying frontier AI systems. The PatSnap innovation intelligence platform tracks these filing trends in real time across all major jurisdictions.