LLM Token Latency: Speculative Decoding — PatSnap Eureka

LLM Inference Intelligence

Reduce Token Generation Latency with Speculative Decoding & Continuous Batching

Autoregressive token generation is inherently sequential and compute-inefficient under high-throughput serving conditions. This analysis of 20+ patents from Microsoft, Google, Samsung, Qualcomm, and others reveals the state-of-the-art strategies for breaking the one-token-per-step bottleneck in large language model serving.

[Figure: LLM inference phases, prefill vs. decode. Prefill processes input prompts, is parallelizable, and keeps GPU utilization high (saturated). Decode generates one token per step, is sequential, and leaves the GPU underutilized. Two solutions are shown: speculative decoding (draft model proposes, target verifies) and continuous batching (insert requests mid-batch to maximize utilization). Source: PatSnap Eureka patent analysis, 20+ patents, 2024–2026.]
  • 20+: patents & disclosures analyzed (2024–2026)
  • 7+: major assignees, including Microsoft, Google, and Samsung
  • 71%: runtime reduction via streaming batched beam search (UC Berkeley)
  • 2: core architectural responses (speculative decoding & continuous batching)
Speculative Decoding

Breaking the One-Token-Per-Step Bottleneck

Speculative decoding reframes autoregressive generation as a cooperative inference problem. A fast draft model proposes candidate tokens; the target model verifies all of them in a single forward pass, so each step can commit multiple tokens. According to patent landscape analysis, six distinct architectural variants have emerged across 2024–2026 filings.
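The draft-then-verify cycle described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any assignee's implementation: both models are stood in for by hypothetical greedy `next-token` functions, verification uses exact-match acceptance, and the per-position loop only emulates what a real target model would score in one batched forward pass.

```python
from typing import Callable, List

Token = int
# Hypothetical interface: a function mapping a context to its greedy next token.
NextTokenFn = Callable[[List[Token]], Token]


def speculative_step(prefix: List[Token], draft_next: NextTokenFn,
                     target_next: NextTokenFn, k: int) -> List[Token]:
    """Return the tokens committed by one draft-then-verify cycle."""
    # 1. The draft model proposes k tokens autoregressively (cheap).
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft_next(prefix + proposed))

    # 2. The target model checks each proposed position. A real system scores
    #    all k positions in a single forward pass; this loop emulates that.
    accepted: List[Token] = []
    for tok in proposed:
        expected = target_next(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every draft token matched, so the target also contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


# Toy stand-ins (assumptions for illustration): the target counts upward;
# the draft agrees except when the last token is 3 modulo 4.
def target_next(ctx: List[Token]) -> Token:
    return ctx[-1] + 1


def draft_next(ctx: List[Token]) -> Token:
    return ctx[-1] + 1 if ctx[-1] % 4 != 3 else 0


print(speculative_step([0], draft_next, target_next, k=3))
print(speculative_step([0, 1, 2, 3], draft_next, target_next, k=3))
```

When the draft agrees with the target, one cycle commits k + 1 tokens; when the first draft token is wrong, the cycle still commits one token, so output never differs from what the target alone would produce under greedy decoding.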

Microsoft · 2025

Guidance-Augmented Speculative Decoding

The target model not only accepts or rejects draft tokens but also produces guidance vector information fed back to the draft model in subsequent cycles. This guidance informs the draft model of the embedding space used by the target model, tightening alignment between the two models and increasing the acceptance rate of proposed tokens.

Guidance vector feedback loop
Microsoft · 2025

Selective Speculative Decoding

The system dynamically decides, on a per-iteration basis, whether speculative decoding is beneficial. Output tokens are partitioned into two portions: one computed via speculative decoding using drafting models, and a second computed directly at the primary model. This avoids the overhead of draft model invocation when acceptance rates would be low.

Adaptive per-context switching
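One way to picture the per-iteration decision is a simple cost model. The page does not describe the criterion the patent uses, so the sketch below is an assumption: it applies the standard geometric expectation for accepted tokens under an independent per-token acceptance rate `alpha`, and the names `draft_cost` and `should_speculate` are hypothetical.

```python
def expected_tokens_per_cycle(alpha: float, k: int) -> float:
    """Expected tokens committed per cycle with k draft tokens,
    assuming each draft token is accepted independently with probability alpha."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def should_speculate(alpha: float, k: int, draft_cost: float) -> bool:
    """Speculate only if the expected speedup over plain decoding exceeds 1.

    draft_cost is the cost of one draft step relative to one target step
    (an assumed, measured quantity).
    """
    cycle_cost = k * draft_cost + 1.0  # k draft steps plus one target pass
    return expected_tokens_per_cycle(alpha, k) / cycle_cost > 1.0


print(should_speculate(alpha=0.8, k=4, draft_cost=0.1))  # high acceptance
print(should_speculate(alpha=0.1, k=4, draft_cost=0.3))  # low acceptance
```

Under this model, a low acceptance rate combined with a relatively expensive draft model makes speculation a net loss, which is the situation the selective approach is meant to avoid.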
Samsung · 2025

Self-Speculative Decoding

Eliminates the need for a separate draft model entirely. Specific transformer blocks, identified through parameter analysis, are compressed or removed from the base model, and the resulting reduced model is fine-tuned to serve as the draft head. The fine-tuned model shares weights with the target, reducing memory overhead and avoiding KV-cache duplication.

No separate draft model needed
Qualcomm · 2025

Beam-Search-Based Speculative Decoding

The draft model generates not a single candidate sequence but a set of token subsets, each corresponding to a beam in a beam search. These candidate sets are passed en masse to the target model for verification, allowing selection of the best sequence from a richer proposal distribution—particularly advantageous when greedy decoding misses high-probability token paths.

Richer beam-based proposals
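A rough sketch of verifying several draft beams at once follows. The selection rule used here (keep the beam with the longest prefix that the target accepts under greedy exact-match verification) is an assumption for illustration; the page does not state how the patent scores competing beams, and `target_next` is a hypothetical stand-in for the target model.

```python
from typing import Callable, List

Token = int


def verify_beams(prefix: List[Token], beams: List[List[Token]],
                 target_next: Callable[[List[Token]], Token]) -> List[Token]:
    """Return the longest target-accepted prefix among all draft beams."""
    best: List[Token] = []
    for beam in beams:
        accepted: List[Token] = []
        for tok in beam:
            # Stop at the first token the target would not have produced.
            if tok != target_next(prefix + accepted):
                break
            accepted.append(tok)
        if len(accepted) > len(best):
            best = accepted
    return best


# Toy target (assumption): always emits the previous token plus one.
beams = [[6, 9, 9], [6, 7, 8], [7, 8, 9]]
print(verify_beams([5], beams, lambda ctx: ctx[-1] + 1))
```

With a single greedy draft sequence equal to the first beam, only one token would be accepted here; the second beam lets the same verification pass commit three.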
Snowflake · 2026

Suffix-Based Hybrid Speculative Decoding

A dual-path architecture combines suffix-based speculative decoding—leveraging repetitive token patterns in agentic workloads by matching suffixes of the current context against prior generation history—with the conventional draft-model approach. The system adapts to workload type at runtime, accelerating both repetitive and novel token generation patterns.

Agentic workload optimized
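The suffix-matching path can be illustrated with a small, model-free drafter. This is a sketch under assumptions: the function names, the linear scan over history, and the `max_suffix` and `num_draft` parameters are all hypothetical, and a production system would likely use an indexed structure such as a suffix tree rather than a scan.

```python
from typing import Callable, List, Sequence

Token = int


def suffix_draft(context: Sequence[Token], history: Sequence[Token],
                 max_suffix: int = 4, num_draft: int = 3) -> List[Token]:
    """Propose tokens by finding the longest suffix of `context` in `history`
    and copying the tokens that followed it there."""
    for n in range(min(max_suffix, len(context)), 0, -1):
        suffix = list(context[-n:])
        # Scan from the most recent occurrence backwards; the start bound
        # guarantees at least one token follows the match.
        for start in range(len(history) - n - 1, -1, -1):
            if list(history[start:start + n]) == suffix:
                return list(history[start + n:start + n + num_draft])
    return []


def hybrid_draft(context: Sequence[Token], history: Sequence[Token],
                 model_draft: Callable[[Sequence[Token]], List[Token]]) -> List[Token]:
    """Use the suffix path when history matches, else fall back to a draft model."""
    return suffix_draft(context, history) or model_draft(context)


history = [1, 2, 3, 4, 5, 1, 2, 3, 9, 9]
print(suffix_draft([7, 1, 2, 3], history))
print(hybrid_draft([42], history, model_draft=lambda ctx: [0]))
```

Either path only produces proposals; the target model still verifies them, so a stale or wrong suffix match costs speed but not correctness.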
China North Vehicle Research Institute · 2024

Interrupt-Driven Concurrent Verification

The small language model runs in parallel with the large model, generating NUM candidate token segments and submitting them incrementally. The target model interrupts its own inference to perform probabilistic verification on each received segment, minimizing idle time on the target model through continuous pipeline utilization.

Concurrent pipeline execution
Continuous & Dynamic Batching

Maximizing GPU Utilization Across Heterogeneous Request Streams

The decode phase of LLM inference suffers from poor GPU compute utilization because generating one token per request per step severely underloads parallel hardware. Continuous batching addresses this by allowing new requests to be inserted into an active batch at any point, rather than waiting for all requests in a batch to complete. According to PatSnap's innovation platform, three distinct batching strategies have been patented across 2024–2025.
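The difference between waiting for a whole batch and refilling freed slots can be shown with a toy step counter. This is an illustrative simulation only, assuming each request needs a known number of decode steps (at least one) and that a batch holds at most `max_batch` requests; both function names are hypothetical.

```python
from collections import deque
from typing import List


def static_batching_steps(remaining: List[int], max_batch: int) -> int:
    """Each batch runs until its longest request finishes."""
    return sum(max(remaining[i:i + max_batch])
               for i in range(0, len(remaining), max_batch))


def continuous_batching_steps(remaining: List[int], max_batch: int) -> int:
    """Freed slots are refilled from the queue before every decode step."""
    waiting = deque(remaining)
    active: List[int] = []
    steps = 0
    while waiting or active:
        # Insert waiting requests into any free slots.
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        # One decode step: every active request emits one token;
        # requests that have emitted their last token leave the batch.
        active = [r - 1 for r in active if r - 1 > 0]
        steps += 1
    return steps


workload = [8, 1, 1, 1, 1, 1, 1]  # tokens still needed per request
print(static_batching_steps(workload, max_batch=2))
print(continuous_batching_steps(workload, max_batch=2))
```

In this workload the long request no longer forces short requests to queue behind it, so the total number of decode steps drops from 11 to 8.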

Decode-maximal batching with chunked prefill interleaving (Microsoft, 2025) splits prefill requests into equal-sized chunks and constructs a hybrid micro-batch with a single prefill chunk filling the compute headroom left by a maximal number of decode requests. This prevents prefill computation from monopolizing GPU resources and blocking decode steps, improving both throughput and time-to-first-token latency simultaneously.
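A minimal sketch of assembling such a hybrid micro-batch is shown below. The per-iteration `token_budget`, the function names, and the rule of taking at most one chunk are assumptions made for illustration; the final chunk of a prompt may also be shorter than the others when the prompt length is not a multiple of the chunk size.

```python
from typing import List, Tuple


def split_prefill(prompt_len: int, chunk: int) -> List[int]:
    """Split a prompt of prompt_len tokens into chunks of `chunk` tokens."""
    full, rest = divmod(prompt_len, chunk)
    return [chunk] * full + ([rest] if rest else [])


def build_hybrid_batch(num_decodes: int, prefill_chunks: List[int],
                       token_budget: int) -> Tuple[int, int]:
    """Return (decode_tokens, prefill_tokens) for one iteration.

    Decodes are admitted first (one token each); a single prefill chunk is
    added only if it fits in the leftover headroom. The chunk is removed
    from prefill_chunks in place when it is scheduled.
    """
    decode_tokens = min(num_decodes, token_budget)
    headroom = token_budget - decode_tokens
    prefill_tokens = 0
    if prefill_chunks and prefill_chunks[0] <= headroom:
        prefill_tokens = prefill_chunks.pop(0)
    return decode_tokens, prefill_tokens


chunks = split_prefill(prompt_len=1000, chunk=256)
print(chunks)
print(build_hybrid_batch(num_decodes=100, prefill_chunks=chunks, token_budget=512))
```

Because decodes are admitted before the prefill chunk, a long prompt can never starve ongoing generations; it simply takes more iterations to finish its prefill.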

Selective operation batching (FriendliAI, 2025) addresses the variable-length problem. Because requests differ in input length, output length, and KV-cache state length, naive static batching forces all requests to the same padded length, wasting compute on padding tokens. FriendliAI's approach batches only length-invariant operations (e.g., linear projections) across all requests, while length-sensitive operations (e.g., attention) are processed individually per request.
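The split between length-invariant and length-sensitive operations can be sketched in plain Python. This is a toy under assumptions, not FriendliAI's implementation: the projection is a bare matrix multiply, attention uses the projected tokens as queries, keys, and values for brevity, and all function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def linear(rows: List[Vector], weight: List[Vector]) -> List[Vector]:
    """Apply a d_in x d_out weight matrix to every row; independent of request length."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*weight)]
            for row in rows]


def attention(tokens: List[Vector]) -> List[Vector]:
    """Softmax self-attention within one request (Q = K = V = tokens for brevity)."""
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) for k in tokens]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[d] for w, v in zip(weights, tokens)) / total
                    for d in range(len(q))])
    return out


def selective_batch_layer(requests: List[List[Vector]],
                          weight: List[Vector]) -> List[List[Vector]]:
    # Length-invariant op: flatten all tokens into one batch, no padding needed.
    flat = [tok for req in requests for tok in req]
    projected = linear(flat, weight)
    # Length-sensitive op: split back and run attention per request.
    outputs, pos = [], 0
    for req in requests:
        outputs.append(attention(projected[pos:pos + len(req)]))
        pos += len(req)
    return outputs


identity = [[1.0, 0.0], [0.0, 1.0]]
reqs = [[[1.0, 2.0]], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]]
print([len(o) for o in selective_batch_layer(reqs, identity)])
```

Requests of length 1 and 3 share a single projection call over four token rows, while attention never sees tokens from another request and no padding tokens are computed.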

The academic foundation for streaming batch management comes from UC Berkeley's 2020 work on batched beam search, which demonstrated up to 71% runtime reduction by periodically refilling a decoding batch as candidates terminate—rather than waiting for the entire batch to complete. This streaming refill principle directly informs modern continuous batching implementations.

  • 71%: runtime reduction via streaming refill vs. fixed-width beam search (UC Berkeley)
  • 2: inference phases (parallelizable prefill + sequential decode)
  • 3: distinct batching strategies patented 2024–2025
  • EP: FriendliAI selective batching patent jurisdiction (2025)
Key Batching Innovations
  • Chunked prefill prevents GPU monopolization
  • Selective batching eliminates padding waste
  • Output length prediction for proactive scheduling
  • Energy-aware DRL power selection (Southeast Univ.)
  • Streaming refill for continuous batch throughput
Patent Landscape Data

Visualizing the LLM Inference Innovation Landscape

Data derived from 20+ patents and technical disclosures spanning 2024–2026, analyzed via PatSnap Eureka's patent intelligence platform.

Patent Filings by Assignee (2024–2026)

Microsoft accounts for 3+ distinct patent families; Google holds 5 filings across WO, US, IN, and CN jurisdictions. Chinese academic and industrial groups collectively contribute 6 patent families.

[Figure: horizontal bar chart of LLM inference patent filings by assignee, 2024–2026. Chinese groups: 6 families; Google LLC: 5 filings; Microsoft: 3 families; Samsung: 1; Qualcomm: 1; FriendliAI: 1; Snowflake: 1. Source: PatSnap Eureka patent analysis.]

Innovation Category Distribution

Speculative decoding variants account for the majority of filings, with continuous/dynamic batching and model architecture optimizations forming the remaining categories across the 20+ patent dataset.

[Figure: donut chart of LLM inference innovation categories across the 20+ patent dataset. Speculative decoding: ~55% of filings; continuous/dynamic batching: ~25%; model architecture optimizations: ~20%. Source: PatSnap Eureka, 2024–2026 patent analysis.]

Model Architecture

Architectural Approaches That Reduce Per-Token Compute

Beyond algorithmic frameworks, several architectural innovations directly reduce per-token compute cost or the total number of tokens required, with applications ranging from cloud serving to on-device inference.

Dynamic Model Selection Routing (Google, 2024)

Each incoming request is dynamically matched to the most computationally efficient model capable of handling it accurately. Smaller models—whether separately trained, pruned, or quantized—are preferred for requests where they yield adequate quality, while larger models are reserved for requests requiring higher capability. This avoids the fixed overhead of always routing to the largest model.
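The routing rule can be sketched as "cheapest model that is good enough". Everything named here is hypothetical: the `Candidate` record, the per-model `quality` estimator, and the `min_quality` threshold are illustrative assumptions, and the page does not say how the patent estimates whether a smaller model is adequate.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    name: str
    cost: float                       # relative compute cost per request
    quality: Callable[[str], float]   # assumed estimator of answer quality in [0, 1]


def route(request: str, candidates: List[Candidate], min_quality: float) -> str:
    """Pick the cheapest candidate whose estimated quality meets the threshold."""
    for model in sorted(candidates, key=lambda m: m.cost):
        if model.quality(request) >= min_quality:
            return model.name
    # No candidate qualifies: fall back to the most capable (most expensive) model.
    return max(candidates, key=lambda m: m.cost).name


candidates = [
    Candidate("large", cost=10.0, quality=lambda r: 0.95),
    # Toy heuristic: the small model is assumed adequate only for short requests.
    Candidate("small", cost=1.0, quality=lambda r: 0.9 if len(r.split()) < 20 else 0.4),
]
print(route("What is the capital of France?", candidates, min_quality=0.8))
print(route("word " * 50, candidates, min_quality=0.8))
```

In practice the quality estimator is the hard part; the routing loop itself stays this simple regardless of whether the smaller candidates are separately trained, pruned, or quantized.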

Two-Stage LLM Bridging Pipeline (Google, 2025)

A smaller LLM generates an immediate partial response that is rendered to the user without delay, while a larger LLM begins generating a refined continuation starting from the partial response. This pipeline minimizes perceived time-to-first-token by decoupling the speed of the immediate response from the quality of the final output—particularly valuable for voice assistant applications.
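The control flow of the bridging pipeline can be sketched as a generator that emits the small model's text first and the large model's continuation second. Both model callables are hypothetical stand-ins, and the sketch shows only the ordering of outputs, not how a real system would stream tokens or overlap the two models.

```python
from typing import Callable, Iterator


def bridged_response(prompt: str,
                     small_llm: Callable[[str], str],
                     large_llm: Callable[[str, str], str]) -> Iterator[str]:
    """Yield the fast partial response, then the larger model's continuation."""
    partial = small_llm(prompt)
    yield partial                      # can be rendered to the user immediately
    # The larger model continues from the partial response rather than restarting.
    yield large_llm(prompt, partial)


# Toy stand-ins (assumptions for illustration).
small = lambda prompt: "Sure, "
large = lambda prompt, partial: "here is the detailed answer."

for piece in bridged_response("Explain chunked prefill.", small, large):
    print(piece, end="")
print()
```

The user-visible latency is set by the small model alone, while the final text quality is dominated by the large model's continuation.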

Two further approaches in the dataset, PAD-token parallel decoding (Tianjin University) and intermediate-layer speculation (Shandong Inspur), enable parallel decoding without separate draft models, which matters for resource-constrained deployments.
Competitive Intelligence

Key Assignees & Their Innovation Focus

Analysis of the patent data reveals a concentrated set of major technology companies and emerging AI infrastructure specialists active in this space. Multi-jurisdictional filings signal global IP strategy.

Assignee | Primary Technique | Key Innovation | Jurisdiction | Year
Microsoft Technology Licensing LLC | Speculative Decoding + Chunked Prefill | Guidance vector feedback; selective per-context switching; decode-maximal batching | US, WO | 2025
Google LLC | Model Routing + Multi-Model Bridging | Dynamic model selection routing; two-stage partial response pipeline | WO, US, IN, CN | 2024–2025
Samsung Electronics | Self-Speculative Decoding | Compressed base model as draft head; shared weights; no separate model | US | 2025
Qualcomm Incorporated | Beam-Search Speculative Decoding | Token subset proposals via beam search for richer verification | US | 2025
Scheduling & Energy Optimization

Proactive Scheduling via Output Length Prediction

Scheduling at the request level is addressed by a Chinese academic patent from Shandong University of Finance and Economics (2025), which fine-tunes a small auxiliary model to predict output response length before inference begins. These length predictions are used to build better-balanced batches and implement a priority scheduling policy. An error-handling mechanism compares predicted versus actual output length at runtime, updating scheduling decisions dynamically.
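The effect of length prediction on batch balance can be sketched with a toy scheduler. This is an illustration under assumptions: `predict_len` stands in for the fine-tuned auxiliary model, the "idle slot-steps" metric and the 20% tolerance in the error check are hypothetical choices, and shortest-first ordering is only one possible priority policy.

```python
from typing import Callable, Dict, List


def build_batches(requests: List[str], predict_len: Callable[[str], int],
                  batch_size: int) -> List[List[str]]:
    """Sort by predicted output length so each batch holds similar-length requests."""
    ordered = sorted(requests, key=predict_len)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


def idle_slot_steps(batches: List[List[str]], actual_len: Callable[[str], int]) -> int:
    """Decode steps during which a finished request's slot waits for its batch."""
    return sum(max(actual_len(r) for r in batch) - actual_len(r)
               for batch in batches for r in batch)


def exceeds_prediction(predicted: int, generated: int, tolerance: float = 0.2) -> bool:
    """Runtime check: flag a request whose output has overrun its prediction."""
    return generated > predicted * (1 + tolerance)


lengths: Dict[str, int] = {"a": 10, "b": 200, "c": 12, "d": 190}
arrival_order = [["a", "b"], ["c", "d"]]
balanced = build_batches(list(lengths), lengths.__getitem__, batch_size=2)
print(idle_slot_steps(arrival_order, lengths.__getitem__))
print(idle_slot_steps(balanced, lengths.__getitem__))
```

Grouping similar predicted lengths cuts the idle slot-steps in this toy workload from 368 to 12; a request flagged by `exceeds_prediction` would be a candidate for rescheduling.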

Southeast University's energy-aware scheduling approach (2025) uses a fine-tuned Qwen2-1.5B model to predict the output token count of a target Llama3-8B model on each request. The predicted lengths feed into a balanced sorting and scheduling policy that improves throughput, and a deep reinforcement learning power selection algorithm further reduces energy consumption while meeting latency SLAs. This is notable as one of the few approaches in the dataset to jointly optimize latency and energy efficiency.

These scheduling approaches complement the hardware-utilization optimizations of dynamic batching: they operate at the request scheduling layer rather than the kernel or memory management layer. Together, they form a complete stack from hardware utilization to per-request SLA management. The theoretical foundations underpinning these scheduling strategies are covered extensively in IEEE and ACM publications, while PatSnap's domain-specific analytics enable practitioners to map the patent landscape across all three optimization layers simultaneously.

[Figure: LLM request scheduling pipeline, as patented by Shandong University of Finance and Economics and Southeast University. 1. Predict output token length with a fine-tuned auxiliary model (e.g. Qwen2-1.5B). 2. Balance batch composition via sorted scheduling of heterogeneous requests. 3. Priority scheduling plus DRL power selection to reduce energy while meeting the latency SLA. 4. Runtime error correction: compare predicted vs. actual length and update dynamically. Source: PatSnap Eureka, 2025 patents.]

References

  1. Expediting Generative Token Production using Speculative Sampling, Added Guidance, and Language Models of Different Capacities — Microsoft Technology Licensing LLC, 2025
  2. Expediting generative token production using speculative sampling, added guidance, and language models of different capacities (WO) — Microsoft Technology Licensing LLC, 2025
  3. Selective speculative decoding — Microsoft Technology Licensing LLC, 2025
  4. Efficient self-speculative decoding architecture for increasing LLM inference throughput — Samsung Electronics Co., Ltd., 2025
  5. Efficient speculative decoding in autoregressive generative artificial intelligence models — Qualcomm Incorporated, 2025
  6. Suffix-based speculative token decoding for artificial intelligence model — Snowflake Inc., 2026
  7. Large language model inference by piggybacking decodes with chunked prefills — Microsoft Technology Licensing LLC, 2025
  8. Dynamic batching for inference system for transformer-based generation tasks — FriendliAI Inc., 2025
  9. Dynamic selection from among multiple candidate generative models with differing computational efficiencies (WO) — Google LLC, 2024
  10. Dynamic selection from among multiple candidate generative models with differing computational efficiencies (US) — Google LLC, 2024
  11. LLM latency reduction via bridging multiple LLMs of differing sizes — Google LLC, 2025
  12. A large language model service request scheduling method and system — Shandong University of Finance and Economics, 2025
  13. A LLM inference computing service energy consumption optimization scheduling method based on task length prediction — Southeast University, 2025
  14. A speculative decoding-based LLM inference acceleration method and apparatus — China North Vehicle Research Institute, 2024
  15. A large language model inference optimization method based on cascading and speculative decoding strategies — Paiyao Cloud Computing (Shanghai) Co., Ltd., 2024
  16. A large language model training and decoding method — Tianjin University, 2024
  17. LLM acceleration method based on intermediate-layer decoding — Shandong Inspur Science Research Institute Co., Ltd., 2024
  18. A Streaming Approach For Efficient Batched Beam Search — UC Berkeley, 2020
  19. arXiv — AI and Machine Learning preprints
  20. IEEE — Institute of Electrical and Electronics Engineers
  21. ACM — Association for Computing Machinery

All data and statistics on this page are sourced from the references above and from PatSnap's proprietary innovation intelligence platform.
