LLM Token Latency: Speculative Decoding — PatSnap Eureka
Reduce Token Generation Latency with Speculative Decoding & Continuous Batching
Autoregressive token generation is inherently sequential and compute-inefficient under high-throughput serving conditions. This analysis of 20+ patents from Microsoft, Google, Samsung, Qualcomm, and others reveals the state-of-the-art strategies for breaking the one-token-per-step bottleneck in large language model serving.
Breaking the One-Token-Per-Step Bottleneck
Speculative decoding reframes autoregressive generation as a cooperative inference problem. A fast draft model proposes candidate tokens, and the target model verifies all of them in a single forward pass, so one step can yield multiple tokens. According to the patent landscape analysis, six distinct architectural variants have emerged across 2024–2026 filings; each is described below.
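The draft-then-verify cycle can be sketched as follows. This is a minimal greedy variant under stated assumptions, not any assignee's patented method: `speculative_step`, the `NextToken` callable interface, and the parameter `k` are hypothetical names introduced for illustration, and the per-position loop over the target only simulates what a real system would compute in one batched forward pass.

```python
from typing import Callable, List

# Hypothetical model interface: maps a token prefix to the next token (greedy).
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """Return the tokens emitted by one draft-then-verify cycle (greedy variant)."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target model scores every position. A real system does this in
    #    one batched forward pass; this loop only simulates that pass.
    target_choices = [target(prefix + proposed[:i]) for i in range(k + 1)]

    # 3. Accept the longest run of draft tokens that the target agrees with.
    accepted: List[int] = []
    for i in range(k):
        if proposed[i] != target_choices[i]:
            break
        accepted.append(proposed[i])

    # 4. Always emit the target's own token at the first mismatch (or the
    #    extra token after k acceptances), so each cycle yields >= 1 token.
    accepted.append(target_choices[len(accepted)])
    return accepted
```

Because step 4 always emits the target's own choice, the output of this greedy variant matches what the target would have produced alone; the draft only changes how many target passes are needed.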
Guidance-Augmented Speculative Decoding
The target model not only accepts or rejects draft tokens but also produces guidance vector information fed back to the draft model in subsequent cycles. This guidance informs the draft model of the embedding space used by the target model, tightening alignment between the two models and increasing the acceptance rate of proposed tokens.
Guidance vector feedback loop
Selective Speculative Decoding
The system dynamically decides, on a per-iteration basis, whether speculative decoding is beneficial. Output tokens are partitioned into two portions: one computed via speculative decoding using drafting models, and a second computed directly at the primary model. This avoids the overhead of draft model invocation when acceptance rates would be low.
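One way to picture the per-iteration decision is a gate that tracks the recent acceptance rate and speculates only when the expected payoff exceeds the cost. The sketch below is an illustrative heuristic, not the rule claimed in the patent: the class `SpeculationGate`, the cost ratio, the smoothing factor, and the assumption that draft tokens are accepted independently are all hypothetical.

```python
class SpeculationGate:
    """Decides per iteration whether to invoke the draft model (illustrative)."""

    def __init__(self, draft_len: int, draft_cost_ratio: float,
                 smoothing: float = 0.2) -> None:
        self.k = draft_len            # tokens drafted per cycle
        self.c = draft_cost_ratio     # cost of a draft step / cost of a target step
        self.smoothing = smoothing    # weight given to the newest observation
        self.acceptance = 0.5         # running estimate; neutral starting guess

    def expected_speedup(self) -> float:
        a, k = self.acceptance, self.k
        # Expected tokens per cycle if each draft token is accepted
        # independently with probability a: 1 + a + a^2 + ... + a^k.
        expected_tokens = sum(a ** i for i in range(k + 1))
        cycle_cost = k * self.c + 1.0   # k draft steps plus one target pass
        return expected_tokens / cycle_cost

    def should_speculate(self) -> bool:
        # Plain decoding yields exactly one token per unit of target cost.
        return self.expected_speedup() > 1.0

    def record(self, accepted: int) -> None:
        """Update the estimate after a cycle in which `accepted` of k drafts passed."""
        observed = accepted / self.k
        self.acceptance += self.smoothing * (observed - self.acceptance)
```

A practical caveat of this design: once the gate turns speculation off, no new acceptance observations arrive, so a real system would likely need an occasional probe cycle to notice when speculation becomes worthwhile again.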
Adaptive per-context switching
Self-Speculative Decoding
This variant eliminates the need for a separate draft model entirely. Specific transformer blocks, identified through parameter analysis, are compressed or removed from the base model, and the resulting lighter model is fine-tuned to serve as the draft head. The fine-tuned model shares weights with the target, reducing memory overhead and avoiding KV-cache duplication.
No separate draft model needed
Beam-Search-Based Speculative Decoding
The draft model generates not a single candidate sequence but a set of token subsets, each corresponding to a beam in a beam search. These candidate sets are passed en masse to the target model for verification, allowing selection of the best sequence from a richer proposal distribution—particularly advantageous when greedy decoding misses high-probability token paths.
Richer beam-based proposals
Suffix-Based Hybrid Speculative Decoding
A dual-path architecture combines suffix-based speculative decoding—leveraging repetitive token patterns in agentic workloads by matching suffixes of the current context against prior generation history—with the conventional draft-model approach. The system adapts to workload type at runtime, accelerating both repetitive and novel token generation patterns.
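The suffix-matching path can be illustrated with a small lookup: find the most recent place in prior output where the tail of the current context occurred, and propose whatever followed it. This is a sketch under stated assumptions; `suffix_draft` and its parameters are hypothetical, and the linear scan stands in for the indexed structure (such as a suffix tree) that a production system would more plausibly use.

```python
from typing import List, Sequence


def suffix_draft(context: Sequence[int], history: Sequence[int],
                 max_suffix: int = 4, max_draft: int = 8) -> List[int]:
    """Propose draft tokens by locating the context's suffix in prior output."""
    # Try the longest suffix first so the match is as specific as possible.
    for n in range(min(max_suffix, len(context)), 0, -1):
        suffix = list(context[-n:])
        # Scan from the most recent occurrence backwards.
        for start in range(len(history) - n, -1, -1):
            if list(history[start:start + n]) == suffix:
                continuation = list(history[start + n:start + n + max_draft])
                if continuation:
                    return continuation
    # No usable match: the caller falls back to the conventional draft model.
    return []
```

Returning an empty list on a miss is what makes the dual-path design simple to wire up: the caller can route to the draft model whenever the suffix path has nothing to offer.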
Agentic workload optimized
Interrupt-Driven Concurrent Verification
The small language model runs in parallel with the large model, generating a number (NUM) of candidate token segments and submitting them incrementally. The target model interrupts its own inference to perform probabilistic verification on each received segment, which keeps the pipeline continuously utilized and minimizes idle time on the target model.
Concurrent pipeline execution
Maximizing GPU Utilization Across Heterogeneous Request Streams
The decode phase of LLM inference suffers from poor GPU compute utilization because generating one token per request per step severely underloads parallel hardware. Continuous batching addresses this by allowing new requests to be inserted into an active batch at any point, rather than waiting for all requests in a batch to complete. According to PatSnap's innovation platform, three distinct batching strategies have been patented across 2024–2025.
Decode-maximal batching with chunked prefill interleaving (Microsoft, 2025) splits prefill requests into equal-sized chunks and constructs a hybrid micro-batch with a single prefill chunk filling the compute headroom left by a maximal number of decode requests. This prevents prefill computation from monopolizing GPU resources and blocking decode steps, improving both throughput and time-to-first-token latency simultaneously.
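The construction of one hybrid micro-batch can be sketched as below. The names `PrefillRequest`, `build_hybrid_batch`, `token_budget`, and `chunk_size` are hypothetical, and the policy shown (take one chunk from the head of the prefill queue, then fill the remaining token budget with decodes at one token each) is an assumed reading of the description rather than the patented procedure.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PrefillRequest:
    request_id: str
    remaining_prompt_tokens: int


def build_hybrid_batch(
    decode_ids: List[str],
    prefill_queue: List[PrefillRequest],
    token_budget: int,
    chunk_size: int,
) -> Tuple[List[str], Optional[Tuple[str, int]]]:
    """Return (decode request ids, optional (prefill id, chunk tokens)) for one step."""
    chunk: Optional[Tuple[str, int]] = None
    used = 0
    if prefill_queue:
        head = prefill_queue[0]
        # At most one prefill chunk per micro-batch, capped by the chunk size.
        size = min(chunk_size, head.remaining_prompt_tokens, token_budget)
        head.remaining_prompt_tokens -= size
        if head.remaining_prompt_tokens == 0:
            prefill_queue.pop(0)  # prompt fully processed; request can start decoding
        chunk = (head.request_id, size)
        used = size
    # Each decode request contributes one token; pack as many as still fit.
    decodes = decode_ids[: token_budget - used]
    return decodes, chunk
```

Capping the prefill at one chunk per step is what keeps a long prompt from blocking decodes: a 300-token prompt with a 128-token chunk size is spread over three steps, and decode requests ride along in each of them.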
Selective operation batching (FriendliAI, 2025) addresses the variable-length problem. Because requests differ in input length, output length, and KV-cache state length, naive static batching forces all requests to the same padded length, wasting compute on padding tokens. FriendliAI's approach batches only length-invariant operations (e.g., linear projections) across all requests, while length-sensitive operations (e.g., attention) are processed individually per request.
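The split between length-invariant and length-sensitive operations can be shown with a toy layer in pure Python. Everything here is illustrative: `linear`, `self_attention`, and `selective_batch_layer` are hypothetical helpers, and the attention uses Q = K = V with no masking purely to keep the sketch short. The point is only that the projection runs once over the concatenated tokens of all requests, with no padding, while attention runs per request.

```python
import math
from typing import List

Matrix = List[List[float]]


def linear(rows: Matrix, weight: Matrix) -> Matrix:
    """Multiply each row by the weight matrix (shape: d_in x d_out)."""
    cols = list(zip(*weight))
    return [[sum(x * w for x, w in zip(row, col)) for col in cols] for row in rows]


def self_attention(x: Matrix) -> Matrix:
    """Unmasked single-head attention with Q = K = V = x, for one request."""
    scale = math.sqrt(len(x[0]))
    out: Matrix = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in x]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, x)) / total
                    for j in range(len(x[0]))])
    return out


def selective_batch_layer(requests: List[Matrix], weight: Matrix) -> List[Matrix]:
    # Length-invariant op: concatenate every request's tokens and project once.
    flat = [row for req in requests for row in req]
    projected = linear(flat, weight)
    # Length-sensitive op: split back by original length and attend per request.
    outputs: List[Matrix] = []
    offset = 0
    for req in requests:
        outputs.append(self_attention(projected[offset:offset + len(req)]))
        offset += len(req)
    return outputs
```

The result is identical to running the layer on each request separately; what changes is that the projection sees one large input instead of several small or padded ones.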
The academic foundation for streaming batch management comes from UC Berkeley's 2020 work on batched beam search, which demonstrated up to 71% runtime reduction by periodically refilling a decoding batch as candidates terminate—rather than waiting for the entire batch to complete. This streaming refill principle directly informs modern continuous batching implementations.
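The refill principle is easy to see in a small simulation. The function below is a hypothetical sketch, not code from the cited work or from any serving framework: each request is an `(id, tokens_to_generate)` pair, every active request emits one token per step, and a freed slot is refilled from the waiting queue before the next step.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


def continuous_batching(requests: List[Tuple[str, int]],
                        batch_size: int) -> Tuple[int, Dict[str, int]]:
    """Simulate decode steps; return (total steps, completion step per request)."""
    waiting: Deque[Tuple[str, int]] = deque(requests)
    active: Dict[str, int] = {}        # request id -> tokens still to generate
    finished_at: Dict[str, int] = {}   # request id -> step on which it completed
    step = 0
    while waiting or active:
        # Refill free slots right away instead of waiting for the batch to drain.
        while waiting and len(active) < batch_size:
            rid, n = waiting.popleft()
            active[rid] = n
        step += 1
        for rid in list(active):
            active[rid] -= 1           # one token per active request per step
            if active[rid] <= 0:
                finished_at[rid] = step
                del active[rid]
    return step, finished_at
```

For the four requests `[("a", 1), ("b", 4), ("c", 2), ("d", 1)]` with two slots, this loop finishes in 4 steps, whereas running them as two fixed batches in arrival order would take 4 + 2 = 6 steps.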
Visualizing the LLM Inference Innovation Landscape
Data derived from 20+ patents and technical disclosures spanning 2024–2026, analyzed via PatSnap Eureka's patent intelligence platform.
Patent Filings by Assignee (2024–2026)
Microsoft leads with 3+ distinct patent families; Google holds 5 filings across WO, US, India, and CN jurisdictions. Chinese academic groups collectively contribute 6 patent families.
Innovation Category Distribution
Speculative decoding variants account for the majority of filings, with continuous/dynamic batching and model architecture optimizations forming the remaining categories across the 20+ patent dataset.
Architectural Approaches That Reduce Per-Token Compute
Beyond algorithmic frameworks, several architectural innovations directly reduce per-token compute cost or the total number of tokens required, with applications ranging from cloud serving to on-device inference.
Dynamic Model Selection Routing (Google, 2024)
Each incoming request is dynamically matched to the most computationally efficient model capable of handling it accurately. Smaller models—whether separately trained, pruned, or quantized—are preferred for requests where they yield adequate quality, while larger models are reserved for requests requiring higher capability. This avoids the fixed overhead of always routing to the largest model.
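A routing policy of this kind can be sketched as "cheapest model that clears a quality bar". The `Candidate` dataclass, the `predicted_quality` estimator, and the `min_quality` threshold are hypothetical; the source does not say how adequacy is predicted, so the estimator is left as a pluggable callable.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    name: str
    relative_cost: float
    # Hypothetical estimator: predicted probability that this model
    # handles the given request adequately.
    predicted_quality: Callable[[str], float]


def route(request: str, candidates: List[Candidate], min_quality: float) -> Candidate:
    """Pick the cheapest candidate whose predicted quality clears the bar."""
    by_cost = sorted(candidates, key=lambda c: c.relative_cost)
    for candidate in by_cost:
        if candidate.predicted_quality(request) >= min_quality:
            return candidate
    # Nothing qualifies: fall back to the most expensive model, which this
    # sketch assumes is also the most capable.
    return by_cost[-1]
```

Sorting by cost first means the larger model is consulted only when every cheaper candidate is predicted to fall short, which is the behavior the paragraph above describes.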
Two-Stage LLM Bridging Pipeline (Google, 2025)
A smaller LLM generates an immediate partial response that is rendered to the user without delay, while a larger LLM begins generating a refined continuation starting from the partial response. This pipeline minimizes perceived time-to-first-token by decoupling the speed of the immediate response from the quality of the final output—particularly valuable for voice assistant applications.
Key Assignees & Their Innovation Focus
Analysis of the patent data reveals a concentrated set of major technology companies and emerging AI infrastructure specialists active in this space. Multi-jurisdictional filings signal global IP strategy.
| Assignee | Primary Technique | Key Innovation | Jurisdiction | Year |
|---|---|---|---|---|
| Microsoft Technology Licensing LLC | Speculative Decoding + Chunked Prefill | Guidance vector feedback; selective per-context switching; decode-maximal batching | US, WO | 2025 |
| Google LLC | Model Routing + Multi-Model Bridging | Dynamic model selection routing; two-stage partial response pipeline | WO, US, IN, CN | 2024–2025 |
| Samsung Electronics | Self-Speculative Decoding | Compressed base model as draft head; shared weights; no separate model | US | 2025 |
| Qualcomm Incorporated | Beam-Search Speculative Decoding | Token subset proposals via beam search for richer verification | US | 2025 |
Proactive Scheduling via Output Length Prediction
Scheduling at the request level is addressed by a Chinese academic patent from Shandong University of Finance and Economics (2025), which fine-tunes a small auxiliary model to predict output response length before inference begins. These length predictions are used to build better-balanced batches and implement a priority scheduling policy. An error-handling mechanism compares predicted versus actual output length at runtime, updating scheduling decisions dynamically.
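How predicted lengths can shape batches is sketched below. The function names, the shortest-first ordering, and the 1.5x growth factor used when a request outruns its prediction are all assumptions made for illustration; the patent's actual priority policy and error-handling rule are not given in this article.

```python
from typing import Dict, List


def balanced_batches(predicted_len: Dict[str, int], batch_size: int) -> List[List[str]]:
    """Group requests with similar predicted output length into the same batch."""
    ordered = sorted(predicted_len, key=lambda rid: predicted_len[rid])  # shortest first
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


def wasted_steps(batch: List[str], actual_len: Dict[str, int]) -> int:
    """Idle decode steps in a static batch that waits for its longest member."""
    longest = max(actual_len[rid] for rid in batch)
    return sum(longest - actual_len[rid] for rid in batch)


def revise_prediction(predicted: int, generated_so_far: int, growth: float = 1.5) -> int:
    """If a request reaches its predicted length, raise the estimate for rescheduling."""
    if generated_so_far >= predicted:
        return max(predicted + 1, int(generated_so_far * growth))
    return predicted
```

With predicted lengths `{"a": 100, "b": 10, "c": 90, "d": 12}` and batches of two, grouping by prediction leaves 12 idle steps in total if the predictions hold, compared with 168 when the requests are batched in arrival order.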
Southeast University's energy-aware scheduling approach (2025) uses a fine-tuned Qwen2-1.5B model to predict the output token count of a target Llama3-8B model on each request. The predicted lengths feed into a balanced sorting and scheduling policy that improves throughput, and a deep reinforcement learning power selection algorithm further reduces energy consumption while meeting latency SLAs. This is notable as one of the few approaches in the dataset to jointly optimize latency and energy efficiency.
These scheduling approaches complement the batching optimizations described earlier: they operate at the request scheduling layer rather than the kernel or memory management layer. Together, they form a complete stack from hardware utilization to per-request SLA management. Work published in IEEE and ACM venues covers the theoretical foundations of these scheduling strategies, while PatSnap's domain-specific analytics let practitioners map the patent landscape across all three optimization layers.
LLM Token Latency Reduction — Key Questions Answered
Speculative decoding fundamentally reframes autoregressive generation as a cooperative inference problem between a fast, lightweight draft model and a slower, higher-capacity target model. The core principle is that the draft model proposes a sequence of candidate tokens, and the target model verifies all proposed tokens in a single forward pass—accepting those that match its own distribution and rejecting the rest. This allows the target model to effectively produce multiple tokens per forward pass, breaking the one-token-per-step bottleneck.
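For reference, "accepting those that match its own distribution" is commonly formalized in the research literature as speculative sampling. The notation below is an assumption and is not taken from the patents: q is the draft model's distribution, p the target model's distribution, γ the number of drafted tokens per cycle, and α the average per-token acceptance probability, treated as independent across positions.

```latex
% Acceptance test for a drafted token x, and the distribution used to
% resample a replacement when x is rejected:
\Pr[\text{accept } x] = \min\!\left(1,\ \frac{p(x)}{q(x)}\right),
\qquad
x_{\text{resampled}} \sim
\frac{\max\bigl(0,\ p(\cdot) - q(\cdot)\bigr)}
     {\sum_{x'} \max\bigl(0,\ p(x') - q(x')\bigr)}

% Expected number of tokens produced per target forward pass:
\mathbb{E}[\text{tokens per pass}] = \frac{1 - \alpha^{\gamma + 1}}{1 - \alpha}
```

As a worked instance, α = 0.8 and γ = 4 give (1 − 0.8^5) / 0.2 ≈ 3.36 tokens per target pass, compared with exactly one token per pass in plain autoregressive decoding.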
Selective speculative decoding dynamically decides, on a per-iteration basis, whether speculative decoding is beneficial at all. The computing system partitions output tokens into two portions: a first portion computed via speculative decoding using one or more drafting models, and a second portion computed directly at the primary model without speculative decoding. This context-sensitive logic avoids the overhead of draft model invocation in cases where speculative decoding would yield low acceptance rates, making the system adaptive rather than statically committed to a single decoding path.
Self-speculative decoding eliminates the need for a separate draft model entirely. Samsung's approach compresses or removes specific transformer blocks from the base model, identified through parameter analysis, and then fine-tunes the resulting lighter model to serve as the draft head. The fine-tuned model shares weights with the target model, reducing memory overhead while preserving the speculative verification mechanism. This single-model approach avoids the memory bandwidth and KV-cache overhead of maintaining two separate model instances, which makes it well suited to edge and on-device deployment.
In decode-maximal batching with chunked prefill interleaving, prefill requests are split into equal-sized chunks, and a hybrid micro-batch is constructed with a single prefill chunk filling the compute headroom left by a maximal number of decode requests. This prevents prefill computation from monopolizing GPU resources and blocking decode steps, which improves both throughput and time-to-first-token latency simultaneously. The mismatch between prefill-phase GPU saturation and decode-phase compute underutilization is identified as the root cause of inference inefficiency.
In proactive scheduling via output length prediction, a fine-tuned auxiliary model predicts output response length before inference begins. These length predictions are used to build better-balanced batches and implement a priority scheduling policy. An error-handling mechanism compares predicted versus actual output length at runtime, updating scheduling decisions dynamically. Southeast University's approach uses a fine-tuned Qwen2-1.5B model to predict the output token count of a target Llama3-8B model on each request, and a deep reinforcement learning power selection algorithm further reduces energy consumption while meeting latency SLAs.
Microsoft Technology Licensing LLC is the most prolific filer in this dataset, with at least three distinct patent families covering speculative decoding with guidance vector feedback, selective speculative decoding, and chunked prefill with decode-maximal batching. Google LLC holds multiple patent families on model selection routing and multi-model bridging, with multi-jurisdictional coverage including WO, US, India, and CN filings. Samsung Electronics, Qualcomm Incorporated, FriendliAI Inc., and Snowflake Inc. are also active, alongside Chinese academic and industrial groups including Southeast University, Shandong University of Finance and Economics, Tianjin University, and others.
References
- Expediting Generative Token Production using Speculative Sampling, Added Guidance, and Language Models of Different Capacities — Microsoft Technology Licensing LLC, 2025
- Expediting generative token production using speculative sampling, added guidance, and language models of different capacities (WO) — Microsoft Technology Licensing LLC, 2025
- Selective speculative decoding — Microsoft Technology Licensing LLC, 2025
- Efficient self-speculative decoding architecture for increasing LLM inference throughput — Samsung Electronics Co., Ltd., 2025
- Efficient speculative decoding in autoregressive generative artificial intelligence models — Qualcomm Incorporated, 2025
- Suffix-based speculative token decoding for artificial intelligence model — Snowflake Inc., 2026
- Large language model inference by piggybacking decodes with chunked prefills — Microsoft Technology Licensing LLC, 2025
- Dynamic batching for inference system for transformer-based generation tasks — FriendliAI Inc., 2025
- Dynamic selection from among multiple candidate generative models with differing computational efficiencies (WO) — Google LLC, 2024
- Dynamic selection from among multiple candidate generative models with differing computational efficiencies (US) — Google LLC, 2024
- LLM latency reduction via bridging multiple LLMs of differing sizes — Google LLC, 2025
- A large language model service request scheduling method and system — Shandong University of Finance and Economics, 2025
- A LLM inference computing service energy consumption optimization scheduling method based on task length prediction — Southeast University, 2025
- A speculative decoding-based LLM inference acceleration method and apparatus — China North Vehicle Research Institute, 2024
- A large language model inference optimization method based on cascading and speculative decoding strategies — Paiyao Cloud Computing (Shanghai) Co., Ltd., 2024
- A large language model training and decoding method — Tianjin University, 2024
- LLM acceleration method based on intermediate-layer decoding — Shandong Inspur Science Research Institute Co., Ltd., 2024
- A Streaming Approach For Efficient Batched Beam Search — UC Berkeley, 2020
- arXiv — AI and Machine Learning preprints
- IEEE — Institute of Electrical and Electronics Engineers
- ACM — Association for Computing Machinery
All data and statistics on this page are sourced from the references above and from PatSnap's proprietary innovation intelligence platform.