The patent landscape: who is building zero-shot robot generalization
The foundation model robotics patent landscape is dominated by a small number of technology companies making large, coordinated bets. Based on a dataset of more than 40 filings across CN, JP, KR, US, EP, WO, and BR jurisdictions spanning 2018 to 2026, Google LLC / GDM Holdings leads with approximately 12 filings — covering LLM grounding, multi-modal embeddings, hierarchical navigation, meta-learning, and environment-conditioned action sequencing. This breadth signals a deliberate strategy to own the full foundation model stack for robotics, not merely a single technique.
Beyond Google, Naver / Naver Labs has filed 4 patents, establishing a focused IP position around transformer-based meta-imitation learning, with filings active in both Japanese and Korean jurisdictions. Huawei Technologies, NVIDIA Corporation, and X Development LLC each account for 3 filings. Academic and industrial filers, including Tsinghua University, Nanjing University, Zhejiang University, Ping An Technology, and Beijing Humanoid Robot Innovation Center, are contributing innovations in simulation fine-tuning, state representation learning, and cross-modal world models, indicating a broadening research ecosystem.
The dominant technical approaches fall into four categories: transformer-based foundation models adapted via meta-learning; LLM grounding for natural language instruction following; hierarchical policy architectures; and domain adaptation for sim-to-real transfer. Each is examined in the sections below.
Transformer architectures and meta-learning: the generalization backbone
Transformer-based foundation models pre-trained across diverse tasks are the dominant backbone for zero-shot robot generalization, because they encode structural task priors that allow rapid policy adaptation from minimal new evidence at inference time — without retraining. Naver Labs has been particularly active in this space, filing two closely related patents on transformer-based meta-imitation learning. The core system comprises a transformer model trained using a two-phase meta-learning procedure: a “meta-train” phase using first demonstrations of each training task, and an “optimize” phase using second demonstrations. Critically, the number of demonstrations per training task is constrained to more than one but fewer than a fixed first predetermined number — explicitly fewer than 10 — enforcing a few-shot regime that forces the model to generalise across task structures rather than memorise specific demonstrations.
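The demonstration constraint and the two-phase split described above can be sketched as a small data-partitioning helper. This is a minimal illustration, not the patented procedure: the function name `split_demonstrations`, the `max_demos` parameter, and the even split between "first" and "second" demonstrations are all assumptions, since the source does not say how the two sets are chosen.

```python
from typing import List, Sequence, Tuple, TypeVar

Demo = TypeVar("Demo")


def split_demonstrations(
    demos: Sequence[Demo], max_demos: int = 10
) -> Tuple[List[Demo], List[Demo]]:
    """Split one task's demonstrations into 'first' (meta-train phase)
    and 'second' (optimize phase) sets."""
    n = len(demos)
    # Few-shot constraint from the text: more than one demonstration,
    # but fewer than the predetermined cap (fewer than 10).
    if not 1 < n < max_demos:
        raise ValueError(
            f"need more than 1 and fewer than {max_demos} demonstrations, got {n}"
        )
    # Assumption: an even split. The source does not specify the rule.
    mid = n // 2
    return list(demos[:mid]), list(demos[mid:])
```

With five demonstrations for a task, the helper would hand two to the meta-train phase and three to the optimize phase, while a task with ten or more demonstrations is rejected.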
Meta-imitation learning trains a model on a diverse set of tasks so that it learns how to learn from demonstrations, rather than learning any single task. At deployment, the model conditions on a small number of new demonstrations to rapidly adapt its policy to an unseen task — without gradient updates or retraining.
Google LLC reinforces this direction through a framework that jointly trains a meta-learning model using both imitation learning (from human-guided demonstrations) and reinforcement learning (from robot trial episodes). The trained meta-learning model can then perform single-shot or few-shot learning on new tasks by conditioning on new demonstrations at inference time. This combination — pretraining encodes structural task priors, meta-learning enables rapid conditioning on new task context — is characteristic of the zero-shot generalization paradigm: the model adapts its policy outputs dynamically based on minimal new evidence, without being retrained.
“A single large transformer model conditioned on tokenized goal images can execute multiple distinct manipulation tasks across different robot morphologies and control frequencies — without task-specific retraining.”
The use of a universal tokenized action interface extends transformer generalization further. GDM Holdings’ 2026 patent describes a sequence-modeling neural network — specifically a large transformer model — conditioned jointly on tokenized goal images and tokenized observation images. The architecture produces discrete output tokens from a shared vocabulary, allowing the same model to execute multiple distinct manipulation tasks (varying robot morphologies, degrees of freedom, control frequencies) without task-specific retraining. The patent explicitly characterises the system as a “self-improving general policy” as opposed to task-specific controllers — a defining feature of zero-shot generalization.
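One common way to obtain discrete action tokens from a shared vocabulary is uniform binning of each action dimension. The sketch below assumes that scheme purely for illustration; the source does not say how the patent tokenizes actions, and `VOCAB_SIZE`, `tokenize_action`, `detokenize_action`, and the `[-1, 1]` action range are hypothetical.

```python
from typing import List, Sequence

VOCAB_SIZE = 256  # hypothetical size of the shared token vocabulary


def tokenize_action(
    action: Sequence[float],
    low: float = -1.0,
    high: float = 1.0,
    vocab_size: int = VOCAB_SIZE,
) -> List[int]:
    """Map each continuous action dimension to one discrete token."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)  # keep the value inside the range
        fraction = (clipped - low) / (high - low)
        # The upper bound maps to the last bin rather than overflowing.
        tokens.append(min(int(fraction * vocab_size), vocab_size - 1))
    return tokens


def detokenize_action(
    tokens: Sequence[int],
    low: float = -1.0,
    high: float = 1.0,
    vocab_size: int = VOCAB_SIZE,
) -> List[float]:
    """Map tokens back to the centre of their bins."""
    width = (high - low) / vocab_size
    return [low + (token + 0.5) * width for token in tokens]
```

Because every dimension maps into the same vocabulary, a 7-DoF arm and a 2-DoF gripper would differ only in how many tokens they emit per step, which is one way a single sequence model could serve robots with different degrees of freedom.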
Naver Labs’ transformer-based meta-imitation learning patents constrain training to fewer than 10 demonstrations per task, enforcing a few-shot regime that forces the model to learn task-transferable policies rather than memorise specific demonstrations, a key mechanism for zero-shot robot task generalization.
LLM grounding and natural language task specification
Large language models enable zero-shot robot task generalization by mapping free-form natural language instructions to executable robot skills without requiring per-task training data — the broad linguistic and world knowledge encoded during LLM pretraining substitutes for task-specific demonstration collection. Google LLC’s patent on Natural Language Control of a Robot discloses a two-stage grounding process: the LLM first processes a free-form natural language instruction to generate a probability distribution over possible interpretations (the “task-grounding measure”), and then the system cross-references current environmental state data to determine a “world-grounding measure” reflecting the probability of the skill being successful given the current scene. A robot skill is executed only when both grounding measures jointly satisfy a threshold — a dual-grounding mechanism that prevents hallucinated or unsafe actions in novel environments.
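The dual-grounding gate can be sketched in a few lines. The example below assumes that the two measures are combined by multiplication and compared with a single threshold; the source says only that they must "jointly satisfy a threshold", so the combination rule, the function name `select_skill`, and the default threshold are assumptions.

```python
from typing import Dict, Optional


def select_skill(
    task_grounding: Dict[str, float],
    world_grounding: Dict[str, float],
    threshold: float = 0.5,
) -> Optional[str]:
    """Return the best skill, or None if no skill clears the joint threshold.

    task_grounding:  probability that each skill matches the instruction.
    world_grounding: probability that each skill succeeds in the current scene.
    """
    best_skill, best_score = None, 0.0
    for skill, task_probability in task_grounding.items():
        # Assumed joint measure: product of the two probabilities.
        score = task_probability * world_grounding.get(skill, 0.0)
        if score > best_score:
            best_skill, best_score = skill, score
    return best_skill if best_score >= threshold else None
```

A skill that matches the instruction well but is unlikely to succeed in the current scene scores low and is not executed, which is the behaviour the dual-grounding description implies.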
GDM Holdings advances this direction with a “fast and slow” adaptation architecture. A generative foundation model processes human natural language feedback to predict candidate future dialog turns, each including a predicted robot action and a predicted human response, and selects actions from these candidates. The interaction session is then used as fine-tuning data for the generative model, combining rapid in-context adaptation during deployment with slower fine-tuning between sessions. This architecture allows a robot to be taught novel tasks through conversational interaction without manual policy redesign, as described in PatSnap’s analysis of the GDM Holdings filing.
Chain-of-thought reasoning has also emerged as a mechanism to bridge high-level language instructions and low-level robot control. A 2026 patent from Beijing Jijia Vision Technology describes constructing a chain-of-reasoning dataset to fine-tune a pretrained policy network, producing a strategy prediction model that outputs both interpretable chain-of-reasoning text and executable action data. The reward functions used during joint optimization (affordance reward, trajectory consistency reward, and output format reward) are specifically designed to promote cross-task and cross-scene generalization. According to WIPO’s analysis of AI patent trends, natural language grounding for robotic control is among the fastest-growing sub-categories in the broader AI patent space.
Google LLC’s Natural Language Control of a Robot patent (2025) discloses a dual-grounding mechanism for zero-shot LLM robot control: a task-grounding measure (probability distribution over instruction interpretations) combined with a world-grounding measure (probability of skill success given the current scene), where a skill executes only when both measures jointly satisfy a threshold.
Hierarchical policy architectures: separating planning from execution
Hierarchical policy architectures enable zero-shot task generalization by decoupling semantic planning — where foundation models excel — from fine-grained motor execution, so that the high-level model can adapt to new goals without retraining the entire policy stack. Google LLC’s patent on Robot Navigation Using High-Level Policy Model and Trained Low-Level Policy Model formalises this two-tier structure for mobile robot navigation: a high-level policy model trained via supervised learning on real-world observations and ground-truth navigation paths generates semantic navigation decisions, while a separately trained low-level policy model converts these high-level actions into precise, obstacle-avoiding motor commands.
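The two-tier interface described above can be illustrated with hand-written stand-ins for the two models. Both functions below are hypothetical placeholders for learned policies: in the patent the high-level and low-level models are trained, whereas here simple rules show only how a semantic decision is passed down and turned into a motor command. The action names, speeds, and distances are assumptions.

```python
from typing import Dict, Tuple

Point = Tuple[float, float]

# Unit directions for each semantic action (hypothetical action set).
DIRECTIONS: Dict[str, Point] = {
    "north": (0.0, 1.0),
    "south": (0.0, -1.0),
    "east": (1.0, 0.0),
    "west": (-1.0, 0.0),
    "stop": (0.0, 0.0),
}


def high_level_policy(robot_xy: Point, goal_xy: Point, tolerance: float = 0.1) -> str:
    """Stand-in for the learned high-level model: choose a semantic action."""
    dx = goal_xy[0] - robot_xy[0]
    dy = goal_xy[1] - robot_xy[1]
    if abs(dx) < tolerance and abs(dy) < tolerance:
        return "stop"
    if abs(dx) >= abs(dy):
        return "east" if dx > 0 else "west"
    return "north" if dy > 0 else "south"


def low_level_policy(
    semantic_action: str,
    obstacle_distance: float,
    max_speed: float = 0.5,
    safe_distance: float = 0.3,
) -> Point:
    """Stand-in for the learned low-level model: produce a velocity command,
    slowing to a halt as the nearest obstacle gets closer."""
    ux, uy = DIRECTIONS[semantic_action]
    scale = max(0.0, min(1.0, obstacle_distance / safe_distance))
    return (max_speed * scale * ux, max_speed * scale * uy)
```

The point of the split is that swapping `high_level_policy` for a different goal-selection model leaves `low_level_policy` untouched, which mirrors the claim that the high-level tier can adapt without retraining the whole stack.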
This paradigm is extended to manipulation in Google LLC’s patent on Determining Environment-Conditioned Action Sequences for Robotic Tasks, where a model combining a CNN for visual encoding and a sequence-to-sequence transformer for action planning determines not only which actions to perform but also their ordering, conditioned on the current visual scene. The system’s predicted action sequence dynamically changes based on observed state — for example, whether a cabinet is open or closed — demonstrating the foundation model’s ability to reason about preconditions and adapt task plans to new environmental configurations without explicit task-specific programming. Research published by IEEE on robot learning architectures similarly highlights hierarchical decomposition as a key enabler of generalisation beyond training distributions.
Mitsubishi Electric addresses long-horizon sequential task generalization through a pre-trained learning module that encodes a dictionary of motion primitives, combined with a graph-search-based planning module. Given a new task defined only by initial and goal states, the system searches the graph for a feasible path and selects motion primitives — parameterized as Dynamic Movement Primitives (DMPs) — to compose the task trajectory. This approach generalises to new tasks by recombining existing skill primitives in new sequences rather than training task-specific policies from scratch: a form of compositional zero-shot generalization. Huawei Technologies takes a related approach through reusable skill options, training an RL agent to extract minimally correlated feature-based pseudo-rewards and corresponding option sub-policies that can be reused as compositional primitives when learning new tasks.
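The idea of composing a new task by searching a graph of primitives can be sketched with a breadth-first search. The state names, the primitive names, and the `PRIMITIVES` graph below are hypothetical, and breadth-first search is used only as one simple graph-search choice; the source does not name the search algorithm, and the DMP parameterization of each primitive is not modelled here.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# Hypothetical primitive dictionary: state -> [(primitive name, resulting state)].
PRIMITIVES: Dict[str, List[Tuple[str, str]]] = {
    "gripper_empty_at_home": [("reach_part", "gripper_at_part")],
    "gripper_at_part": [("grasp", "part_grasped")],
    "part_grasped": [
        ("move_to_fixture", "part_over_fixture"),
        ("release", "gripper_at_part"),
    ],
    "part_over_fixture": [("insert", "part_inserted")],
}


def plan_primitives(
    initial: str, goal: str, graph: Dict[str, List[Tuple[str, str]]]
) -> Optional[List[str]]:
    """Breadth-first search for a primitive sequence from initial to goal."""
    queue = deque([(initial, [])])
    visited = {initial}
    while queue:
        state, plan = queue.popleft()
        if state == goal:
            return plan
        for primitive, next_state in graph.get(state, []):
            if next_state not in visited:
                visited.add(next_state)
                queue.append((next_state, plan + [primitive]))
    return None  # no feasible path between the two states
```

A task given only as an initial and a goal state is solved by recombining entries that already exist in the dictionary, so no task-specific policy has to be trained for it.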
Mitsubishi Electric’s 2025 patent on learning sequences in robot tasks describes a pre-trained dictionary of Dynamic Movement Primitives (DMPs) combined with a graph-search planning module, enabling zero-shot generalization to new tasks by recombining existing skill primitives in new sequences, without training task-specific policies from scratch.
Sim-to-real transfer: closing the deployment gap for foundation model robotics
Sim-to-real transfer remains the primary engineering bottleneck for deploying foundation models on physical robots, because large-scale training is only feasible in simulation — yet differences in visual appearance, physics, and sensor noise can cause significant policy degradation when the robot moves to the real world. The National University of Defense Technology’s 2024 patent addresses this by training a semantic abstraction neural network using adversarial domain adaptation to align simulated and real images in a shared semantic feature space, then training a reinforcement learning policy entirely in simulation before deploying it to the real robot. By reducing the original high-dimensional image space to a semantic representation, the method also decreases state space complexity and improves transfer efficiency.
Google LLC’s patent on Robot Action ML Model Training Using Multi-Modal Embeddings addresses the sim-to-real gap through feature-level domain adaptation: a variational information bottleneck (VIB) objective is applied to encoder layers to force domain-invariant feature extraction. Multiple modality-specific action models are trained in parallel, and their outputs are dynamically weighted based on analysis of embeddings generated during inference, allowing the system to adapt its sensory fusion strategy to environmental conditions at deployment time. The NIST framework for AI robustness testing similarly identifies domain shift as a primary reliability risk for deployed AI systems.
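For readers unfamiliar with the variational information bottleneck, its standard form from the machine learning literature is shown below. This is the general textbook objective, not necessarily the exact loss used in the patent; the symbols are assumptions introduced here: $x$ is the sensory input, $y$ the target action, $z$ the embedding produced by the encoder $p_\theta(z \mid x)$, $q_\phi(y \mid z)$ the action model, $r(z)$ a fixed prior, and $\beta$ a trade-off weight.

```latex
\mathcal{L}_{\mathrm{VIB}}
  = \mathbb{E}_{(x,y)}\,
    \mathbb{E}_{z \sim p_\theta(z \mid x)}
    \bigl[ -\log q_\phi(y \mid z) \bigr]
  + \beta \,
    \mathbb{E}_{x}\,
    \mathrm{KL}\bigl( p_\theta(z \mid x) \,\|\, r(z) \bigr)
```

The first term rewards embeddings that predict the action well, and the second penalises information in $z$ that goes beyond the prior. A larger $\beta$ compresses the embedding more, which is the mechanism by which domain-specific detail (for example simulator rendering artefacts) can be squeezed out of the features.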
DeepMind (Instinct Technologies / Yuanhui Technology) contributes a complementary architectural constraint: chaining a simulation-trained DNN (with fixed weights) to a real-world-trained DNN. The simulation-trained model provides learned features shared with the real-world DNN, which is initialized to replicate the simulation policy and then fine-tuned with real-world experience while keeping the simulation network frozen. This constraint prevents catastrophic forgetting of simulation-derived representations, enabling the pretrained foundation to serve as a stable generalization backbone across real-world task variants. NVIDIA Corporation prioritises sample efficiency in novel environments through a Guided Uncertainty-Aware Policy Optimization approach, where perceptual uncertainty drives dynamic switching between model-based and model-free strategies — with active filings in both 2021 and 2024 indicating sustained investment. The PatSnap resources hub provides further context on AI and robotics patent strategy.
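The uncertainty-driven switching attributed to NVIDIA above can be sketched as a simple dispatch rule. The direction of the switch (model-based control when perception is confident, model-free policy when it is not), the threshold value, and all names below are assumptions made for illustration; the source states only that perceptual uncertainty drives the switching.

```python
from typing import Callable, Sequence, Tuple

Observation = Sequence[float]
Action = Tuple[float, ...]
Controller = Callable[[Observation], Action]


def select_action(
    observation: Observation,
    perception_uncertainty: float,
    model_based: Controller,
    model_free: Controller,
    uncertainty_threshold: float = 0.05,
) -> Tuple[str, Action]:
    """Pick which controller acts, based on perceptual uncertainty.

    Returns the name of the chosen strategy together with its action.
    """
    if perception_uncertainty <= uncertainty_threshold:
        # Perception is trusted: follow the model-based controller.
        return "model_based", model_based(observation)
    # Perception is unreliable: fall back to the learned model-free policy.
    return "model_free", model_free(observation)
```

In practice the two callables would wrap a planner and a trained policy network; here any functions with the `Controller` signature can be passed in.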
DeepMind’s sim-to-real transfer architecture (Instinct Technologies / Yuanhui Technology, 2023–2026) chains a simulation-trained DNN with fixed weights to a real-world-trained DNN, preventing catastrophic forgetting of simulation-derived representations while enabling fine-tuning on real-world experience — a key mechanism for deploying foundation models on physical robots.