Foundation models for zero-shot robotics: 40+ patents

Foundation Models for Zero-Shot Robot Task Generalization — PatSnap Insights
AI & Robotics

Foundation models — large pretrained neural networks incorporating transformer architectures, LLMs, and multimodal representations — are fundamentally changing how robots acquire new capabilities. Drawing on more than 40 patent filings from Google, DeepMind, NVIDIA, Naver, Huawei, and Mitsubishi Electric, this article examines the four dominant technical mechanisms enabling zero-shot and few-shot robot task generalization.

PatSnap Insights Team Innovation Intelligence Analysts 11 min read
Reviewed by the PatSnap Insights editorial team

The patent landscape: who is building zero-shot robot generalization

Foundation models for zero-shot robot task generalization have attracted sustained, high-volume patent activity from a concentrated group of technology leaders. The dataset surveyed comprises more than 40 filings across multiple jurisdictions — CN, JP, KR, US, EP, WO, and BR — spanning 2018 to 2026, with Google LLC / GDM Holdings accounting for approximately 12 filings, the largest share by a significant margin. Naver / Naver Labs holds 4 filings, while Huawei Technologies, NVIDIA Corporation, and X Development LLC each contribute 3 filings. Academic institutions — including Tsinghua University, Nanjing University, and Zhejiang University — are active alongside emerging industrial filers such as Ping An Technology and Beijing Humanoid Robot Innovation Center.

40+ patent filings surveyed (2018–2026)
~12 filings from Google LLC / GDM Holdings, the largest assignee
7 patent jurisdictions (CN, JP, KR, US, EP, WO, BR)
4 core technical mechanisms identified

The dominant technical approaches fall into four categories: (1) transformer-based foundation models pre-trained on large demonstration datasets and adapted via meta-learning or in-context conditioning; (2) large language model grounding for free-form natural language instruction following; (3) hierarchical policy architectures where a high-level foundation model generates semantic plans executed by lower-level controllers; and (4) domain adaptation and sim-to-real transfer methods that leverage pretrained representations to bridge simulation and physical environments.

Figure 1 — Patent filing volume by assignee: foundation models for zero-shot robot task generalization (2018–2026)
[Bar chart, filings per assignee: Google / GDM Holdings ~12; Naver / Naver Labs 4; Huawei Technologies 3; NVIDIA Corporation 3; X Development LLC 3.]
Google LLC / GDM Holdings leads the dataset with approximately 12 filings, spanning LLM grounding, multi-modal embeddings, hierarchical navigation, and meta-learning — representing the broadest foundation model ecosystem for robotics among all assignees surveyed.
What is zero-shot robot task generalization?

Zero-shot generalization refers to a robot’s ability to perform tasks it was never explicitly trained on, by leveraging representations and priors encoded in a large foundation model during pretraining. The model adapts its policy outputs dynamically based on minimal or no new task-specific data at inference time — without retraining.

Transformer architectures and meta-learning as the generalization backbone

Transformer-based foundation models pre-trained across diverse tasks, combined with meta-learning protocols, form the primary technical backbone for zero-shot robot generalization. Naver Labs has been particularly active in this space, with two closely related patents on transformer-based meta-imitation learning. The core system comprises a transformer-architecture model trained using a two-phase meta-learning procedure: a “meta-train” phase that uses a first set of demonstrations for each training task, and an “optimize” phase that uses a second set. Critically, the number of demonstrations per training task is constrained to more than one but fewer than a predetermined number (explicitly fewer than 10), enforcing a few-shot regime that forces the model to generalise across task structures rather than memorise specific demonstrations.
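
As a rough illustration of the data constraint described above, the sketch below checks the few-shot bound and splits each task's demonstrations into a "meta-train" subset and an "optimize" subset. The helper name `split_demonstrations`, the even split, and the assumption that the two subsets are disjoint are illustrative choices, not details taken from the patents.

```python
import random
from typing import Dict, List, Sequence, Tuple

MAX_DEMOS_EXCLUSIVE = 10  # the article states fewer than 10 demonstrations per task


def split_demonstrations(
    demos_by_task: Dict[str, Sequence[str]],
    seed: int = 0,
) -> Dict[str, Tuple[List[str], List[str]]]:
    """Split each task's demos into (meta_train, optimize) subsets.

    Hypothetical helper: the even, disjoint split is an assumption.
    """
    rng = random.Random(seed)
    splits = {}
    for task, demos in demos_by_task.items():
        n = len(demos)
        # Enforce the few-shot regime: more than one, fewer than ten.
        if not 1 < n < MAX_DEMOS_EXCLUSIVE:
            raise ValueError(f"task {task!r} has {n} demos; need 2 to 9")
        shuffled = list(demos)
        rng.shuffle(shuffled)
        cut = n // 2  # n >= 2, so both subsets are non-empty
        splits[task] = (shuffled[:cut], shuffled[cut:])
    return splits


splits = split_demonstrations({"open_drawer": ["d1", "d2", "d3"], "push_block": ["a", "b"]})
```

Each task ends up with a small first set for the inner adaptation step and a small second set for the outer update, which is the general shape of a two-phase meta-learning loop.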

Naver Labs’ transformer-based meta-imitation learning system limits each training task to fewer than a predetermined number of demonstrations (fewer than 10), forcing the model to learn task-transferable representations rather than memorising specific demonstrations. This constraint is a key mechanism for zero-shot robot task generalization.

Google LLC reinforces this direction through meta-imitation and meta-reinforcement learning. Google’s 2025 patent discloses a framework that jointly trains a meta-learning model using both imitation learning (from human-guided demonstrations) and reinforcement learning (from robot trial episodes). The trained meta-learning model can then perform single-shot or few-shot learning on new tasks by conditioning on new demonstrations at inference time. This combination — where pretraining encodes structural task priors and meta-learning enables rapid conditioning on new task context — is characteristic of the zero-shot generalization paradigm: the model does not need to be retrained but rather adapts its policy outputs dynamically based on minimal new evidence.


The use of a universal tokenized action interface further extends transformer generalization. GDM Holdings’ 2026 patent describes a sequence-modeling neural network — specifically a large transformer model — conditioned jointly on tokenized goal images and tokenized observation images. The architecture produces discrete output tokens from a shared vocabulary, allowing the same model to execute multiple distinct manipulation tasks across varying robot morphologies, degrees of freedom, and control frequencies without task-specific retraining. The patent explicitly notes that the system constitutes a “self-improving general policy” as opposed to task-specific controllers, a defining feature of zero-shot generalization.
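
One common way to obtain discrete action tokens from a shared vocabulary is uniform binning of each continuous action dimension. The sketch below shows that idea only; the bin count, the value range, and the function names are assumptions, and the article does not say which tokenization scheme the patent uses.

```python
from typing import List, Sequence

NUM_BINS = 256  # assumed vocabulary size, shared by every action dimension


def tokenize_action(action: Sequence[float], low: float = -1.0, high: float = 1.0) -> List[int]:
    """Map each continuous action value to an integer token in [0, NUM_BINS - 1]."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)
        fraction = (clipped - low) / (high - low)
        # The top edge of the range falls into the last bin.
        tokens.append(min(int(fraction * NUM_BINS), NUM_BINS - 1))
    return tokens


def detokenize_action(tokens: Sequence[int], low: float = -1.0, high: float = 1.0) -> List[float]:
    """Map tokens back to the centre of their bins."""
    width = (high - low) / NUM_BINS
    return [low + (token + 0.5) * width for token in tokens]


arm_tokens = tokenize_action([0.3, -0.7, 0.0, 1.0, -1.0, 0.5, 0.1])  # 7-DoF arm
gripper_tokens = tokenize_action([0.8])                               # 1-DoF gripper
```

Because every dimension draws on the same vocabulary, robots with different numbers of degrees of freedom simply emit token sequences of different lengths.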

Figure 2 — Four core technical mechanisms enabling zero-shot robot task generalization
[Diagram: four mechanisms feeding into zero-shot generalization. Transformer meta-learning (Naver, Google); LLM grounding (Google, GDM); hierarchical policy (Google, Mitsubishi); sim-to-real transfer (DeepMind, Huawei).]
The four mechanisms are complementary rather than mutually exclusive — most deployed systems combine transformer meta-learning with at least one of LLM grounding, hierarchical policy separation, or sim-to-real adaptation.

LLM grounding: natural language as a zero-shot task interface

Large language models enable zero-shot task generalization by grounding free-form natural language instructions to robot skills without requiring per-task training data. Google LLC’s 2025 patent on natural language robot control discloses a two-stage grounding process: the LLM first processes a free-form natural language instruction to generate a probability distribution over possible interpretations — the “task-grounding measure” — and then the system cross-references current environmental state data to determine a “world-grounding measure” reflecting the probability of the skill being successful given the current scene. A robot skill is executed only when both grounding measures jointly satisfy a threshold.
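
The dual-grounding gate can be sketched in a few lines. Here the two measures are combined by multiplication and compared with a single threshold; the product rule, the threshold value, and the name `select_skill` are assumptions for illustration, since the article only says the measures must "jointly satisfy a threshold".

```python
from typing import Dict, Optional


def select_skill(
    task_grounding: Dict[str, float],
    world_grounding: Dict[str, float],
    threshold: float = 0.25,
) -> Optional[str]:
    """Return the best-scoring skill, or None if no skill clears the threshold.

    task_grounding:  probability that each skill matches the instruction.
    world_grounding: probability that each skill succeeds in the current scene.
    """
    best_skill, best_score = None, 0.0
    for skill, p_task in task_grounding.items():
        # Assumed combination rule: product of the two grounding measures.
        score = p_task * world_grounding.get(skill, 0.0)
        if score > best_score:
            best_skill, best_score = skill, score
    return best_skill if best_score >= threshold else None


task = {"pick up sponge": 0.7, "pick up apple": 0.2, "go to sink": 0.1}
world = {"pick up sponge": 0.1, "pick up apple": 0.9, "go to sink": 0.8}
chosen = select_skill(task, world, threshold=0.15)
```

In this toy scene the sponge is the most likely reading of the instruction but is unlikely to be reachable, so the combined score favours a different skill, and a stricter threshold would cause the robot to execute nothing at all.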

Google LLC’s natural language robot control patent (2025) uses a dual-grounding mechanism: a task-grounding measure (probability distribution over instruction interpretations) and a world-grounding measure (probability of skill success given the current scene), executing a skill only when both jointly satisfy a threshold — enabling zero-shot robot task generalization without per-task training data.

GDM Holdings advances this direction in its 2025 patent on fast and slow adaptation for language model predictive control, which introduces an iterative in-context learning loop between a human and a robot. A generative foundation model processes human natural language feedback to predict candidate future dialog turns — each including a predicted robot action and a predicted human response — and selects actions from this candidate future. The interaction session is subsequently used as fine-tuning data for the generative model, creating a “fast and slow” adaptation dynamic: rapid in-context adaptation during deployment and slower fine-tuning between sessions. This architecture allows a robot to be taught novel tasks through conversational interaction without manual policy redesign.

Chain-of-thought reasoning has also been proposed as a mechanism to bridge high-level language instructions and low-level robot control. A 2026 patent from Beijing Jijia Vision Technology describes constructing a chain-of-reasoning dataset to fine-tune a pretrained policy network, producing a strategy prediction model that outputs both interpretable chain-of-reasoning text and executable action data. The reward functions used during joint optimisation — affordance reward, trajectory consistency reward, and output format reward — are specifically designed to promote cross-task and cross-scene generalization, reducing error rates in multi-object or complex instruction scenarios. As noted by IEEE, chain-of-thought approaches have become a leading technique for improving reliability in language-conditioned robot control systems.

Key finding

LLMs enable zero-shot instruction grounding by mapping free-form natural language to feasible robot skills without per-task training data. Google LLC’s dual task-grounding and world-grounding measures provide a principled threshold mechanism that prevents the robot from attempting skills that are linguistically plausible but physically infeasible in the current scene.

Hierarchical policy architectures: decoupling planning from motor execution

Hierarchical policy architectures enable task generalization by separating a foundation model operating at a high semantic level from low-level controllers that execute fine-grained motor primitives — allowing the high-level model to adapt to new goals without retraining the entire policy stack. Google LLC’s 2025 patent on robot navigation formalises this two-tier structure: a high-level policy model trained via supervised learning on real-world observations and ground-truth navigation paths generates semantic navigation decisions, while a separately trained low-level policy model converts these high-level actions into precise, obstacle-avoiding motor commands.

This paradigm is extended to manipulation tasks in Google LLC’s 2024 patent on environment-conditioned action sequences, where an environment-conditioned action sequence prediction model — combining a CNN for visual encoding and a sequence-to-sequence transformer for action planning — determines not only which actions to perform but also their ordering, conditioned on the current visual scene. The system’s predicted action sequence dynamically changes based on the observed state (for example, whether a cabinet is open or closed), demonstrating the foundation model’s ability to reason about preconditions and adapt task plans to new environmental configurations without explicit task-specific programming.

Mitsubishi Electric’s 2025 patent on sequential task generalization uses a pre-trained learning module encoding a dictionary of motion primitives combined with a graph-search-based planning module, enabling compositional zero-shot generalization by recombining existing skill primitives in new sequences rather than training task-specific policies from scratch.

Mitsubishi Electric’s 2025 patent addresses long-horizon sequential task generalization through a pre-trained learning module that encodes a dictionary of motion primitives, combined with a graph-search-based planning module. Given a new task defined only by initial and goal states, the system searches the graph for a feasible path and selects motion primitives from the dictionary — parameterized as Dynamic Movement Primitives (DMPs) — to compose the task trajectory. This approach generalizes to new tasks by recombining existing skill primitives in new sequences rather than training task-specific policies from scratch, a form of compositional zero-shot generalization. This compositional approach is consistent with broader trends in modular robot learning discussed by WIPO in its technology trend reports on AI and robotics.
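
The planning step can be illustrated with a breadth-first search over a small state graph whose edges are labelled with motion primitives. The graph contents, state names, and the function `plan_primitives` are hypothetical, and the primitives are plain labels here rather than parameterized DMPs.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# Hypothetical state graph: state -> list of (motion primitive, next state).
Graph = Dict[str, List[Tuple[str, str]]]


def plan_primitives(graph: Graph, start: str, goal: str) -> Optional[List[str]]:
    """Breadth-first search for the shortest primitive sequence from start to goal."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        state, plan = queue.popleft()
        if state == goal:
            return plan
        for primitive, next_state in graph.get(state, []):
            if next_state not in visited:
                visited.add(next_state)
                queue.append((next_state, plan + [primitive]))
    return None  # no feasible path between the two states


graph: Graph = {
    "cup_on_table": [("reach", "gripper_at_cup")],
    "gripper_at_cup": [("grasp", "cup_held"), ("retreat", "cup_on_table")],
    "cup_held": [("lift_and_move", "cup_over_shelf")],
    "cup_over_shelf": [("place", "cup_on_shelf")],
}
plan = plan_primitives(graph, "cup_on_table", "cup_on_shelf")
```

A new task is specified only by its start and goal states; the plan is assembled from primitives that already exist in the dictionary, which is the compositional element of the approach.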


Sim-to-real transfer and domain adaptation

Sim-to-real transfer remains a primary engineering challenge for deploying foundation models on physical robots: simulation enables large-scale training, but the domain gap between simulated and real environments can cause policies that perform well in simulation to fail in deployment. Multiple patent approaches address this bottleneck directly. The National University of Defense Technology’s 2024 patent trains a semantic abstraction neural network using adversarial domain adaptation to align simulated and real images in a shared semantic feature space, then trains a reinforcement learning policy entirely in simulation before deploying it to the real robot. By reducing the original high-dimensional image space to a semantic representation, the method also decreases state space complexity and improves transfer efficiency.

Google LLC’s 2024 patent on robot action ML model training using multi-modal embeddings addresses the sim-to-real gap through feature-level domain adaptation during training: a variational information bottleneck (VIB) objective is applied to the encoder layers of the robot action ML model to force domain-invariant feature extraction. Multiple modality-specific action models are trained in parallel, and their outputs are dynamically weighted based on analysis of the embeddings generated during inference, allowing the system to adapt its sensory fusion strategy to environmental conditions at deployment time. Research published through Nature has similarly highlighted the variational information bottleneck as a principled approach to learning representations that generalise across domains.
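
For reference, the variational information bottleneck objective as it usually appears in the research literature is given below. Whether the patent uses exactly this form is not stated in the article, and the symbols are assumptions: x is an observation, y the target action, z the encoder's stochastic embedding, r(z) a fixed prior such as a standard normal, and β a trade-off weight.

```latex
\[
\mathcal{L}_{\mathrm{VIB}}
  = \mathbb{E}_{(x,\,y)}\,\mathbb{E}_{z \sim p_\theta(z \mid x)}
      \bigl[-\log q_\phi(y \mid z)\bigr]
  + \beta\,\mathbb{E}_{x}\,
      \mathrm{KL}\bigl(p_\theta(z \mid x)\,\|\,r(z)\bigr)
\]
```

The first term rewards embeddings that predict the action well; the second penalises information in z that is specific to the input, which is what pushes the encoder toward features shared by simulated and real observations.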

DeepMind’s approach, described in two related patents, chains a simulation-trained deep neural network (with fixed weights) to a real-world-trained DNN. The simulation-trained model provides learned features that are shared with the real-world DNN, which is initialised to replicate the simulation policy and then fine-tuned with real-world experience while keeping the simulation network frozen. This architectural constraint prevents catastrophic forgetting of simulation-derived representations, enabling the pretrained foundation to serve as a stable generalization backbone across real-world task variants.
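
The freeze-and-fine-tune pattern can be shown with a deliberately tiny linear policy. This is a simplified sketch under stated assumptions: the class name `SimToRealPolicy` is hypothetical, the policy is linear, and the sharing of intermediate features between the two networks described in the patents is omitted.

```python
from typing import List, Sequence


def linear(weights: Sequence[float], features: Sequence[float]) -> float:
    """Dot product standing in for a policy network's forward pass."""
    return sum(w * f for w, f in zip(weights, features))


class SimToRealPolicy:
    """Frozen simulation policy plus a trainable real-world copy."""

    def __init__(self, sim_weights: Sequence[float]) -> None:
        self.sim_weights = tuple(sim_weights)  # frozen: never updated
        self.real_weights = list(sim_weights)  # starts as a replica of the sim policy

    def sim_act(self, features: Sequence[float]) -> float:
        return linear(self.sim_weights, features)

    def act(self, features: Sequence[float]) -> float:
        return linear(self.real_weights, features)

    def real_world_update(self, features: Sequence[float], target: float, lr: float = 0.1) -> None:
        """One squared-error gradient step on the real-world weights only."""
        error = self.act(features) - target
        self.real_weights = [w - lr * error * f for w, f in zip(self.real_weights, features)]


policy = SimToRealPolicy([0.5, -0.2])
observation = [1.0, 2.0]
policy.real_world_update(observation, target=1.0)
```

After the update the real-world policy has moved toward the real-world target while the simulation weights are untouched, so the simulation-derived behaviour remains available as a stable reference.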

Huawei Technologies contributes a complementary approach through reusable skill options. Their 2023 and 2025 patents describe an RL agent that extracts minimally correlated feature-based pseudo-rewards from a learned policy, trains corresponding option sub-policies to maximise each feature reward, and then reuses these learned options as compositional primitives when learning new tasks. This mechanism provides a structured form of knowledge transfer that avoids retraining from scratch for each new task, functioning as a form of zero-shot initialisation for novel task policies. The OECD’s AI policy framework has identified knowledge transfer and compositional skill reuse as critical capabilities for scalable industrial robot deployment.


NVIDIA Corporation has prioritised sample-efficient model-based and model-free hybrid approaches through its Guided Uncertainty-Aware Policy Optimisation patents (2021 and 2024), where perceptual uncertainty drives dynamic switching between model-based and model-free strategies to handle novel environments efficiently. X Development LLC contributes pre-trained model deployment for physical environment understanding, with a 2022 patent in which a pretrained model infers candidate robot base positions from height maps without task-specific retraining — demonstrating that even spatial configuration tasks can be addressed through foundation model inference rather than task-specific learning.
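
A minimal sketch of uncertainty-driven switching follows. It assumes the perception system reports a target estimate with an uncertainty radius, and that the learned model-free policy takes over inside that radius while a model-based controller is used outside it. The direction of the switch, the radius representation, and the name `choose_strategy` are assumptions and are not confirmed by the article.

```python
import math
from typing import Sequence


def choose_strategy(
    position: Sequence[float],
    target_estimate: Sequence[float],
    uncertainty_radius: float,
) -> str:
    """Pick a control strategy from the distance to an uncertain target estimate."""
    distance = math.dist(position, target_estimate)
    # Assumed rule: the perception estimate is trusted for coarse motion,
    # but not inside its own uncertainty region.
    return "model_free" if distance <= uncertainty_radius else "model_based"


far = choose_strategy((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), uncertainty_radius=0.2)
near = choose_strategy((0.9, 0.0, 0.0), (1.0, 0.0, 0.0), uncertainty_radius=0.2)
```

The appeal of this kind of split is sample efficiency: the learned policy only has to cover the small region where the model-based estimate is unreliable.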

For R&D teams tracking this space, the PatSnap R&D intelligence platform and PatSnap IP analytics provide structured access to the full patent datasets described in this article, including cross-jurisdictional filing analysis and assignee benchmarking.


References

  1. Systems and Methods for Learning Sequences in Robot Tasks to Generalize to New Tasks — Mitsubishi Electric Corporation, 2025
  2. Systems and Methods for Selecting Actions — DeepMind Technologies (Yuanhui Technology), 2023
  3. Robot Navigation Using High-Level Policy Model and Trained Low-Level Policy Model — Google LLC, 2025
  4. Robot Action ML Model Training Using Multi-Modal Embeddings — Google LLC, 2024
  5. Model Training Method, Robot Control Method, Apparatus, and Electronic Device — Beijing Jijia Vision Technology Co., Ltd., 2026
  6. Transformer-based meta-imitation learning for robots — Naver Labs Corporation, 2023
  7. Natural Language Control of a Robot — Google LLC, 2025
  8. Training System, Method and Navigation Robot for Visual Navigation — Naver Corporation, 2022
  9. Fast and Slow Adaptation for Language Model Predictive Control and/or Guidance — GDM Holding LLC, 2025
  10. Deep Machine Learning Method and Apparatus for Robotic Grasping — Google LLC, 2019
  11. Efficient Adaptation of Robot Control Policies for New Tasks Using Meta-Imitation Learning and Meta-Reinforcement Learning — Google LLC, 2025
  12. Semantic Domain Adaptation-Based Sim-to-Real Transfer Learning Method and System for Robot Skills — National University of Defense Technology, 2024
  13. Systems and Methods for Selecting Actions — DeepMind Technologies (Yuanhui Technology), 2026
  14. Guided Uncertainty-Aware Policy Optimization: Combining Model-Free and Model-Based Strategies for Sample-Efficient Learning — NVIDIA Corporation, 2021
  15. Guided Uncertainty-Aware Policy Optimization: Combining Model-Free and Model-Based Strategies for Sample-Efficient Learning — NVIDIA Corporation, 2024
  16. Systems and Methods for Learning Reusable Options to Transfer Knowledge Between Tasks — Huawei Technologies Co., Ltd., 2023
  17. Systems and Methods for Learning Reusable Options to Transfer Knowledge Between Tasks — Huawei Technologies Co., Ltd., 2025
  18. Controlling Agents Using Tokenized Goal Images — GDM Holdings LLC, 2026
  19. Determining Environment-Conditioned Action Sequences for Robotic Tasks — Google LLC, 2024
  20. Transformer-based meta-imitation learning of robots — Naver Corporation, 2024
  21. Robot transformer-based meta-imitation learning — Naver Labs Corporation, 2022
  22. Methods and Systems for Support Policy Learning — Huawei Technologies Co., Ltd., 2021
  23. Robot Base Position Planning — X Development LLC, 2022
  24. Self-Supervised Robotic Object Interaction — Google LLC, 2024
  25. WIPO — World Intellectual Property Organization: Technology Trends in AI and Robotics
  26. IEEE — Institute of Electrical and Electronics Engineers: Robotics and Automation Research
  27. Nature — Variational Information Bottleneck and Domain-Invariant Representations
  28. OECD — AI Policy Framework: Knowledge Transfer and Compositional Skill Reuse in Robotics

All data and statistics in this article are sourced from the references above and from PatSnap’s proprietary innovation intelligence platform.
