Foundation Models for Zero-Shot Robot Task Generalization

Foundation models are reshaping what robots can do — not by training them on every possible task, but by giving them the capacity to reason, adapt, and act in situations they have never encountered before. This analysis draws on more than 40 patent filings from Google, Naver, NVIDIA, Huawei, and others to map the mechanisms driving zero-shot robot task generalization.

PatSnap Insights Team, Innovation Intelligence Analysts · 9 min read
Reviewed by the PatSnap Insights editorial team

The patent landscape: who is building zero-shot robot generalization

The foundation model robotics patent landscape is dominated by a small number of technology companies making large, coordinated bets. Based on a dataset of more than 40 filings across CN, JP, KR, US, EP, WO, and BR jurisdictions spanning 2018 to 2026, Google LLC / GDM Holdings leads with approximately 12 filings — covering LLM grounding, multi-modal embeddings, hierarchical navigation, meta-learning, and environment-conditioned action sequencing. This breadth signals a deliberate strategy to own the full foundation model stack for robotics, not merely a single technique.

Patent filings analysed (2018–2026): 40+
Google LLC / GDM Holdings filings: ~12
Jurisdictions (CN, JP, KR, US, EP, WO, BR): 7+
Dominant technical approaches identified: 4

Beyond Google, Naver / Naver Labs has filed 4 patents, establishing a focused IP position around transformer-based meta-imitation learning, with filings active in both Japan and Korea. Huawei Technologies, NVIDIA Corporation, and X Development LLC each account for 3 filings. Academic and industrial filers, including Tsinghua University, Nanjing University, Zhejiang University, Ping An Technology, and Beijing Humanoid Robot Innovation Center, are contributing innovations in simulation fine-tuning, state representation learning, and cross-modal world models, indicating a broadening research ecosystem.

Figure 1 — Patent filing volume by assignee: foundation models for zero-shot robot generalization (2018–2026)
[Bar chart: patent filings by assignee. Google / GDM: ~12; Naver / Naver Labs: 4; Huawei Technologies: 3; NVIDIA Corporation: 3; X Development LLC: 3.]
Google LLC / GDM Holdings leads the dataset with approximately 12 filings; Naver, Huawei, NVIDIA, and X Development each hold 3–4 filings, with a long tail of academic and industrial contributors.

The dominant technical approaches fall into four categories: transformer-based foundation models adapted via meta-learning; LLM grounding for natural language instruction following; hierarchical policy architectures; and domain adaptation for sim-to-real transfer. Each is examined in the sections below.


Transformer architectures and meta-learning: the generalization backbone

Transformer-based foundation models pre-trained across diverse tasks are the dominant backbone for zero-shot robot generalization because they encode structural task priors that allow rapid policy adaptation from minimal new evidence at inference time, without retraining. Naver Labs has been particularly active in this space, filing two closely related patents on transformer-based meta-imitation learning. The core system comprises a transformer model trained using a two-phase meta-learning procedure: a “meta-train” phase using first demonstrations of each training task, and an “optimize” phase using second demonstrations. Critically, the number of demonstrations per training task is constrained to be more than one but fewer than a predetermined number, which the filings set explicitly at fewer than 10. This enforces a few-shot regime that pushes the model to generalise across task structures rather than memorise specific demonstrations.
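The two-phase split and the demonstration-count constraint described above can be sketched roughly as follows. This is an illustrative sketch only, not the procedure claimed in the Naver filings: the function names, the even split between phases, and the use of string identifiers for demonstrations are all assumptions made for the example.

```python
import random
from typing import Dict, List, Sequence, Tuple

# Upper bound taken from the "fewer than 10" wording in the text.
MAX_DEMOS_PER_TASK = 10


def split_demonstrations(
    demos: Sequence[str], rng: random.Random
) -> Tuple[List[str], List[str]]:
    """Split one task's demonstrations into (meta-train, optimize) sets."""
    n = len(demos)
    # Few-shot constraint: more than one, fewer than ten demonstrations.
    if not 1 < n < MAX_DEMOS_PER_TASK:
        raise ValueError(f"expected 2..{MAX_DEMOS_PER_TASK - 1} demos, got {n}")
    shuffled = list(demos)
    rng.shuffle(shuffled)
    half = n // 2  # hypothetical even split; the filings may divide differently
    return shuffled[:half], shuffled[half:]


def build_meta_batches(
    tasks: Dict[str, Sequence[str]], seed: int = 0
) -> Dict[str, Tuple[List[str], List[str]]]:
    """Apply the split to every training task."""
    rng = random.Random(seed)
    return {name: split_demonstrations(d, rng) for name, d in tasks.items()}
```

A training loop would then use the first element of each pair for the meta-train phase and the second for the optimize phase; tasks with a single demonstration, or with ten or more, are rejected up front.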

What is meta-imitation learning?

Meta-imitation learning trains a model on a diverse set of tasks so that it learns how to learn from demonstrations, rather than learning any single task. At deployment, the model conditions on a small number of new demonstrations to rapidly adapt its policy to an unseen task — without gradient updates or retraining.

Google LLC reinforces this direction through a framework that jointly trains a meta-learning model using both imitation learning (from human-guided demonstrations) and reinforcement learning (from robot trial episodes). The trained meta-learning model can then perform single-shot or few-shot learning on new tasks by conditioning on new demonstrations at inference time. This combination — pretraining encodes structural task priors, meta-learning enables rapid conditioning on new task context — is characteristic of the zero-shot generalization paradigm: the model adapts its policy outputs dynamically based on minimal new evidence, without being retrained.

“A single large transformer model conditioned on tokenized goal images can execute multiple distinct manipulation tasks across different robot morphologies and control frequencies — without task-specific retraining.”

The use of a universal tokenized action interface extends transformer generalization further. GDM Holdings’ 2026 patent describes a sequence-modeling neural network — specifically a large transformer model — conditioned jointly on tokenized goal images and tokenized observation images. The architecture produces discrete output tokens from a shared vocabulary, allowing the same model to execute multiple distinct manipulation tasks (varying robot morphologies, degrees of freedom, control frequencies) without task-specific retraining. The patent explicitly characterises the system as a “self-improving general policy” as opposed to task-specific controllers — a defining feature of zero-shot generalization.
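One common way to obtain discrete action tokens from a shared vocabulary is uniform binning of each continuous action dimension. The sketch below illustrates that idea under stated assumptions; it is not taken from the GDM Holdings filing, and the vocabulary size, value range, and function names are hypothetical.

```python
from typing import List, Sequence

VOCAB_SIZE = 256  # hypothetical number of discrete action bins


def tokenize_action(
    values: Sequence[float], low: float, high: float, vocab_size: int = VOCAB_SIZE
) -> List[int]:
    """Map each continuous action dimension to a token id in [0, vocab_size)."""
    tokens = []
    for v in values:
        clipped = min(max(v, low), high)
        frac = (clipped - low) / (high - low)
        # The top edge of the range falls into the last bin.
        tokens.append(min(int(frac * vocab_size), vocab_size - 1))
    return tokens


def detokenize_action(
    tokens: Sequence[int], low: float, high: float, vocab_size: int = VOCAB_SIZE
) -> List[float]:
    """Recover the bin-centre value for each token."""
    width = (high - low) / vocab_size
    return [low + (t + 0.5) * width for t in tokens]
```

Because every dimension maps into the same token range, a 7-DoF arm and a 2-DoF gripper produce sequences of different lengths over one vocabulary, which is the property that lets a single sequence model serve different morphologies.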

Figure 2 — Zero-shot robot generalization: four dominant technical mechanisms
[Diagram: four mechanisms feeding into zero-shot robot generalization. Transformer meta-learning (Naver, Google); LLM grounding (Google, GDM); hierarchical policies (Google, Mitsubishi); sim-to-real transfer (Google, DeepMind).]
The four dominant technical mechanisms identified across 40+ patent filings form a pipeline from pretrained representations to zero-shot deployment on physical robots.

Naver Labs’ transformer-based meta-imitation learning patents constrain training to fewer than 10 demonstrations per task, enforcing a few-shot regime that forces the model to learn task-transferable policies rather than memorise specific demonstrations — a key mechanism for zero-shot robot task generalization.

LLM grounding and natural language task specification

Large language models enable zero-shot robot task generalization by mapping free-form natural language instructions to executable robot skills without requiring per-task training data — the broad linguistic and world knowledge encoded during LLM pretraining substitutes for task-specific demonstration collection. Google LLC’s patent on Natural Language Control of a Robot discloses a two-stage grounding process: the LLM first processes a free-form natural language instruction to generate a probability distribution over possible interpretations (the “task-grounding measure”), and then the system cross-references current environmental state data to determine a “world-grounding measure” reflecting the probability of the skill being successful given the current scene. A robot skill is executed only when both grounding measures jointly satisfy a threshold — a dual-grounding mechanism that prevents hallucinated or unsafe actions in novel environments.
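The dual-grounding gate can be illustrated with a small sketch. The text says only that the two measures must jointly satisfy a threshold; multiplying them is an assumption made here for illustration, as are the function name, the dictionary inputs, and the default threshold.

```python
from typing import Dict, Optional


def select_skill(
    task_grounding: Dict[str, float],
    world_grounding: Dict[str, float],
    threshold: float = 0.3,
) -> Optional[str]:
    """Return the best skill, or None if no skill clears the joint threshold.

    task_grounding:  skill -> probability that the instruction means this skill
    world_grounding: skill -> probability that the skill succeeds in the scene
    """
    best_skill, best_score = None, 0.0
    for skill, p_task in task_grounding.items():
        p_world = world_grounding.get(skill, 0.0)
        joint = p_task * p_world  # assumed combination rule (product)
        if joint > best_score:
            best_skill, best_score = skill, joint
    # Refuse to act when even the best candidate is weakly grounded.
    return best_skill if best_score >= threshold else None
```

In this sketch a skill that matches the instruction well but is unlikely to succeed in the current scene (for example, picking up an object that is not visible) scores low and is not executed, which mirrors the safety rationale given above.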

GDM Holdings advances this direction with a “fast and slow” adaptation architecture. A generative foundation model processes human natural language feedback to predict candidate future dialog turns, each including a predicted robot action and a predicted human response, and selects actions from these candidate futures. The interaction session is then used as fine-tuning data for the generative model, creating rapid in-context adaptation during deployment and slower fine-tuning between sessions. This architecture allows a robot to be taught novel tasks through conversational interaction without manual policy redesign, as described in PatSnap’s analysis of the GDM Holdings filing.

Chain-of-thought reasoning has also emerged as a mechanism to bridge high-level language instructions and low-level robot control. A 2026 patent from Beijing Jijia Vision Technology describes constructing a chain-of-reasoning dataset to fine-tune a pretrained policy network, producing a strategy prediction model that outputs both interpretable chain-of-reasoning text and executable action data. The reward functions used during joint optimization (affordance reward, trajectory consistency reward, and output format reward) are specifically designed to promote cross-task and cross-scene generalization. According to WIPO’s analysis of AI patent trends, natural language grounding for robotic control is among the fastest-growing sub-categories in the broader AI patent space.

Key finding

Google LLC’s dual-grounding mechanism — combining a task-grounding measure (probability distribution over instruction interpretations) and a world-grounding measure (probability of skill success given the current scene) — enables zero-shot LLM-driven robot control without any per-task training data. A skill executes only when both measures jointly satisfy a threshold.


Hierarchical policy architectures: separating planning from execution

Hierarchical policy architectures enable zero-shot task generalization by decoupling semantic planning — where foundation models excel — from fine-grained motor execution, so that the high-level model can adapt to new goals without retraining the entire policy stack. Google LLC’s patent on Robot Navigation Using High-Level Policy Model and Trained Low-Level Policy Model formalises this two-tier structure for mobile robot navigation: a high-level policy model trained via supervised learning on real-world observations and ground-truth navigation paths generates semantic navigation decisions, while a separately trained low-level policy model converts these high-level actions into precise, obstacle-avoiding motor commands.

This paradigm is extended to manipulation in Google LLC’s patent on Determining Environment-Conditioned Action Sequences for Robotic Tasks, where a model combining a CNN for visual encoding and a sequence-to-sequence transformer for action planning determines not only which actions to perform but also their ordering, conditioned on the current visual scene. The system’s predicted action sequence dynamically changes based on observed state — for example, whether a cabinet is open or closed — demonstrating the foundation model’s ability to reason about preconditions and adapt task plans to new environmental configurations without explicit task-specific programming. Research published by IEEE on robot learning architectures similarly highlights hierarchical decomposition as a key enabler of generalisation beyond training distributions.

Mitsubishi Electric addresses long-horizon sequential task generalization through a pre-trained learning module that encodes a dictionary of motion primitives, combined with a graph-search-based planning module. Given a new task defined only by initial and goal states, the system searches the graph for a feasible path and selects motion primitives — parameterized as Dynamic Movement Primitives (DMPs) — to compose the task trajectory. This approach generalises to new tasks by recombining existing skill primitives in new sequences rather than training task-specific policies from scratch: a form of compositional zero-shot generalization. Huawei Technologies takes a related approach through reusable skill options, training an RL agent to extract minimally correlated feature-based pseudo-rewards and corresponding option sub-policies that can be reused as compositional primitives when learning new tasks.
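The idea of composing a new task from existing primitives via graph search can be sketched with a breadth-first search over symbolic states. This is a simplified illustration, not the Mitsubishi Electric method: the symbolic state representation, the `Primitive` fields, and the `plan` function are assumptions, and the DMP parameterisation of each primitive is omitted.

```python
from collections import deque
from typing import FrozenSet, List, NamedTuple, Optional

State = FrozenSet[str]


class Primitive(NamedTuple):
    """Hypothetical symbolic wrapper around one learned motion primitive."""
    name: str
    preconditions: FrozenSet[str]
    adds: FrozenSet[str]
    removes: FrozenSet[str]


def plan(
    initial: State, goal: State, primitives: List[Primitive]
) -> Optional[List[str]]:
    """Breadth-first search for a primitive sequence from initial to goal."""
    frontier = deque([(initial, [])])
    visited = {initial}
    while frontier:
        state, path = frontier.popleft()
        if goal <= state:  # every goal fact holds in this state
            return path
        for p in primitives:
            if p.preconditions <= state:
                nxt = (state - p.removes) | p.adds
                if nxt not in visited:
                    visited.add(nxt)
                    frontier.append((nxt, path + [p.name]))
    return None  # no feasible composition of known primitives
```

A new task is specified only by its initial and goal states; the planner returns an ordering of already-learned primitives, or `None` when the dictionary cannot reach the goal.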


Mitsubishi Electric’s 2025 patent on learning sequences in robot tasks describes a pre-trained dictionary of Dynamic Movement Primitives (DMPs) combined with a graph-search planning module, enabling zero-shot generalization to new tasks by recombining existing skill primitives in new sequences — without training task-specific policies from scratch.

Sim-to-real transfer: closing the deployment gap for foundation model robotics

Sim-to-real transfer remains the primary engineering bottleneck for deploying foundation models on physical robots, because large-scale training is only feasible in simulation — yet differences in visual appearance, physics, and sensor noise can cause significant policy degradation when the robot moves to the real world. The National University of Defense Technology’s 2024 patent addresses this by training a semantic abstraction neural network using adversarial domain adaptation to align simulated and real images in a shared semantic feature space, then training a reinforcement learning policy entirely in simulation before deploying it to the real robot. By reducing the original high-dimensional image space to a semantic representation, the method also decreases state space complexity and improves transfer efficiency.

Google LLC’s patent on Robot Action ML Model Training Using Multi-Modal Embeddings addresses the sim-to-real gap through feature-level domain adaptation: a variational information bottleneck (VIB) objective is applied to encoder layers to force domain-invariant feature extraction. Multiple modality-specific action models are trained in parallel, and their outputs are dynamically weighted based on analysis of embeddings generated during inference, allowing the system to adapt its sensory fusion strategy to environmental conditions at deployment time. The NIST framework for AI robustness testing similarly identifies domain shift as a primary reliability risk for deployed AI systems.
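The dynamic weighting of modality-specific action models can be sketched as a softmax-weighted average of their outputs. This is an illustration under stated assumptions rather than the patented method: the per-modality `reliability` score stands in for whatever embedding analysis the filing performs, and the function names are hypothetical.

```python
import math
from typing import Dict, List


def softmax(scores: Dict[str, float]) -> Dict[str, float]:
    """Numerically stable softmax over a dict of scores."""
    m = max(scores.values())
    exps = {k: math.exp(v - m) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


def fuse_actions(
    actions: Dict[str, List[float]], reliability: Dict[str, float]
) -> List[float]:
    """Blend per-modality action vectors using softmax weights.

    actions:     modality -> action vector proposed by that modality's model
    reliability: modality -> hypothetical score derived from its embedding
    """
    weights = softmax({m: reliability[m] for m in actions})
    dim = len(next(iter(actions.values())))
    return [sum(weights[m] * actions[m][i] for m in actions) for i in range(dim)]
```

When one modality's score drops, for example a camera under poor lighting, its weight shrinks and the fused action leans on the remaining sensors.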

DeepMind (Instinct Technologies / Yuanhui Technology) contributes a complementary architectural constraint: chaining a simulation-trained DNN (with fixed weights) to a real-world-trained DNN. The simulation-trained model provides learned features shared with the real-world DNN, which is initialized to replicate the simulation policy and then fine-tuned with real-world experience while keeping the simulation network frozen. This constraint prevents catastrophic forgetting of simulation-derived representations, enabling the pretrained foundation to serve as a stable generalization backbone across real-world task variants. NVIDIA Corporation prioritises sample efficiency in novel environments through a Guided Uncertainty-Aware Policy Optimization approach, where perceptual uncertainty drives dynamic switching between model-based and model-free strategies — with active filings in both 2021 and 2024 indicating sustained investment. The PatSnap resources hub provides further context on AI and robotics patent strategy.
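The uncertainty-driven switching attributed to NVIDIA above can be sketched as a simple gate between two controllers. This is a minimal sketch assuming a scalar uncertainty estimate and a fixed threshold; the function name, the direction of the switch, and the threshold value are assumptions for illustration and are not taken from the filings.

```python
from typing import Callable, List, Sequence, Tuple

Controller = Callable[[Sequence[float]], List[float]]


def choose_action(
    observation: Sequence[float],
    uncertainty: float,
    model_based: Controller,
    model_free: Controller,
    threshold: float = 0.5,
) -> Tuple[str, List[float]]:
    """Pick a controller based on perceptual uncertainty.

    Assumed rule: trust the model-based controller while perception is
    confident, and hand over to the learned model-free policy otherwise.
    """
    if uncertainty <= threshold:
        return "model_based", model_based(observation)
    return "model_free", model_free(observation)
```

The appeal of this arrangement is sample efficiency: the learned policy only has to cover the region where the perception model cannot be trusted, rather than the whole task.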

Figure 3 — Sim-to-real transfer approaches by assignee and mechanism
[Diagram: sim-to-real transfer mechanisms by assignee. NUDT: semantic domain adaptation (adversarial alignment in a shared feature space); Google: variational information bottleneck (VIB) for domain-invariant feature extraction; DeepMind: frozen simulation DNN chained to a fine-tuned real-world DNN; NVIDIA: uncertainty-aware model-based / model-free switching (GUAPO).]
Four distinct sim-to-real transfer strategies are patented across the dataset, each targeting a different aspect of the domain gap: visual alignment, feature invariance, architectural constraint, and uncertainty-driven exploration.

DeepMind’s sim-to-real transfer architecture (Instinct Technologies / Yuanhui Technology, 2023–2026) chains a simulation-trained DNN with fixed weights to a real-world-trained DNN, preventing catastrophic forgetting of simulation-derived representations while enabling fine-tuning on real-world experience — a key mechanism for deploying foundation models on physical robots.


References

  1. Systems and Methods for Learning Sequences in Robot Tasks to Generalize to New Tasks — Mitsubishi Electric Corporation, 2025
  2. Systems and Methods for Selecting Actions — DeepMind Technologies (Yuanhui Technology), 2023
  3. Robot Navigation Using High-Level Policy Model and Trained Low-Level Policy Model — Google LLC, 2025
  4. Robot Action ML Model Training Using Multi-Modal Embeddings — Google LLC, 2024
  5. Model Training Method, Robot Control Method, Apparatus, and Electronic Device — Beijing Jijia Vision Technology Co., Ltd., 2026
  6. Transformer-based meta-imitation learning for robots — Naver Labs Corporation, 2023
  7. Natural Language Control of a Robot — Google LLC, 2025
  8. Fast and Slow Adaptation for Language Model Predictive Control and/or Guidance — GDM Holdings LLC, 2025
  9. Deep Machine Learning Method and Apparatus for Robotic Grasping — Google LLC, 2019
  10. Efficient Adaptation of Robot Control Policies for New Tasks Using Meta-Imitation Learning and Meta-Reinforcement Learning — Google LLC, 2025
  11. Semantic Domain Adaptation-Based Sim-to-Real Transfer Learning Method and System for Robot Skills — National University of Defense Technology, 2024
  12. Systems and Methods for Selecting Actions — DeepMind Technologies (Yuanhui Technology), 2026
  13. Guided Uncertainty-Aware Policy Optimization: Combining Model-Free and Model-Based Strategies for Sample-Efficient Learning — NVIDIA Corporation, 2021
  14. Guided Uncertainty-Aware Policy Optimization — NVIDIA Corporation, 2024
  15. Systems and Methods for Learning Reusable Options to Transfer Knowledge Between Tasks — Huawei Technologies Co., Ltd., 2023
  16. Systems and Methods for Learning Reusable Options to Transfer Knowledge Between Tasks — Huawei Technologies Co., Ltd., 2025
  17. Controlling Agents Using Tokenized Goal Images — GDM Holdings LLC, 2026
  18. Determining Environment-Conditioned Action Sequences for Robotic Tasks — Google LLC, 2024
  19. Robot transformer-based meta-imitation learning — Naver Labs Corporation, 2022
  20. Transformer-based meta-imitation learning of robots — Naver Corporation, 2024
  21. Training System, Method and Navigation Robot for Visual Navigation — Naver Corporation, 2022
  22. Methods and Systems for Support Policy Learning — Huawei Technologies Co., Ltd., 2021
  23. Robot Base Position Planning — X Development LLC, 2022
  24. Self-Supervised Robotic Object Interaction — Google LLC, 2024
  25. WIPO — World Intellectual Property Organization: AI Patent Trends
  26. IEEE — Institute of Electrical and Electronics Engineers: Robot Learning Research
  27. NIST — National Institute of Standards and Technology: AI Robustness Framework

All data and statistics in this article are sourced from the references above and from PatSnap’s proprietary innovation intelligence platform.
