Patent Drafting Analysis of DeepMind Technologies Limited’s Multi-Agent Reinforcement Learning with Matchmaking Policies | US 11,627,165 B2
IP Drafting Analysis · US 11,627,165 B2
A structural and strategic analysis of US 11,627,165 B2, examining claim architecture across method, system, and CRM formats, drafting quality signals, §101 eligibility exposure, and critical prosecution gaps in DeepMind's matchmaking-based RL training system.
US 11,627,165 B2 · Filed: Jan 24, 2020 · Granted: Apr 11, 2023 · G06N 3/08 · H04L 9/40 · G06K 9/62
Published by PatSnap Insights Team · 12 min read · Verified by PatSnap Eureka Data
Overview
Structural Overview
The detailed description dominates at approximately 50% of total words (~4,800 words), reflecting substantive technical depth in explaining the matchmaking policy training loop. The background section is lean at ~680 words and offers minimal prior art context. The claim set comprises 30 claims: 3 independent claims (method, system, and CRM) and 27 dependents, a dependent-to-independent ratio of 9:1 that sits above the 4–8:1 range typical of AI/software patents. The three drawing sheets are minimalist, covering a system architecture (FIG. 1) and two simple flow diagrams (FIG. 2, FIG. 3), which provide only coarse-grained structural support for the detailed claim limitations.
Figure Inventory — 3 Sheets
Figure | Description | Role
FIG. 1 | Shows the overall reinforcement learning system 100, including agents 102A-N, environment 104, policy neural network 110, training engine 120, policy data 140, training data 130, labeled task instances 132, learner policies 142A-M with respective matchmaking policies 144A-M, and fixed policy 152. | System architecture
FIG. 2 | Flow diagram of example process 200 for training a policy neural network, showing three sequential steps: maintain pool of candidate action selection policies (202), maintain matchmaking policies (204), and train the policy neural network (206). | Flow diagram
FIG. 3 | Flow diagram of example process 300 for updating learner policies based on training data, showing per-learner-policy steps: select one or more policies (302), generate training data for the learner policy (304), and update the respective set of policy parameters (306). | Flow diagram
Analysis powered by PatSnap Eureka. Patent text and figures publicly available from USPTO.
Claims
Claim Architecture Analysis
The patent contains 3 independent claims: Claim 1 (method), Claim 20 (CRM/non-transitory storage media), and Claim 21 (system), providing tripartite enforcement coverage across method, storage medium, and apparatus formats. The dependent:independent ratio of 9:1 (27 dependents over 3 independents) sits above the software/AI industry norm of 4–8:1, reflecting a deliberately layered fallback strategy. The dependents are unevenly distributed: Claims 2–19 depend from Claim 1, Claims 22–30 depend from Claim 21 and largely parallel the earlier Claim 1 dependents, and Claim 20 carries no dependents of its own. This concentrates fallback depth on the method claim.
Core inventive concept: The claims address the challenge of training a policy neural network to control agents performing tasks in multi-agent environments where the state and strategic spaces are extremely large — a problem that arises because a single training opponent set cannot cover the diversity of strategies needed. The solution, as expressed across Claims 1, 20, and 21, is maintaining a pool of candidate action selection policies where each learner policy has its own "matchmaking policy" defining a probability distribution over the pool, allowing each learner to be trained against different, strategically selected opponents, thereby encouraging exploration of diverse state and strategy spaces.
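The per-learner matchmaking idea described above can be illustrated with a minimal sketch. It is not taken from the patent: the `Policy` class, the `sample_opponents` helper, the policy names, and the probability values are all hypothetical, and the only point shown is that each learner draws its training opponents from its own distribution over a shared pool.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    name: str
    is_learner: bool  # False means a fixed (frozen) policy


def sample_opponents(pool, distribution, k, rng):
    """Draw k opponents from the pool using one learner's matchmaking distribution."""
    # Policies missing from the distribution get weight 0 and are never drawn.
    weights = [distribution.get(p.name, 0.0) for p in pool]
    return rng.choices(pool, weights=weights, k=k)


pool = [Policy("learner_a", True), Policy("learner_b", True), Policy("fixed_0", False)]

# One distribution per learner policy; the values here are made up.
matchmaking = {
    "learner_a": {"learner_a": 0.5, "learner_b": 0.5},  # plays learners only
    "learner_b": {"fixed_0": 1.0},                      # plays the fixed policy only
}

rng = random.Random(0)
opponents_a = sample_opponents(pool, matchmaking["learner_a"], 4, rng)
opponents_b = sample_opponents(pool, matchmaking["learner_b"], 4, rng)
print([p.name for p in opponents_a], [p.name for p in opponents_b])
```

Because the two learners use different distributions, they see different opponents even though they share one pool, which is the diversification effect the claims rely on.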
Independent Claim Dissection
Claim | Preamble | Transition | Key Body Elements
Claim 1 | A method of training a policy neural network having a plurality of policy parameters and used to select actions to be performed by an agent to control the agent to perform a particular task while interacting with one or more other agents in an environment | comprising | maintaining a pool of candidate action selection policies including learner policies and fixed policies; maintaining per-learner-policy matchmaking policies defining distributions over the pool; at each training iteration for each learner policy: selecting policies via matchmaking, generating training data via agent interaction, updating policy parameters via RL loss function; determining criteria for converting a learner to a fixed policy; generating new fixed policy with same parameter values
Claim 20 | One or more non-transitory computer-readable storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations for training a policy neural network having a plurality of policy parameters and used to select actions to be performed by an agent to control the agent to perform a particular task while interacting with one or more other agents in an environment | comprising | operations mirroring Claim 1: maintaining candidate action selection pool with learner and fixed policies; maintaining per-learner matchmaking policies; iterative selection via matchmaking, training data generation, RL-based parameter update; criteria-based conversion of learner to fixed policy; new fixed policy generation with same parameter values
Claim 21 | A system comprising one or more computers and one or more storage devices storing instructions that when executed by one or more computers cause the one or more computers to perform operations for training a policy neural network having a plurality of policy parameters and used to select actions to be performed by an agent to control the agent to perform a particular task while interacting with one or more other agents in an environment | comprising | operations mirroring Claims 1 and 20: maintaining candidate action selection pool with learner and fixed policies; maintaining per-learner matchmaking policies defining distributions; iterative matchmaking-based policy selection, training data generation, RL parameter updates; criteria-based learner-to-fixed conversion; new fixed policy generation
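The body elements shared by the three independent claims can be sketched as one training iteration followed by a learner-to-fixed conversion. This is an illustrative sketch under stated assumptions, not the patented implementation: `rl_update` is a stand-in that merely nudges parameters, the conversion criterion is a simple iteration count (the variant summarized for Claim 12), and all names and values are hypothetical.

```python
import copy
import random
from dataclasses import dataclass


@dataclass
class Policy:
    name: str
    params: list
    is_learner: bool
    iterations: int = 0


def rl_update(learner, opponents):
    """Placeholder for an RL loss update; it only shifts each parameter."""
    learner.params = [p + 0.1 * len(opponents) for p in learner.params]


def train_step(pool, matchmaking, snapshot_every, rng):
    """One iteration: every learner selects opponents, trains, and may be snapshotted."""
    learners = [p for p in pool if p.is_learner]
    for learner in learners:
        dist = matchmaking[learner.name]
        candidates = list(pool)
        weights = [dist.get(c.name, 0.0) for c in candidates]
        opponents = rng.choices(candidates, weights=weights, k=2)
        rl_update(learner, opponents)
        learner.iterations += 1
        # Assumed conversion criterion: a fixed number of completed iterations.
        if learner.iterations % snapshot_every == 0:
            frozen_name = f"{learner.name}_fixed_{learner.iterations}"
            # The new fixed policy copies the learner's current parameter values.
            pool.append(Policy(frozen_name, copy.deepcopy(learner.params), False))


rng = random.Random(0)
learner = Policy("learner_a", [0.0], True)
pool = [learner]
matchmaking = {"learner_a": {"learner_a": 1.0}}  # self-play only, for brevity
for _ in range(3):
    train_step(pool, matchmaking, snapshot_every=3, rng=rng)
print([p.name for p in pool])
```

The learner keeps training after the snapshot, while the frozen copy stays in the pool as a stable opponent. In this sketch a new fixed policy receives zero selection weight until a matchmaking distribution is updated to include it.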
Claim Dependency Tree
1 Method: training policy neural network via pool of candidate policies + per-learner matchmaking distributions + RL parameter updates + learner-to-fixed conversion
2 Adds: matchmaking policies for two or more learner policies are different from each other
3 Further: learner policies each assigned a respective type from plurality of types, each type associated with different matchmaking policy
4 Adds: matchmaking policy for at least one learner is uniform across learner policies of same type and zero for different types and fixed policies
5 Adds: matchmaking policy for at least one learner is uniform across all learner policies and zero for fixed policies
6 Adds: matchmaking policy for at least one learner is uniform across all policies in the pool
7 Adds: RL loss function depends on plurality of hyperparameters; hyperparameter values different for two or more learner policies
8 Further: hyperparameters include one or more hyperparameters of a RL algorithm used in training
9 Further: hyperparameters include internal reward hyperparameters defining whether RL loss depends on internal reward and how it is computed
10 Adds: one or more fixed policies defined by policy parameter values determined through supervised learning on labeled task instances
11 Further: supervised learning comprises first supervised learning using first training data and second supervised learning using only selected portion with threshold performance
12 Adds: determining criteria satisfied comprises determining a predetermined number of training iterations have been completed
13 Adds: in response to conversion criteria satisfied, setting policy parameters of particular learner based on current values of other policies in pool
14 Further: setting policy parameters to new set determined based on current sets of values for policy parameters defining one or more other policies in pool
15 Further: in response, modifying hyperparameters of the RL loss function for the particular learner policy
16 Further: in response, modifying the matchmaking policy for the particular learner policy
17 Adds: for at least one selected policy, updating its policy parameters by training on training data through RL to optimize RL loss function
18 Adds: determining criteria satisfied comprises determining agent controlled by learner has attained threshold level of performance on the particular task
19 Adds: matchmaking policy specifies higher-performing learner policies are more likely to be selected than lower-performing ones
20 CRM: non-transitory storage media encoding operations identical in structure to Claim 1 method for training policy neural network with matchmaking-based learner-opponent selection
21 System: one or more computers + storage devices performing operations structurally identical to Claim 1 for training policy neural network with matchmaking policies
22 Adds: matchmaking policies for two or more learner policies are different (parallels Claim 2)
23 Further: learner policies each assigned type from plurality; type associated with different matchmaking policy (parallels Claim 3)
24 Adds: type-based uniform matchmaking, zero for other types and fixed policies (parallels Claim 4)
25 Adds: uniform across all learner policies, zero for fixed (parallels Claim 5)
26 Adds: uniform across all policies in pool (parallels Claim 6)
27 Adds: RL loss depends on hyperparameters different across learner policies (parallels Claim 7)
28 Further: hyperparameters include RL algorithm hyperparameters (parallels Claim 8)
30 Adds: in response to conversion criteria, setting policy parameters of learner based on current values of other policies in pool (parallels Claim 13)
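The three uniform matchmaking variants summarized for Claims 4 to 6 differ only in which pool members receive non-zero probability. The sketch below builds each distribution under stated assumptions: the `Entry` record, the type labels "main" and "exploiter", and the helper names are hypothetical and are not drawn from the patent text.

```python
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    name: str
    is_learner: bool
    type_: Optional[str] = None  # learner type; fixed policies have none


def _uniform(pool, eligible):
    """Uniform probability over eligible entries, zero for everything else."""
    chosen = [e for e in pool if eligible(e)]
    return {e.name: (1.0 / len(chosen) if e in chosen else 0.0) for e in pool}


def same_type_uniform(pool, type_):
    # Claim 4 style: uniform over learners of one type, zero elsewhere.
    return _uniform(pool, lambda e: e.is_learner and e.type_ == type_)


def all_learners_uniform(pool):
    # Claim 5 style: uniform over all learners, zero for fixed policies.
    return _uniform(pool, lambda e: e.is_learner)


def whole_pool_uniform(pool):
    # Claim 6 style: uniform over every policy in the pool.
    return _uniform(pool, lambda e: True)


pool = [
    Entry("main_1", True, "main"),
    Entry("main_2", True, "main"),
    Entry("exploiter_1", True, "exploiter"),
    Entry("fixed_0", False),
]
claim4 = same_type_uniform(pool, "main")
claim5 = all_learners_uniform(pool)
claim6 = whole_pool_uniform(pool)
print(claim4, claim5, claim6, sep="\n")
```

Each helper returns a full distribution over the pool, so the variants can be swapped per learner without changing the sampling code.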
Metric | This Application | Software / AI Industry Norm
Total claims | 30 | 15–25
Independent claim count | 3 | 2–4
Dependent : Independent ratio | 9.0 : 1 | 4–8 : 1
Method claims present? | Yes (Claim 1) | Common
System / apparatus claims? | Yes (Claim 21) | Common
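For reference, the ratio follows directly from the claim counts reported above. The symbols N_total, N_ind, and N_dep are introduced here only for illustration, and the split of dependents between Claim 1 and Claim 21 is read from the claim numbering in the dependency tree.

```latex
\[
N_{\text{dep}} = N_{\text{total}} - N_{\text{ind}} = 30 - 3 = 27,
\qquad
\frac{N_{\text{dep}}}{N_{\text{ind}}} = \frac{27}{3} = 9
\]
\[
\underbrace{18}_{\text{Claims 2--19}} + \underbrace{9}_{\text{Claims 22--30}} = 27
\]
```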
Drafting Quality
Drafting Quality Signals
The patent demonstrates strong claim architecture through its tripartite independent claim structure (Claims 1, 20, 21) and a high dependent claim ratio of 9:1, providing layered fallback against validity attacks, particularly through Claims 3–9, which add distinct technical limitations around policy typing, hyperparameter diversity, and internal reward schemes. The primary quality weakness lies in §101 eligibility exposure: the independent claims recite computational operations without tying them to any specific hardware architecture or physical effect (Claims 20 and 21 recite only generic computers and storage media), leaving the claims vulnerable to an Alice Step 1 challenge.
Antecedent Basis
Antecedent basis is well-managed throughout the claim set. Claim 1 introduces "a pool of candidate action selection policies" and consistently references it as "the pool" in subsequent limitations. Similarly, "a respective matchmaking policy" is introduced then correctly referenced as "the matchmaking policy for the learner policy" in the selection step. The dependent claims that reference "the matchmaking policies" (e.g., Claims 2, 4, 19) trace cleanly back to the maintaining step in Claim 1. No orphaned "the" references were identified across Claims 1–30.
The specification provides direct written description support for the key independent claim limitations. FIG. 1 maps to the pool maintenance step (learner policies 142A-M, fixed policy 152, training engine 120), FIG. 2 maps to the three-step training process in Claims 1/20/21, and FIG. 3 maps to the per-learner-policy iteration sub-steps (selection 302, training data generation 304, parameter update 306). The "learner-to-fixed conversion" limitation in Claims 1, 20, and 21 is supported by detailed written description at columns 11–12. However, Equation 1 (the weighting function for matchmaking selection) appears in the spec but has no corresponding claim language referencing a probability weighting function, representing a mild asymmetry.
All three independent claims use "comprising" as the transition word, which is the strategically correct choice for this technology domain — it renders the claims open-ended, meaning a system or method that adds further steps or components to the claimed structure still infringes. The use of "comprising" in both the method claim preamble and in the pool-maintenance sub-limitation ("the pool of candidate action selection policies comprising: (i)...(ii)...") is also correct and consistent. No restrictive "consisting of" or "consisting essentially of" language appears, and there were no missed opportunities to broaden by use of "including" or "having."
No "means for" or "step for" language appears in any of the 30 claims, and the claims do not use functional label constructs (e.g., "selection means," "training module") that would trigger §112(f) interpretation. The independent claims use active verb forms ("maintaining," "selecting," "generating," "updating," "determining"), which generally avoid §112(f) treatment under USPTO examination guidelines. The system claims (Claims 21–30) are drafted as computer-implemented operations rather than as named structural components, which also limits §112(f) exposure. This is a clean, low-risk drafting approach for this technology domain.
The independent claims carry meaningful Alice Step 1 exposure: Claim 1 recites maintaining data structures, selecting policies, generating training data, and updating parameters, all computational operations that a court may characterize as a mathematical concept or mental process. The hardware recitations in Claims 20 and 21 ("non-transitory computer-readable storage media" and "system comprising one or more computers") offer some support at Step 2B of the eligibility analysis (the second step of the Alice test), but only for the CRM and system claims, and recitation of generic computer components is frequently held insufficient on its own. Claim 1 (method) lacks any explicit hardware anchor. A stronger filing would have included at least one dependent claim specifying the distributed computing architecture (e.g., actor-learner architecture referenced in the spec) as a concrete hardware limitation to reinforce the §101 defense during prosecution.
The dependent claims from Claim 1 add genuinely distinct fallback positions. Claims 3–6 provide a structured hierarchy of matchmaking distribution specificity (type-specific → learner-only uniform → all-policy uniform), each representing a distinct scope. Claims 7–9 add a meaningful technical dimension around hyperparameter diversity, which is a key aspect of the diversified learning strategy described in the specification. Claims 10–11 add the supervised learning initialization mechanism. However, Claims 22–30 (depending from Claim 21) are near-verbatim parallels of Claim 1 dependents (for example, Claims 22–28 parallel Claims 2–8 and Claim 30 parallels Claim 13), adding little incremental protection beyond what the tripartite structure already provides. Claim 20 has no dependents, so a continuation adding CRM-specific dependent variations would have strengthened the portfolio.
The abstract describes the invention at a functional level — "maintaining data specifying a pool of candidate action selection policies; maintaining data specifying respective matchmaking policy; and training the policy neural network using a reinforcement learning technique" — which is accurate but insufficiently differentiated from generic RL training systems. An examiner reading only the abstract would not identify the novel contribution: the per-learner-policy matchmaking distributions that solve the large-state-space exploration problem. The abstract omits the learner-to-fixed-policy conversion mechanism, which is the distinguishing structural feature of the claims, and fails to name the specific technical problem being solved (insufficient strategy diversity in multi-agent RL training).
The three figures provide only high-level structural support for the claims, and several key claim limitations lack dedicated figure support. FIG. 1 supports the pool architecture (learner policies 142A-M, matchmaking policies 144A-M, fixed policy 152), FIG. 2 maps to the top-level three-step process in Claims 1/20/21, and FIG. 3 maps to the per-learner iteration sub-steps. However, no figure illustrates the learner-to-fixed-policy conversion mechanism (the distinguishing limitation in all three independent claims), the hyperparameter diversity mechanism (Claim 7), the internal reward computation (Claim 9), or the supervised learning initialization (Claim 10). A stronger filing would have included a figure showing the conversion trigger logic and the resulting pool state transition.
Scorecard
Strategic Intent Scorecard
Multi-dimensional assessment of this application's patent strategy quality, based on claim structure, specification depth, and prosecution positioning.
Claim Breadth: 3.8 / 5.0
Prosecution Defensibility: 3.5 / 5.0
Spec–Claim Consistency: 3.6 / 5.0
Dependent Claim Coverage: 4.0 / 5.0
Claim Type Diversity: 4.5 / 5.0
Figure Support Quality: 2.8 / 5.0
Key observation: Claim Type Diversity scores highest (4.5/5.0) because the tripartite structure across Claims 1 (method), 20 (CRM), and 21 (system) provides enforcement coverage in all three formats most relevant to AI software deployments, significantly complicating design-around attempts. Figure Support Quality scores lowest (2.8/5.0) because the three figures — all high-level flow diagrams and a single system architecture — leave the learner-to-fixed conversion mechanism, hyperparameter diversity, and supervised learning initialization (all claimed in independent or second-tier dependent claims) without any dedicated diagrammatic support, creating written description vulnerability for those limitations. Practitioners drafting continuations should prioritize adding figures that illustrate the conversion trigger logic and pool state transitions to anchor the deeper dependent claims.
A senior-attorney lens on the three highest-priority structural weaknesses — what each exposes in prosecution and litigation, and what a stronger filing would have done differently.
GAP 01 · HIGHEST IMPACT
No Dedicated Claim on Matchmaking Weighting Function
The structural weakness is that Equation 1, the probability weighting function f:[0,1]→[0,∞) that assigns selection weights based on policy performance, appears in detail in the specification (column 9) but is not recited in any of the 30 claims. The closest claim is Claim 19, which requires only that higher-performing learner policies be more likely to be selected than lower-performing ones and does not recite a weighting function. This creates a design-around and fallback risk: a competitor could use the weighting function disclosed in the spec while arguing around the qualitative language of Claim 19, and no dependent claim preserves the function itself if the independent claims are narrowed. A stronger filing would have added at least one dependent claim reciting that the matchmaking policy assigns selection probabilities to candidate policies as a function of their respective performance scores, as measured by received rewards or RL loss function values.
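To make the weighting-function idea concrete, the sketch below maps per-policy performance scores in [0, 1] through a function f:[0,1]→[0,∞) and normalizes the results into selection probabilities. It is an assumption-laden illustration rather than the patent's Equation 1: the squared weighting, the score values, and the name `matchmaking_distribution` are hypothetical.

```python
def matchmaking_distribution(scores, weight_fn):
    """Turn per-policy performance scores in [0, 1] into selection probabilities."""
    weights = {}
    for name, score in scores.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score for {name} must lie in [0, 1]")
        w = weight_fn(score)
        if w < 0.0:
            raise ValueError("weight_fn must return non-negative values")
        weights[name] = w
    total = sum(weights.values())
    if total == 0.0:
        raise ValueError("at least one policy needs a positive weight")
    # Normalize so the weights form a probability distribution over the pool.
    return {name: w / total for name, w in weights.items()}


# Hypothetical scores and a hypothetical weighting f(x) = x^2.
scores = {"policy_a": 0.9, "policy_b": 0.5, "policy_c": 0.1}
probs = matchmaking_distribution(scores, lambda x: x ** 2)
print(probs)
```

With an increasing weighting function, higher-scoring policies are selected more often, which is the qualitative behavior summarized for Claim 19. Swapping in a decreasing function would instead favor weaker policies without changing the rest of the code.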
GAP 02 · HIGH IMPACT
Method Claim 1 Lacks Any Hardware Anchor for §101 Defense
Claim 1 recites exclusively computational operations (maintaining data, selecting policies, generating training data, updating parameters) with no reference to any physical computing system, distributed architecture, or hardware component. This creates a §101 Alice Step 1 vulnerability where the claim could be characterized as directed to an abstract mathematical method for adjusting probability distributions. While Claims 20 and 21 are anchored to non-transitory media and computer systems respectively, a method claim that survives §101 independently is essential for enforcement against process infringers who may not own the storage media or systems. A stronger filing would have introduced a dependent claim specifying the actor-learner distributed architecture described in the spec (citing Mnih et al., arXiv:1602.01783, which describes asynchronous actor-critic training, not the IMPALA architecture) as the concrete hardware context.
GAP 03 · HIGH IMPACT
No Claims on Cooperative Multi-Agent Task Variants
US 11,627,165 B2 protects a method, computer-readable storage medium, and system for training a policy neural network in multi-agent reinforcement learning environments. The patent solves the problem of insufficient strategy diversity when training agents to perform tasks in environments with large state and strategic spaces. The specific mechanism is maintaining a pool of candidate action selection policies where each learner policy has its own matchmaking policy — a probability distribution — used to select training opponents, with criteria for converting high-performing learner policies into fixed policies.
US 11,627,165 B2 is assigned to DeepMind Technologies Limited, London, GB. The named inventors are David Silver (Hitchin, GB), Oriol Vinyals (London, GB), and Maxwell Elliot Jaderberg (London, GB).
Claim 1 is a method claim covering the training of a policy neural network by maintaining a pool of candidate policies with per-learner matchmaking distributions, iteratively selecting training opponents via matchmaking, updating policy parameters through RL, and converting qualifying learners into fixed policies. Claim 20 is a CRM (non-transitory computer-readable storage media) claim reciting the same operations as Claim 1 encoded as executable instructions. Claim 21 is a system claim covering one or more computers and storage devices configured to perform the same operations as Claims 1 and 20.
This patent covers a training technique for AI systems where multiple software agents learn to perform tasks by interacting with each other — such as in games, robotics, or simulations. The core problem is that when many agents train together, they can all end up learning the same narrow strategy, failing to explore the full range of possible situations. DeepMind's solution gives each learning agent its own "matchmaking policy" — a set of rules for choosing which other agents to train against — so different learning agents are exposed to different opponent strategies, making the overall training system more robust and capable.
G06N 3/08 (2006.01) — Artificial neural networks: learning methods. H04L 9/40 (2022.01) — Network security and protocols, covering cybersecurity environment implementations. G06K 9/62 (2022.01) — Methods for character or pattern recognition using machine learning techniques. Additional CPC classifications include H04L 63/205 (2013.01), G06K 9/6256 (2013.01), and G06N 3/08 (2013.01).
Disclaimer: This analysis is generated by PatSnap Eureka AI based on publicly available patent data from the USPTO. It does not constitute legal advice and should not be relied upon as such. Patent data may be subject to change as prosecution progresses. Scores and assessments reflect automated analysis and may not capture all relevant legal or technical nuances. Always consult a qualified patent attorney for formal legal opinions on patentability, freedom to operate, or infringement.