Patent Drafting Analysis of DeepMind Technologies Limited’s Asynchronous Deep Reinforcement Learning System | US 12,020,155 B2
IP Drafting Analysis · US 12,020,155 B2
Patent Drafting Analysis of DeepMind's Asynchronous Deep Reinforcement Learning System | US 12,020,155 B2
A structural and strategic analysis of US 12,020,155 B2, examining claim architecture, drafting quality, critical gaps, and prosecution positioning for DeepMind's parallel worker-based deep RL training system.
US 12,020,155 B2 · Filed: Apr 29, 2022 · Granted: Jun 25, 2024 · G06N 3/08 · G06N 3/04 · G06N 3/045
System architecture, training flow diagrams, RL technique iterations
Published by PatSnap Insights Team · 12 min read · Verified by PatSnap Eureka Data
Overview
Structural Overview
The detailed description dominates at approximately 50% of total specification words (~2,600 of ~5,200), and the claims section accounts for a substantial ~25% share, reflecting the complexity of the multi-worker asynchronous training architecture. The patent presents 20 claims, 3 of them independent (Claims 1, 8, and 15), covering system, computer storage media, and method claim types, for a dependent-to-independent ratio of approximately 5.7:1 (17 dependent claims to 3 independent). The 5 figures provide adequate but lean coverage: FIG. 1 shows the overall system architecture, while FIGs. 2–5 illustrate process flows for training and RL technique iterations.
Figure Inventory — 5 Sheets
| Figure | Description | Role |
| --- | --- | --- |
| FIG. 1 | Neural Network Training System 100 showing multiple workers (102A–102N), actors (104A–104N), environment replicas (106A–106N), and shared memory 110 interconnected in a parallel architecture. | System architecture |
| FIG. 2 | Flow diagram of process 200 for training a deep neural network, including steps for determining parameter values, selecting actions, updating accumulated gradients, and conditionally writing to shared memory (steps 202–220). | Flow diagram |
| FIG. 3 | Flow diagram of process 300 for performing an iteration of a Q-learning technique, showing steps for receiving observation/action/reward, determining maximum target network output, determining error, and computing gradient (steps 302–308). | Flow diagram |
| FIG. 4 | Flow diagram of process 400 for performing an iteration of a SARSA technique, including steps for receiving inputs, selecting next action, determining next target network output, determining error, and computing gradient (steps 402–410). | Flow diagram |
| FIG. 5 | Flow diagram of process 500 for training a policy neural network using baseline scores and actual long-term rewards, with steps for determining policy network parameters, receiving observations, computing gradient updates, and conditionally writing to shared memory (steps 502–520). | Claim support |
Analysis powered by PatSnap Eureka. Patent text and figures publicly available from USPTO.
Claims
Claim Architecture Analysis
The patent presents 3 independent claims: Claim 1 (system), Claim 8 (non-transitory storage media/CRM), and Claim 15 (method), providing tripartite enforcement coverage across all principal claim types. The dependent-to-independent ratio is 5.67:1 (17 dependent claims to 3 independent claims), which is at the low-to-moderate end for AI/ML software patents in class G06N, where ratios of 8:1 or higher are common. Notably, Claims 6–7, 13–14, and 20 add specific mathematical gradient formulations as dependent limitations, providing valuable fallback positions if broader claim language is challenged.
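The 5.67:1 figure can be checked with a few lines of Python. This is only a sketch: the analysis states that Claim 7 depends on Claim 6 and Claim 14 on Claim 13, while every other parent in the `PARENT` map below is an assumption (the nearest preceding independent claim), and the names `PARENT`, `root`, and `family_sizes` are illustrative.

```python
# Parent of each dependent claim. Only "7 depends on 6" and "14 depends on 13"
# are stated in the analysis; all other parents are ASSUMED to be the nearest
# preceding independent claim.
PARENT = {
    2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 6,
    9: 8, 10: 8, 11: 8, 12: 8, 13: 8, 14: 13,
    16: 15, 17: 15, 18: 15, 19: 15, 20: 15,
}
INDEPENDENT = (1, 8, 15)


def root(claim):
    """Follow parent links until an independent claim is reached."""
    while claim in PARENT:
        claim = PARENT[claim]
    return claim


def family_sizes():
    """Count the dependent claims that hang off each independent claim."""
    sizes = {claim: 0 for claim in INDEPENDENT}
    for dependent in PARENT:
        sizes[root(dependent)] += 1
    return sizes


def dependent_to_independent_ratio():
    return len(PARENT) / len(INDEPENDENT)
```

Under these assumptions the system and CRM families each carry six dependent claims and the method family five, which is why the ratio is 17/3 rather than a whole number.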
Core inventive concept: The claims address the computational bottleneck and communication overhead in parallelised deep reinforcement learning. They do so by training a deep neural network, comprising both a policy neural network and a baseline neural network, with multiple asynchronous workers. Each worker maintains a local instance of the network, reads current parameter values from shared memory, and conditionally writes parameter values updated from its accumulated gradients back to shared memory, which enables on-policy RL training without an experience replay memory. The key mechanism, recited across Claims 1, 8, and 15, is the asynchronous, criterion-gated update cycle, in which each worker performs the step of 'determining whether criteria for updating the current values of the parameters of the deep neural network have been satisfied' before writing to shared memory.
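To make the criterion-gated cycle concrete, here is a minimal Python sketch of workers that accumulate gradients locally and write to shared memory only when an update criterion is met. It is not the patented implementation. The quadratic `toy_gradient` stands in for a real RL-technique iteration, the iteration-count criterion mirrors the dependent claims, and the learning rate, the lock, and all names (`SharedMemory`, `run_worker`, `run_workers`) are assumptions made for illustration.

```python
import threading


class SharedMemory:
    """Parameter store that every worker can read and write."""

    def __init__(self, n_params):
        self.params = [0.0] * n_params
        self.writes = 0
        # ASSUMPTION: a lock keeps this demo deterministic; the analysis notes
        # that the claims do not say whether writes are locked.
        self.lock = threading.Lock()


def toy_gradient(params, target=1.0):
    """Stand-in for one RL iteration: gradient of 0.5 * (p - target) ** 2."""
    return [p - target for p in params]


def run_worker(shared, iterations, update_every, learning_rate):
    with shared.lock:
        local = list(shared.params)          # determine current parameter values
    accumulated = [0.0] * len(local)
    since_update = 0
    for _ in range(iterations):
        grad = toy_gradient(local)
        accumulated = [a + g for a, g in zip(accumulated, grad)]
        since_update += 1
        if since_update < update_every:
            continue                          # criteria not met: refrain from writing
        with shared.lock:
            for i, a in enumerate(accumulated):
                shared.params[i] -= learning_rate * a
            shared.writes += 1
            local = list(shared.params)       # re-read values for the next cycle
        accumulated = [0.0] * len(local)      # clear the accumulated gradient
        since_update = 0


def run_workers(n_workers, **kwargs):
    """Start n_workers threads against one SharedMemory and wait for them."""
    shared = SharedMemory(n_params=2)
    threads = [
        threading.Thread(target=run_worker, args=(shared,), kwargs=kwargs)
        for _ in range(n_workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return shared
```

For example, `run_workers(4, iterations=100, update_every=5, learning_rate=0.05)` performs 80 gated writes in total, 20 per worker, regardless of how the threads interleave.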
Independent Claim Dissection
| Claim | Preamble | Transition | Key Body Elements |
| --- | --- | --- | --- |
| Claim 1 | A system for training a deep neural network used to control an agent that interacts with an environment by performing actions selected from a predetermined set of actions, the deep neural network comprising a policy neural network having a plurality of policy parameters and a baseline neural network having a plurality of baseline parameters | comprising | one or more computers configured to implement one or more workers; each worker associated with respective actor and environment instance; workers configured to determine current parameter values from shared memory; receive observations and select actions using policy network scores; generate baseline scores using baseline network; identify actual rewards; determine actual long-term rewards; perform RL technique iteration for gradients; update accumulated gradients; determine criteria satisfaction; when satisfied, write updated parameter values to shared memory |
| Claim 8 | One or more non-transitory storage media storing instructions that when executed by one or more computers cause the one or more computers to implement a system for training a deep neural network used to control an agent that interacts with an environment by performing actions selected from a predetermined set of actions, the deep neural network comprising a policy neural network having a plurality of policy parameters and a baseline neural network having a plurality of baseline parameters | comprising | same structural limitations as Claim 1 encoded as CRM instructions: one or more workers each with actor and environment instance; shared memory access for policy and baseline parameters; observation-based action selection; baseline score generation; actual reward identification; long-term reward determination; RL technique gradient iteration; accumulated gradient updates; criteria-gated parameter updates written to shared memory |
| Claim 15 | A method performed by one or more computers for training a deep neural network used to control an agent that interacts with an environment by performing actions selected from a predetermined set of actions, the deep neural network comprising a policy neural network having a plurality of policy parameters and a baseline neural network having a plurality of baseline parameters | comprising | determining by first worker current parameter values from shared memory; receiving observations and selecting actions using policy network; generating baseline scores from baseline network; identifying actual rewards; determining actual long-term reward; performing RL technique iteration for gradients for baseline and policy networks; updating accumulated gradients; determining criteria satisfaction; when satisfied, writing updated parameter values to shared memory |
Claim Dependency Tree
Claim 1 · System: multiple workers with actors/environment replicas using shared memory for asynchronous policy+baseline network training
    Claim 3 · Adds: clearing updated accumulated gradient when update criteria satisfied
    Claim 4 · Adds: update criteria = specified number of RL iterations since preceding update
    Claim 5 · Adds: actual long-term reward determination method; last observation uses baseline score, earlier observations use reward plus discounted sum
    Claim 6 · Adds: policy gradient update formula $\nabla_{\theta} \log \pi(a_t \mid s_t; \theta)\,(R_t - b_t)$
        Claim 7 · Further: baseline gradient update formula $\partial \big(R_t - b(s_t; \theta'_b)\big)^2 / \partial \theta'_b$ (depends on Claim 6)
Claim 8 · CRM: non-transitory storage media; same asynchronous policy+baseline worker training system as Claim 1
    Claim 10 · Adds: clearing updated accumulated gradient when update criteria satisfied
    Claim 11 · Adds: update criteria = specified number of RL iterations since preceding update
    Claim 12 · Adds: actual long-term reward method; last observation uses baseline score, earlier use discounted sum
    Claim 13 · Adds: policy gradient formula $\nabla_{\theta} \log \pi(a_t \mid s_t; \theta)\,(R_t - b_t)$
        Claim 14 · Further: baseline gradient formula $\partial \big(R_t - b(s_t; \theta'_b)\big)^2 / \partial \theta'_b$ (depends on Claim 13)
Claim 15 · Method: computer-implemented method for training deep neural network with policy+baseline networks using asynchronous workers
    Claim 17 · Adds: clearing updated accumulated gradient when update criteria satisfied
    Claim 18 · Adds: update criteria = specified number of RL iterations since preceding update
    Claim 19 · Adds: actual long-term reward determination; last observation uses baseline score, earlier use discounted sum
    Claim 20 · Adds: policy gradient formula $\nabla_{\theta} \log \pi(a_t \mid s_t; \theta)\,(R_t - b_t)$
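The long-term reward rule summarised for Claims 5, 12, and 19 (the last observation takes the baseline score, and each earlier observation takes its reward plus the discounted later value) can be read as a backward recursion. The sketch below is one such reading, not the claim language itself. The function name, argument layout, and discount factor are assumptions, and terminal-state handling, which this analysis does not describe, is not modelled.

```python
def long_term_rewards(rewards, last_baseline_score, discount):
    """Return one long-term reward per observation, oldest first.

    rewards[i] is the actual reward received after observation i, for every
    observation except the last. The last observation has no reward yet, so
    its long-term reward is seeded with the baseline network's score.
    """
    returns = [0.0] * (len(rewards) + 1)
    returns[-1] = last_baseline_score
    # Walk backwards: R_i = r_i + discount * R_{i+1}
    for i in reversed(range(len(rewards))):
        returns[i] = rewards[i] + discount * returns[i + 1]
    return returns


if __name__ == "__main__":
    # Three rewards followed by a final observation scored 4.0 by the baseline.
    print(long_term_rewards([1.0, 0.0, 2.0], last_baseline_score=4.0, discount=0.5))
    # prints [2.0, 2.0, 4.0, 4.0]
```

Seeding the recursion with the baseline score lets a worker compute returns from a short rollout without waiting for the episode to finish, which fits the iteration-count update criterion in the neighbouring dependent claims.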
| Metric | This Application | Software / AI / ML Industry Norm |
| --- | --- | --- |
| Total claims | 20 | 20 – 30 |
| Independent claim count | 3 | 3 – 5 |
| Dependent : Independent ratio | 5.67 : 1 | 6 – 9 : 1 |
| Method claims present? | Yes (Claim 15) | Common |
| System / apparatus claims? | Yes (Claim 1) | Common |
Drafting Quality
Drafting Quality Signals
The claim set demonstrates strong structural consistency across the tripartite system/CRM/method framework, with Claims 6–7, 13–14, and 20 providing mathematically defined gradient formulas as meaningful fallback positions. However, the near-identical claim bodies across Claims 1, 8, and 15 (only the 'first worker' phrasing distinguishes Claim 15) create a risk that a successful invalidity argument against one independent claim could cascade across all three, leaving no truly distinct defensive positions.
Antecedent Basis
The claim set is largely clean on antecedent basis with no identifiable unsupported 'the [element]' references. Claim 1 introduces 'one or more workers' and consistently refers back to 'the one or more workers'; 'the baseline neural network' and 'the policy neural network' are properly introduced in the preamble of Claims 1, 8, and 15 before being referenced in the claim body. The 'first worker' language in Claim 15 is introduced without prior antecedent in the preamble, which could attract an examiner objection at the continuation stage, though in context it is reasonably understood as one of the 'one or more workers.'
The specification maps well to the independent claim limitations. FIG. 1 and the detailed description (col. 3–4) directly support the 'one or more workers,' 'shared memory 110,' and 'actor/environment replica' structural elements of Claim 1. FIG. 2 (steps 202–218) maps to the 'determining current values,' 'selecting actions,' 'updating accumulated gradients,' and 'writing updated values to shared memory' limitations. FIG. 5 (steps 502–520) directly supports the policy neural network and baseline neural network gradient update operations recited in Claims 1, 8, and 15. The gradient formulas in Claims 6–7, 13–14, and 20 are supported by the mathematical expressions on pages 9–10 of the specification.
All three independent claims (1, 8, 15) correctly use 'comprising' as the transition word, preserving open-ended claim scope and preventing easy design-around by adding additional network components. This is the strategically optimal choice for a machine learning system patent where competitors might add supplementary neural network components while practising the core asynchronous training method. No 'consisting of' or 'consisting essentially of' narrowing transitions appear anywhere in the claim set, which is appropriate given the breadth of the inventive concept.
No 'means for' or 'step for' language appears in any claim, eliminating direct §112(f) trigger language. Functional limitations such as 'configured to implement' and 'configured to repeatedly perform operations comprising' are tied to 'one or more computers' as the structural actor, a construction that is generally treated as reciting sufficient structure to avoid §112(f) interpretation. The 'configured to' construction throughout Claims 1 and 8 follows established drafting practice for software-implemented inventions, providing adequate structural grounding.
Under Alice/Mayo, Claims 1, 8, and 15 carry moderate §101 exposure because the core invention is a mathematical optimisation method (gradient-based neural network parameter updates) implemented on a general-purpose computer, a pattern that has drawn repeated Alice rejections in art unit 2120. The §101 defence rests primarily on the 'one or more computers' hardware tie-in in Claim 1 and the 'non-transitory storage media' in Claim 8, and on the practical application argument that the system trains an agent to 'control' real-world or simulated environments. However, no claim element recites a specific hardware accelerator, FPGA, or dedicated processor architecture that would strengthen the §101 defence, and the shared memory architecture (while technically concrete) may not be viewed as 'significantly more' than the abstract idea by an examiner applying the two-step Alice framework.
The dependent claims add meaningful but highly repetitive fallback positions: Claims 2, 9, and 16 (same-computer execution), Claims 3, 10, and 17 (gradient clearing), Claims 4, 11, and 18 (iteration-count criteria), and Claims 5, 12, and 19 (long-term reward determination) are structurally identical across the three independent claims, so these 12 claims contribute only four distinct technical limitations. Claims 6–7, 13–14, and 20 provide the most substantive fallback by specifying the exact mathematical gradient formulas, but because Claim 7 depends on Claim 6 (and Claim 14 on Claim 13), the most mathematically specific limitation is two dependency layers removed from the broadest claim: a narrow fallback that may not survive if Claim 6 or 13 is also found invalid.
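For readers who want to see what the gradient formulas in Claims 6–7, 13–14, and 20 compute, the sketch below evaluates the policy term ∇θ log π(a_t|s_t;θ)(R_t − b_t) and the baseline term ∂(R_t − b(s_t;θ'_b))²/∂θ'_b for the simplest possible models. The softmax-over-logits policy and the linear baseline are assumptions chosen so that the derivatives are easy to verify by hand. The patent covers deep networks, and nothing here is taken from its specification.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient(logits, action, long_term_reward, baseline_score):
    """grad_theta log pi(action) * (R_t - b_t), where theta are the logits.

    For a softmax policy, d log pi(a) / d theta_i = 1[i == a] - pi_i.
    """
    probs = softmax(logits)
    advantage = long_term_reward - baseline_score
    return [
        ((1.0 if i == action else 0.0) - p) * advantage
        for i, p in enumerate(probs)
    ]


def baseline_gradient(weights, features, long_term_reward):
    """d (R_t - b(s_t))^2 / d weights for a linear baseline b = weights . features."""
    prediction = sum(w * x for w, x in zip(weights, features))
    error = long_term_reward - prediction
    return [-2.0 * error * x for x in features]
```

With two equally likely actions, a long-term reward of 3.0, and a baseline score of 1.0, `policy_gradient([0.0, 0.0], 0, 3.0, 1.0)` returns `[1.0, -1.0]`: the chosen action's logit is pushed up in proportion to how far the outcome beat the baseline. The baseline term is the gradient of a squared error, so a worker would step against it, whereas the policy term is a direction of ascent.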
An examiner reading only the abstract may identify the multi-worker parallel training architecture but will not identify the novel baseline neural network contribution. The abstract states 'each worker is configured to operate independently of each other worker, and wherein each worker is associated with a respective actor that interacts with a respective replica of the environment,' but omits the critical policy-plus-baseline dual-network structure that distinguishes this invention from prior deep RL work (such as Mnih et al. 2015). The abstract also does not mention the criterion-gated shared memory write mechanism, which is the operationally novel aspect of the system. A stronger abstract would have foregrounded both the dual-network architecture and the asynchronous gradient accumulation mechanism.
FIG. 1 directly supports the structural elements of Claims 1 and 8 (workers 102A–102N, actors 104A–104N, environment replicas 106A–106N, shared memory 110). FIG. 2 supports the asynchronous update cycle limitations in all three independent claims, including the conditional 'write updated values to shared memory' (step 218) and 'refrain from writing' (step 220) limitations. FIG. 5 provides direct figure support for the policy and baseline network gradient operations of Claims 1, 8, and 15. However, no figure illustrates the specific mathematical gradient formula operations recited in Claims 6–7, 13–14, and 20 — the gradient equations appear only in the specification text, not in any figure, which is a modest but not fatal gap in figure support.
Scorecard
Strategic Intent Scorecard
Multi-dimensional assessment of this application's patent strategy quality, based on claim structure, specification depth, and prosecution positioning.
Claim Breadth: 3.5 / 5.0
Prosecution Defensibility: 3.8 / 5.0
Spec–Claim Consistency: 4.2 / 5.0
Dependent Claim Coverage: 2.8 / 5.0
Claim Type Diversity: 4.0 / 5.0
Figure Support Quality: 3.5 / 5.0
Key observation: Spec–Claim Consistency (4.2/5.0) is the strongest dimension. Every structural limitation in Claims 1, 8, and 15 maps to a named component in FIG. 1 and a numbered process step in FIGs. 2 and 5, providing robust written description support that would withstand a §112(a) challenge. Dependent Claim Coverage (2.8/5.0) is the weakest dimension: 12 of the 17 dependent claims repeat the same four limitations (same-computer execution, gradient clearing, iteration-count trigger, and long-term reward calculation) across the three parallel independent claims rather than introducing new, independently valuable fallback positions. Practitioners drafting continuations should prioritise adding dependent claims that address specific network architectures (recurrent, convolutional), distributed-machine embodiments, and alternative optimisation algorithms (RMSProp, Adam) to build a more defensible dependent claim landscape.
A senior-attorney lens on the three highest-priority structural weaknesses — what each exposes in prosecution and litigation, and what a stronger filing would have done differently.
GAP 01 · HIGHEST IMPACT
No Claims Cover Distributed Multi-Machine Worker Architecture
Dependent Claims 2, 9, and 16 limit execution of Claims 1, 8, and 15, respectively, to 'the same computer,' but the independent claims themselves neither positively claim nor exclude a distributed multi-machine architecture in which workers operate across a network. This creates a design-around risk: a competitor implementing the identical asynchronous policy-plus-baseline training algorithm across multiple networked machines could argue non-infringement of the independent claims, relying on the same-computer dependent claims to narrow the prosecution history. A stronger filing would have included a separate independent claim affirmatively claiming the distributed (multi-machine) embodiment, which is described in the specification but never claimed independently.
GAP 02 · HIGH IMPACT
Shared Memory Architecture Undefined — Lockless Write Not Claimed
The independent claims recite 'a memory accessible by each of the one or more workers' but do not recite the asynchronous, lock-free (Hogwild-style) nature of the shared memory write mechanism — a key technical differentiator over synchronous parameter-server approaches that is discussed in the specification and cited prior art. This structural weakness means a competitor implementing the same algorithm with a locking synchronisation mechanism could argue that their system falls within the claim scope, weakening the patent's offensive value, or conversely that a fully lock-free implementation is not captured by the claim language. A stronger filing would have included dependent claims specifying that workers write to shared memory without acquiring exclusive locks, tying the claims to the Hogwild-style update mechanism that is the system's key performance advantage.
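The distinction this gap turns on, writing to shared memory with or without an exclusive lock, can be illustrated with a small Python sketch. It makes no claim about how the patented system writes parameters. `SharedParameters`, `write_locked`, `write_lock_free`, and `run` are hypothetical names, and the fixed unit update stands in for a real accumulated gradient.

```python
import threading


class SharedParameters:
    """Shared parameter vector with two alternative write paths."""

    def __init__(self, size):
        self.values = [0.0] * size
        self._lock = threading.Lock()

    def write_locked(self, update):
        # Synchronised path: one writer at a time, so no update is ever lost.
        with self._lock:
            for i, delta in enumerate(update):
                self.values[i] += delta

    def write_lock_free(self, update):
        # Hogwild-style path: no exclusive lock is taken. Each read-modify-write
        # is not atomic, so concurrent writers may overwrite one another.
        for i, delta in enumerate(update):
            self.values[i] += delta


def run(write_name, n_workers, writes_per_worker, size=4):
    """Have n_workers threads each apply a unit update writes_per_worker times."""
    shared = SharedParameters(size)
    write = getattr(shared, write_name)

    def work():
        for _ in range(writes_per_worker):
            write([1.0] * size)

    threads = [threading.Thread(target=work) for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return shared.values
```

The locked path always ends at exactly `n_workers * writes_per_worker` in every slot. The lock-free path can end at or below that total if writes collide, which is the trade-off a dependent claim on lock-free writes would have needed to capture.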
GAP 03 · HIGH IMPACT
No Claims on Target Network Synchronisation Frequency
US 12,020,155 B2 protects a system, computer storage media, and method for asynchronously training a deep neural network — specifically one comprising both a policy neural network and a baseline neural network — using multiple parallel workers that each interact with a separate environment replica via an associated actor. The patent covers the mechanism by which each worker independently accumulates gradients, determines whether update criteria have been satisfied, and when satisfied writes updated policy and baseline parameter values to a shared memory accessible by all workers, enabling on-policy reinforcement learning without experience replay.
The patent is owned by DeepMind Technologies Limited, located in London, GB. The inventors are Volodymyr Mnih (Toronto, Canada), Adrià Puigdomènech Badia (London, GB), Alexander Benjamin Graves (London, GB), Timothy James Alexander Harley (London, GB), David Silver (Hitchin, GB), and Koray Kavukcuoglu (London, GB).
Claim 1 is a system claim covering one or more computers implementing multiple asynchronous workers that use shared memory to train a deep neural network with both a policy neural network and a baseline neural network. Claim 8 is a computer-readable medium (CRM) claim encoding the same system functionality as non-transitory storage media instructions. Claim 15 is a method claim covering the same asynchronous worker-based training process performed by one or more computers, with the operations attributed to 'a first worker.'
This patent covers a way to train AI systems faster by running multiple 'workers' simultaneously on the same computer, each independently experimenting in a virtual copy of the environment and learning from their experiences. Each worker uses two neural networks — a 'policy' network that chooses actions and a 'baseline' network that estimates expected rewards — and periodically saves what it has learned to a shared memory that all workers can read. This approach eliminates the need to store and replay past experiences, reducing memory usage and allowing the AI to learn more efficiently and in parallel.
Disclaimer: This analysis is generated by PatSnap Eureka AI based on publicly available patent data from the USPTO. It does not constitute legal advice and should not be relied upon as such. Patent data may be subject to change as prosecution progresses. Scores and assessments reflect automated analysis and may not capture all relevant legal or technical nuances. Always consult a qualified patent attorney for formal legal opinions on patentability, freedom to operate, or infringement.