Patent Drafting Analysis of DeepMind Technologies Limited’s Asynchronous Deep Reinforcement Learning System | US 12,020,155 B2

IP Drafting Analysis · US 12,020,155 B2

Patent Drafting Analysis of DeepMind's Asynchronous Deep Reinforcement Learning System | US 12,020,155 B2

A structural and strategic analysis of US 12,020,155 B2, examining claim architecture, drafting quality, critical gaps, and prosecution positioning for DeepMind's parallel worker-based deep RL training system.

US 12,020,155 B2 · Filed: Apr 29, 2022 · Granted: Jun 25, 2024 · G06N 3/08 · G06N 3/04 · G06N 3/045
Spec Words: 5,200 (across 6 sections)
Total Claims: 20 (3 independent · 17 dependent)
Figure Sheets: 5 (system architecture, training flow diagrams, RL technique iterations)
Published by PatSnap Insights Team · 12 min read · Verified by PatSnap Eureka Data
Overview

Structural Overview

The detailed description dominates at approximately 50% of total specification words (~2,600 of ~5,200), with the claims section representing a substantial ~25% share, reflecting the complexity of the multi-worker asynchronous training architecture. The patent presents 20 claims across 3 independent claims (Claims 1, 8, and 15) covering system, computer storage media, and method claim types, with a dependent-to-independent ratio of approximately 5.7:1. The 5 figures provide adequate but lean coverage — FIG. 1 shows the overall system architecture while FIGs. 2–5 illustrate process flows for training and RL technique iterations.

Section Word Distribution

Section | Words
Detailed Desc. | 2,600
Claims | 1,300
Summary | 520
Background | 310
Brief Desc. | 205
Abstract | 105
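The percentage shares quoted in the overview can be reproduced from the word counts listed above. The following is a minimal sketch; the variable names are illustrative, and the 5,200 denominator is the article's approximate total (the listed bars themselves sum to slightly less).

```python
# Word counts per section, as listed in the distribution above.
SECTION_WORDS = {
    "Detailed Desc.": 2600,
    "Claims": 1300,
    "Summary": 520,
    "Background": 310,
    "Brief Desc.": 205,
    "Abstract": 105,
}

# Approximate total quoted in the article (an approximation, not an exact sum).
STATED_TOTAL = 5200


def section_shares(words, total):
    """Return each section's share of `total` as a percentage."""
    return {name: 100.0 * count / total for name, count in words.items()}


listed_total = sum(SECTION_WORDS.values())
shares = section_shares(SECTION_WORDS, STATED_TOTAL)

for name, pct in shares.items():
    print(f"{name:<15}{pct:5.1f}%")
print(f"Sum of listed bars: {listed_total}")
```

Using the stated total as the denominator matches the "approximately 50%" and "~25%" figures in the text; dividing by the sum of the listed bars instead would shift each share slightly upward.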

Figure Inventory — 5 Sheets

Figure | Description | Role
FIG. 1 | Neural Network Training System 100 showing multiple workers (102A–102N), actors (104A–104N), environment replicas (106A–106N), and shared memory 110 interconnected in a parallel architecture. | System architecture
FIG. 2 | Flow diagram of process 200 for training a deep neural network, including steps for determining parameter values, selecting actions, updating accumulated gradients, and conditionally writing to shared memory (steps 202–220). | Flow diagram
FIG. 3 | Flow diagram of process 300 for performing an iteration of a Q-learning technique, showing steps for receiving observation/action/reward, determining maximum target network output, determining error, and computing gradient (steps 302–308). | Flow diagram
FIG. 4 | Flow diagram of process 400 for performing an iteration of a SARSA technique, including steps for receiving inputs, selecting next action, determining next target network output, determining error, and computing gradient (steps 402–410). | Flow diagram
FIG. 5 | Flow diagram of process 500 for training a policy neural network using baseline scores and actual long-term rewards, with steps for determining policy network parameters, receiving observations, computing gradient updates, and conditionally writing to shared memory (steps 502–520). | Claim support
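The difference between the FIG. 3 and FIG. 4 iterations comes down to which target network output enters the error term. The sketch below assumes the standard one-step form of each technique; the function names, the list representation of target network outputs, and the discount value are illustrative assumptions rather than details taken from the patent.

```python
def q_learning_error(reward, q_current, target_outputs_next, gamma=0.99):
    """One-step Q-learning error (assumed standard form).

    Uses the maximum target network output over all next actions,
    mirroring the 'determining maximum target network output' step.
    """
    return reward + gamma * max(target_outputs_next) - q_current


def sarsa_error(reward, q_current, target_outputs_next, next_action, gamma=0.99):
    """One-step SARSA error (assumed standard form).

    Uses the target network output for the next action that was actually
    selected, mirroring the 'selecting next action' step.
    """
    return reward + gamma * target_outputs_next[next_action] - q_current


# Hypothetical target network outputs for three possible next actions.
target_next = [1.0, 3.0, 2.0]

print(q_learning_error(reward=1.0, q_current=2.5, target_outputs_next=target_next))
print(sarsa_error(reward=1.0, q_current=2.5, target_outputs_next=target_next, next_action=2))
```

In both cases the gradient step would then differentiate the squared error with respect to the network parameters; that part depends on the network and is left out here.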
Analysis powered by PatSnap Eureka. Patent text and figures publicly available from USPTO. Draft a Similar Patent
Claims

Claim Architecture Analysis

The patent presents 3 independent claims: Claim 1 (system), Claim 8 (non-transitory storage media/CRM), and Claim 15 (method), providing tripartite enforcement coverage across all principal claim types. The dependent-to-independent ratio is 5.67:1 (17 dependent claims across 3 independent claims), which is at the low-to-moderate end for AI/ML software patents in class G06N, where ratios of 8:1 or higher are common. Notably, Claims 6–7, 13–14, and 20 add specific mathematical gradient formulations as dependent limitations, providing valuable fallback positions if broader claim language is challenged.

Core inventive concept: The claims address the computational bottleneck and communication overhead in parallelised deep reinforcement learning. They train a deep neural network comprising both a policy neural network and a baseline neural network using multiple asynchronous workers, each of which maintains a local instance of the network, reads current parameter values from shared memory, and conditionally writes parameter values updated from its accumulated gradients back to shared memory. This enables on-policy RL training without the need for experience replay memory. The key mechanism, recited across Claims 1, 8, and 15, is the asynchronous, criterion-gated update cycle in which each worker performs the step of 'determining whether criteria for updating the current values of the parameters of the deep neural network have been satisfied' before writing to shared memory.
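The criterion-gated update cycle can be illustrated with a toy example. This is a minimal sketch, not the patented implementation: a single scalar parameter and a quadratic objective stand in for the policy and baseline networks, the iteration-count criterion follows the dependent claims, and the unlocked write, class name, and function names are assumptions made for illustration.

```python
import threading


class SharedMemory:
    """Holds the current parameter value visible to every worker."""

    def __init__(self, theta):
        self.theta = theta


def run_worker(shared, target, iterations, update_every, learning_rate):
    """One worker: accumulate gradients locally, write back when the criterion is met."""
    accumulated = 0.0
    since_update = 0
    for _ in range(iterations):
        theta = shared.theta               # determine current values from shared memory
        gradient = 2.0 * (theta - target)  # toy stand-in for one RL technique iteration
        accumulated += gradient            # update the accumulated gradient
        since_update += 1
        if since_update >= update_every:   # criterion: N iterations since the last write
            # Write updated parameter values; no lock is taken (an assumption here).
            shared.theta = shared.theta - learning_rate * accumulated
            accumulated = 0.0              # clear the accumulated gradient
            since_update = 0
        # Otherwise the worker refrains from writing and keeps accumulating.


def train(num_workers=4, target=3.0, iterations=1000, update_every=5, learning_rate=0.01):
    """Run several independent workers against one shared parameter."""
    shared = SharedMemory(theta=-2.0)
    threads = [
        threading.Thread(
            target=run_worker,
            args=(shared, target, iterations, update_every, learning_rate),
        )
        for _ in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return shared.theta


final_theta = train()
print(f"final theta: {final_theta:.4f}")
```

Each worker only touches shared memory at the start of an iteration and when the criterion fires, which is what keeps communication overhead low relative to writing after every gradient.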

Independent Claim Dissection

Claim | Preamble | Transition | Key Body Elements
Claim 1 | A system for training a deep neural network used to control an agent that interacts with an environment by performing actions selected from a predetermined set of actions, the deep neural network comprising a policy neural network having a plurality of policy parameters and a baseline neural network having a plurality of baseline parameters | comprising | one or more computers configured to implement one or more workers; each worker associated with respective actor and environment instance; workers configured to determine current parameter values from shared memory; receive observations and select actions using policy network scores; generate baseline scores using baseline network; identify actual rewards; determine actual long-term rewards; perform RL technique iteration for gradients; update accumulated gradients; determine criteria satisfaction; when satisfied, write updated parameter values to shared memory
Claim 8 | One or more non-transitory storage media storing instructions that when executed by one or more computers cause the one or more computers to implement a system for training a deep neural network used to control an agent that interacts with an environment by performing actions selected from a predetermined set of actions, the deep neural network comprising a policy neural network having a plurality of policy parameters and a baseline neural network having a plurality of baseline parameters | comprising | same structural limitations as Claim 1 encoded as CRM instructions: one or more workers each with actor and environment instance; shared memory access for policy and baseline parameters; observation-based action selection; baseline score generation; actual reward identification; long-term reward determination; RL technique gradient iteration; accumulated gradient updates; criteria-gated parameter updates written to shared memory
Claim 15 | A method performed by one or more computers for training a deep neural network used to control an agent that interacts with an environment by performing actions selected from a predetermined set of actions, the deep neural network comprising a policy neural network having a plurality of policy parameters and a baseline neural network having a plurality of baseline parameters | comprising | determining by first worker current parameter values from shared memory; receiving observations and selecting actions using policy network; generating baseline scores from baseline network; identifying actual rewards; determining actual long-term reward; performing RL technique iteration for gradients for baseline and policy networks; updating accumulated gradients; determining criteria satisfaction; when satisfied, writing updated parameter values to shared memory

Claim Dependency Tree

1 System: multiple workers with actors/environment replicas using shared memory for asynchronous policy+baseline network training
  2 Adds: all workers execute on the same computer
  3 Adds: clearing updated accumulated gradient when update criteria satisfied
  4 Adds: update criteria = specified number of RL iterations since preceding update
  5 Adds: actual long-term reward determination method, where the last observation uses the baseline score and earlier observations use reward plus discounted sum
  6 Adds: policy gradient update formula $\nabla_{\theta} \log \pi(a_t \mid s_t; \theta)\,(R_t - b_t)$
    7 Further: baseline gradient update formula $\partial (R_t - b(s_t; \theta'_b))^2 / \partial \theta'_b$ (depends on Claim 6)
8 CRM: non-transitory storage media with the same asynchronous policy+baseline worker training system as Claim 1
  9 Adds: all workers execute on the same computer
  10 Adds: clearing updated accumulated gradient when update criteria satisfied
  11 Adds: update criteria = specified number of RL iterations since preceding update
  12 Adds: actual long-term reward method, where the last observation uses the baseline score and earlier observations use discounted sum
  13 Adds: policy gradient formula $\nabla_{\theta} \log \pi(a_t \mid s_t; \theta)\,(R_t - b_t)$
    14 Further: baseline gradient formula $\partial (R_t - b(s_t; \theta'_b))^2 / \partial \theta'_b$ (depends on Claim 13)
15 Method: computer-implemented method for training deep neural network with policy+baseline networks using asynchronous workers
  16 Adds: all workers execute on the same computer
  17 Adds: clearing updated accumulated gradient when update criteria satisfied
  18 Adds: update criteria = specified number of RL iterations since preceding update
  19 Adds: actual long-term reward determination, where the last observation uses the baseline score and earlier observations use discounted sum
  20 Adds: policy gradient formula $\nabla_{\theta} \log \pi(a_t \mid s_t; \theta)\,(R_t - b_t)$
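The long-term reward determination in Claims 5, 12, and 19 can be read as a backward recursion: the last observation takes its baseline score, and each earlier observation takes its actual reward plus the discounted long-term reward of the next one. The sketch below is one plausible reading of that summary; the function name, argument layout, and discount value are assumptions.

```python
def long_term_rewards(rewards, last_baseline_score, gamma):
    """Compute long-term rewards R_0..R_T by working backwards from the last observation.

    rewards: actual rewards r_0..r_{T-1} received for the observations before the last.
    last_baseline_score: baseline network score for the last observation.
    gamma: discount factor.
    """
    returns = [0.0] * (len(rewards) + 1)
    returns[-1] = last_baseline_score            # last observation: use the baseline score
    for t in range(len(rewards) - 1, -1, -1):    # earlier observations, newest first
        returns[t] = rewards[t] + gamma * returns[t + 1]
    return returns


# Three rewards followed by a final observation scored 4.0 by the baseline.
print(long_term_rewards([1.0, 0.0, 2.0], last_baseline_score=4.0, gamma=0.5))
# → [2.0, 2.0, 4.0, 4.0]
```

Bootstrapping from the baseline score is what lets a worker compute returns over a short window without waiting for the episode to finish.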
Metric | This Application | Software / AI / ML Industry Norm
Total claims | 20 | 20 – 30
Independent claim count | 3 | 3 – 5
Dependent : Independent ratio | 5.67 : 1 | 6 – 9 : 1
Method claims present? | Yes (Claim 15) | Common
System / apparatus claims? | Yes (Claim 1) | Common
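The dependency tree and the ratio in the metrics table can be cross-checked by encoding the tree as a parent map. This is a small sketch under one assumption: every claim marked "Adds" depends directly on its independent claim, with only Claims 7 and 14 stated to depend on Claims 6 and 13.

```python
# Parent of each claim; independent claims map to None.
# Assumption: "Adds" claims depend directly on their independent claim.
PARENT = {1: None, 8: None, 15: None}
PARENT.update({claim: 1 for claim in range(2, 7)})     # Claims 2-6 depend on Claim 1
PARENT[7] = 6                                          # Claim 7 depends on Claim 6
PARENT.update({claim: 8 for claim in range(9, 14)})    # Claims 9-13 depend on Claim 8
PARENT[14] = 13                                        # Claim 14 depends on Claim 13
PARENT.update({claim: 15 for claim in range(16, 21)})  # Claims 16-20 depend on Claim 15


def depth(claim):
    """Number of dependency layers between a claim and its independent claim."""
    layers = 0
    while PARENT[claim] is not None:
        claim = PARENT[claim]
        layers += 1
    return layers


independent = sorted(c for c, parent in PARENT.items() if parent is None)
dependent = sorted(c for c, parent in PARENT.items() if parent is not None)
ratio = len(dependent) / len(independent)
deepest = [c for c in dependent if depth(c) == 2]

print(f"{len(dependent)} dependent / {len(independent)} independent = {ratio:.2f} : 1")
print(f"Claims two layers removed from an independent claim: {deepest}")
```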
Drafting Quality

Drafting Quality Signals

The claim set demonstrates strong structural consistency across the tripartite system/CRM/method framework, with Claims 6–7, 13–14, and 20 providing mathematically-defined gradient formulas as meaningful fallback positions. However, the near-identical claim bodies across Claims 1, 8, and 15 — with only the 'first worker' phrasing distinguishing Claim 15 — create a risk that a successful invalidity argument against one independent claim could cascade across all three without truly distinct defensive positions.

Antecedent Basis
The claim set is largely clean on antecedent basis with no identifiable unsupported 'the [element]' references. Claim 1 introduces 'one or more workers' and consistently refers back to 'the one or more workers'; 'the baseline neural network' and 'the policy neural network' are properly introduced in the preamble of Claims 1, 8, and 15 before being referenced in the claim body. The 'first worker' language in Claim 15 is introduced without prior antecedent in the preamble, which could attract an examiner objection at the continuation stage, though in context it is reasonably understood as one of the 'one or more workers.'
Spec–Claim Consistency
The specification maps well to the independent claim limitations. FIG. 1 and the detailed description (col. 3–4) directly support the 'one or more workers,' 'shared memory 110,' and 'actor/environment replica' structural elements of Claim 1. FIG. 2 (steps 202–218) maps to the 'determining current values,' 'selecting actions,' 'updating accumulated gradients,' and 'writing updated values to shared memory' limitations. FIG. 5 (steps 502–520) directly supports the policy neural network and baseline neural network gradient update operations recited in Claims 1, 8, and 15. The gradient formulas in Claims 6–7, 13–14, and 20 are supported by the mathematical expressions on pages 9–10 of the specification.
Transition Word Usage
All three independent claims (1, 8, 15) correctly use 'comprising' as the transition word, preserving open-ended claim scope and preventing easy design-around by adding additional network components. This is the strategically optimal choice for a machine learning system patent where competitors might add supplementary neural network components while practising the core asynchronous training method. No 'consisting of' or 'consisting essentially of' narrowing transitions appear anywhere in the claim set, which is appropriate given the breadth of the inventive concept.
§112(f) Means-Plus-Function Risk
No 'means for' or 'step for' language appears in any claim, eliminating direct §112(f) trigger language. Functional limitations such as 'configured to implement' and 'configured to repeatedly perform operations comprising' are tied to 'one or more computers' as the structural actor, which courts have consistently treated as sufficient structural recitation to avoid §112(f) interpretation. The 'configured to' construction throughout Claims 1 and 8 follows established drafting practice for software-implemented inventions, providing adequate structural grounding.
⚠️ §101 Eligibility Risk
Under Alice/Mayo, Claims 1, 8, and 15 carry moderate §101 exposure because the core invention is a mathematical optimisation method (gradient-based neural network parameter updates) implemented on a general-purpose computer, a pattern that has drawn repeated Alice rejections in art unit 2120. The §101 defence rests primarily on the 'one or more computers' hardware tie-in in Claim 1 and the 'non-transitory storage media' in Claim 8, and on the practical application argument that the system trains an agent to 'control' real-world or simulated environments. However, no claim element recites a specific hardware accelerator, FPGA, or dedicated processor architecture that would strengthen the §101 defence, and the shared memory architecture (while technically concrete) may not be viewed as 'significantly more' than the abstract idea by an examiner applying the two-step Alice framework.
⚠️ Dependent Claim Fallback Quality
The dependent claims add meaningful but highly repetitive fallback positions: Claims 2, 9, and 16 (same-computer execution), Claims 3, 10, and 17 (gradient clearing), Claims 4, 11, and 18 (iteration-count criteria), and Claims 5, 12, and 19 (long-term reward determination) are structurally identical across the three independent claims, so these 12 claims contribute only four distinct limitations. Claims 6–7, 13–14, and 20 provide the most substantive fallback by specifying the exact mathematical gradient formulas, but Claim 7 depending on Claim 6 (and Claim 14 depending on Claim 13) means the most mathematically specific limitation is two dependency layers removed from the broadest claim, a narrow fallback that may not survive if Claim 6 or 13 is also found invalid.
⚠️ Abstract Quality
An examiner reading only the abstract may identify the multi-worker parallel training architecture but will not identify the novel baseline neural network contribution — the abstract states 'each worker is configured to operate independently of each other worker, and wherein each worker is associated with a respective actor that interacts with a respective replica of the environment,' but omits the critical policy-plus-baseline dual-network structure that distinguishes this invention from prior asynchronous RL work (such as Mnih et al. 2015). The abstract also does not mention the criterion-gated shared memory write mechanism, which is the operationally novel aspect of the system. A stronger abstract would have foregrounded both the dual-network architecture and the asynchronous gradient accumulation mechanism.
Figure Support Quality
FIG. 1 directly supports the structural elements of Claims 1 and 8 (workers 102A–102N, actors 104A–104N, environment replicas 106A–106N, shared memory 110). FIG. 2 supports the asynchronous update cycle limitations in all three independent claims, including the conditional 'write updated values to shared memory' (step 218) and 'refrain from writing' (step 220) limitations. FIG. 5 provides direct figure support for the policy and baseline network gradient operations of Claims 1, 8, and 15. However, no figure illustrates the specific mathematical gradient formula operations recited in Claims 6–7, 13–14, and 20 — the gradient equations appear only in the specification text, not in any figure, which is a modest but not fatal gap in figure support.
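Because the gradient formulas of Claims 6–7, 13–14, and 20 have no figure of their own, a small numeric example may help. The sketch below evaluates the policy gradient term, the gradient of log π(a|s;θ) scaled by (R − b), and the baseline term, the derivative of (R − b)² with respect to the baseline parameter. It assumes a softmax policy whose parameters are the action logits and a baseline with a single scalar parameter; both parameterisations and all function names are illustrative assumptions, not the patent's networks.

```python
import math


def softmax(logits):
    """Convert logits to action probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient(logits, action, long_term_reward, baseline_score):
    """Gradient of log pi(action) w.r.t. the logits, scaled by (R - b).

    For a softmax over logits, d log pi(a) / d logit_i = 1[i == a] - pi_i.
    """
    probs = softmax(logits)
    advantage = long_term_reward - baseline_score
    return [((1.0 if i == action else 0.0) - p) * advantage for i, p in enumerate(probs)]


def baseline_gradient(long_term_reward, baseline_param):
    """Derivative of (R - b)^2 w.r.t. the baseline parameter, assuming b equals that parameter."""
    return -2.0 * (long_term_reward - baseline_param)


# Two equally likely actions; action 0 was taken, R = 3.0, b = 1.0.
print(policy_gradient([0.0, 0.0], action=0, long_term_reward=3.0, baseline_score=1.0))
# → [1.0, -1.0]
print(baseline_gradient(long_term_reward=3.0, baseline_param=1.0))
# → -4.0
```

A positive (R − b) pushes probability toward the action taken, while the baseline gradient moves b toward R, which is the division of labour between the two networks that the independent claims recite.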
Scorecard

Strategic Intent Scorecard

Multi-dimensional assessment of this application's patent strategy quality, based on claim structure, specification depth, and prosecution positioning.

Dimension | Score (out of 5.0)
Claim Breadth | 3.5
Prosecution Defensibility | 3.8
Spec–Claim Consistency | 4.2
Dependent Claim Coverage | 2.8
Claim Type Diversity | 4.0
Figure Support Quality | 3.5
Key observation: Spec–Claim Consistency (4.2/5.0) is the strongest dimension: every structural limitation in Claims 1, 8, and 15 maps to a named component in FIG. 1 and a numbered process step in FIGs. 2 and 5, providing robust written description support that would withstand a §112(a) challenge. Dependent Claim Coverage (2.8/5.0) is the weakest dimension: 12 of the 17 dependent claims repeat just four limitations (same-computer execution, gradient clearing, iteration-count trigger, and long-term reward calculation) across the three parallel independent claims rather than introducing new, independently valuable fallback positions. Practitioners drafting continuations should prioritise adding dependent claims that address specific network architectures (recurrent, convolutional), distributed-machine embodiments, and alternative optimisation algorithms (RMSProp, Adam) to build a more defensible dependent claim landscape.
Critical Gaps

3 Critical Gaps in This Claim Set

A senior-attorney lens on the three highest-priority structural weaknesses — what each exposes in prosecution and litigation, and what a stronger filing would have done differently.

No distributed multi-machine worker claim
Lockless shared memory write unclaimed
Target network sync frequency not claimed
Disclaimer: This analysis is generated by PatSnap Eureka AI based on publicly available patent data from the USPTO. It does not constitute legal advice and should not be relied upon as such. Patent data may be subject to change as prosecution progresses. Scores and assessments reflect automated analysis and may not capture all relevant legal or technical nuances. Always consult a qualified patent attorney for formal legal opinions on patentability, freedom to operate, or infringement.
