Patent Drafting Analysis of DeepMind Technologies Limited’s Asynchronous Deep Reinforcement Learning | US 11,783,182 B2
IP Drafting Analysis · US 11,783,182 B2
A structural and strategic analysis of US 11,783,182 B2, examining claim architecture, drafting quality, critical gaps, and prosecution positioning across DeepMind's core asynchronous RL training system patent.
US 11,783,182 B2 · Filed: Feb 8, 2021 · Granted: Oct 10, 2023 · G06N 3/08 · G06N 3/045 · G06N 3/04
Published by PatSnap Insights Team · 12 min read · Verified by PatSnap Eureka Data
Overview
Structural Overview
The detailed description accounts for approximately 50% of the specification (~2,600 of ~5,200 words) and provides solid technical depth for the asynchronous training architecture. The claims section, at ~1,730 words, is notably large relative to the specification, reflecting verbose independent claim language. The patent presents 20 claims in a tripartite structure: 3 independent claims covering method (Claim 1), system (Claim 9), and computer-readable medium, or CRM (Claim 17), with 17 dependent claims distributed among them. Five drawing sheets cover the system architecture and process flows, but the figures are limited to block diagrams and flow diagrams, with no detailed architectural diagram of the shared memory update mechanism.
Figure Inventory — 5 Sheets
Figure | Description | Role
FIG. 1 | Block diagram of the neural network training system 100 showing workers 102A-102N, actors 104A-104N, environment replicas 106A-106N, and shared memory 110. | System architecture
FIG. 2 | Flow diagram of process 200 showing the per-worker training loop: determine parameters (202), receive observation (204), select action (206), receive reward (208), compute gradient (210-212), conditionally write to shared memory (214-220). | Flow diagram
FIG. 3 | Flow diagram of process 300 for performing a Q-learning iteration: receive observation/action/reward (302), determine maximum target network output (304), determine error (306), compute gradient (308). | Flow diagram
FIG. 4 | Flow diagram of process 400 for performing a SARSA iteration: receive inputs (402), select next action (404), determine target network output (406), determine error (408), compute gradient using error (410). | Flow diagram
FIG. 5 | Flow diagram of process 500 for training a policy neural network: determine policy network parameters (502), receive observations and select actions (504), determine long-term reward (506), compute per-observation errors (508), gradient updates (510-512), conditionally write to shared memory (514-520). | Flow diagram
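To make the difference between the Q-learning iteration (FIG. 3) and the SARSA iteration (FIG. 4) concrete, the sketch below computes the error term each process feeds into its gradient step. It assumes the standard one-step temporal-difference form; the discount factor `gamma`, the function names, and the sample values are illustrative assumptions, not text taken from the patent.

```python
from typing import Sequence


def q_learning_error(reward: float, gamma: float,
                     target_q_next: Sequence[float], q_taken: float) -> float:
    """FIG. 3 style: bootstrap from the MAXIMUM target-network output."""
    return reward + gamma * max(target_q_next) - q_taken


def sarsa_error(reward: float, gamma: float,
                target_q_next: Sequence[float], next_action: int,
                q_taken: float) -> float:
    """FIG. 4 style: bootstrap from the target-network output for the
    action the worker actually selected next."""
    return reward + gamma * target_q_next[next_action] - q_taken


# Hypothetical target-network outputs for three actions in the next state.
target_q_next = [0.2, 0.9, 0.5]

q_err = q_learning_error(1.0, 0.5, target_q_next, q_taken=1.0)
s_err = sarsa_error(1.0, 0.5, target_q_next, next_action=2, q_taken=1.0)
print(q_err, s_err)
```

The only difference is which target-network output is used: the maximum over actions for Q-learning, or the output for the selected next action for SARSA.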
Analysis powered by PatSnap Eureka. Patent text and figures publicly available from USPTO.
Claims
Claim Architecture Analysis
The patent presents 3 independent claims: Claim 1 (method), Claim 9 (system), and Claim 17 (non-transitory computer storage media/CRM), each covering the asynchronous multi-worker training architecture with per-worker exploration policies. The dependent-to-independent ratio of 5.67:1 falls within the software/AI norm of 4–8:1, with 17 dependent claims adding exploration policy variants (Claims 2–4, 10–12, 18–20), gradient accumulation mechanisms (Claims 5–6, 13–14), and implementation details (Claims 7–8, 15–16). The tripartite independent claim structure provides enforcement coverage across method, system, and storage medium formats, though the claim bodies are notably verbose, which gives an accused infringer more limitations to contest and increases the risk of narrowing constructions.
Core inventive concept: The claims target the specific problem of slow, communication-heavy synchronous deep RL training by reciting a plurality of independently operating workers, each associated with a respective actor and environment replica, wherein each worker's exploration policy is "parameterized by a set of exploration policy parameters" that are "specific to the worker and are different from values of exploration policy parameters of each of one or more other workers." This enables diverse parallel exploration without requiring a replay memory or inter-worker synchronization.
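As a rough illustration of what "exploration policy parameters specific to the worker" can look like in practice, the sketch below gives each worker its own epsilon for epsilon-greedy action selection (the variant recited in Claims 2, 10, and 18). The epsilon values, Q values, and function name are hypothetical, chosen only for the example.

```python
import random
from typing import List


def epsilon_greedy(q_values: List[float], epsilon: float,
                   rng: random.Random) -> int:
    """With probability epsilon pick a random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


# Hypothetical per-worker epsilons; the point is only that they differ.
worker_epsilons = [0.5, 0.1, 0.01]
assert len(set(worker_epsilons)) == len(worker_epsilons)

q_values = [0.1, 0.7, 0.3]  # illustrative Q values for three actions
rngs = [random.Random(seed) for seed in range(len(worker_epsilons))]

# Each worker selects an action under its own exploration parameter.
actions = [epsilon_greedy(q_values, eps, rng)
           for eps, rng in zip(worker_epsilons, rngs)]
print(actions)
```

Workers with a larger epsilon deviate from the greedy action more often, which is how the parallel workers end up exploring different parts of the environment.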
Independent Claim Dissection
Claim | Preamble | Transition | Key Body Elements
Claim 1 | A method of training a deep neural network having a plurality of parameters that is used to select actions to be performed by an agent that interacts with an environment by performing actions selected from a predetermined set of actions | comprising | using a plurality of workers to generate training data; each worker operates independently, associated with a respective actor and environment replica with a distinct exploration policy; each worker repeatedly determines current DNN parameters, receives observations, selects actions, receives rewards, accumulates training data; applying reinforcement learning technique to determine current gradients; determining updated DNN parameter values using gradients
Claim 9 | A system | comprising | one or more computers; one or more storage devices storing instructions that when executed cause the computers to perform operations for training a deep neural network using a plurality of workers, each worker independently operating with distinct exploration policy parameters, generating training data via actor-environment replica interaction, applying reinforcement learning technique to determine gradients, and determining updated DNN parameter values
Claim 17 | One or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations to train an industrial plant controller that controls operation of an industrial plant | comprising | training a deep neural network with a plurality of parameters for agent action selection; using plural independent workers each with a distinct exploration policy; each worker generates training data via repeated observation-action-reward cycles with environment replicas; applying reinforcement learning technique to determine gradients; determining updated DNN parameter values using gradients
Claim Dependency Tree
1 Method: training DNN via plural independent workers, each with distinct exploration policy, accumulating training data, applying RL technique to determine gradients
  2 Adds: epsilon-greedy exploration policy with per-worker different epsilon probability parameter
  3 Adds: sampling new epsilon value from probability distribution when update criterion is met (depends on Claim 2)
  4 Adds: softmax temperature parameter tau for per-worker exploration, sampling action from probability distribution
  5 Adds: per-worker application of RL technique to worker-generated training data to generate per-worker current gradients
  6 Adds: per-worker accumulated gradient update, shared memory criteria check, conditional write of updated parameter values to shared memory (depends on Claim 5)
  7 Adds: each worker executes independently on same computer
  8 Adds: DNN is Q network generating Q values for observation-action pairs, epsilon-greedy action selection using Q values
9 System: one or more computers with storage devices executing instructions for multi-worker asynchronous DNN training with distinct per-worker exploration policies
  10 Adds: epsilon-greedy exploration policy with per-worker different epsilon probability (mirrors Claim 2)
  11 Adds: sampling new epsilon from probability distribution when criterion satisfied (depends on Claim 10)
  12 Adds: softmax temperature tau for per-worker action selection from probability distribution
  13 Adds: per-worker RL technique applied to worker-generated training data for per-worker gradients
  14 Adds: accumulated gradient update, shared memory criteria check, conditional write of updated parameters (depends on Claim 13)
  15 Adds: each worker executes independently on same computer
  16 Adds: DNN is Q network, Q value generation for observation-action pairs, epsilon-greedy selection using Q values
17 CRM: non-transitory computer storage media storing instructions to train industrial plant controller via asynchronous multi-worker DNN training with distinct per-worker exploration policies
  18 Adds: epsilon-greedy exploration policy with per-worker different epsilon (mirrors Claims 2, 10)
  19 Adds: sampling new epsilon from probability distribution when criterion satisfied (depends on Claim 18)
  20 Adds: softmax temperature tau for per-worker action selection via probability distribution sampling
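The softmax-temperature variant in Claims 4, 12, and 20 can be sketched in the same spirit. The example below assumes the common Boltzmann form, in which action scores are divided by a temperature tau before normalization; the use of Q values as scores, the tau values, and the function names are assumptions made for illustration.

```python
import math
import random
from typing import List


def softmax_probs(scores: List[float], tau: float) -> List[float]:
    """Softmax over scores / tau, shifted by the max for numerical stability."""
    top = max(scores)
    exps = [math.exp((s - top) / tau) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(scores: List[float], tau: float, rng: random.Random) -> int:
    """Sample an action index from the temperature-scaled distribution."""
    probs = softmax_probs(scores, tau)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]


scores = [0.1, 0.7, 0.3]        # illustrative action scores
worker_taus = [0.1, 1.0, 10.0]  # hypothetical per-worker temperatures

for tau in worker_taus:
    print(tau, [round(p, 3) for p in softmax_probs(scores, tau)])
```

A low tau concentrates probability on the highest-scoring action, while a high tau approaches uniform sampling, so giving each worker a different tau is another way to diversify exploration.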
Metric | This Application | Software / AI Industry Norm
Total claims | 20 | 15 – 25
Independent claim count | 3 | 2 – 4
Dependent : Independent ratio | 5.67 : 1 | 4 – 8 : 1
Method claims present? | Yes (Claim 1) | Common
System / apparatus claims? | Yes (Claim 9) | Common
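The headline metrics follow directly from the dependency tree. A small sketch of the arithmetic, using the claim groupings listed above (the variable names are illustrative):

```python
# Independent claim -> range of claims that depend on it, per the tree above.
claim_tree = {
    1: range(2, 9),     # Claims 2-8
    9: range(10, 17),   # Claims 10-16
    17: range(18, 21),  # Claims 18-20
}

independent_count = len(claim_tree)
dependent_count = sum(len(deps) for deps in claim_tree.values())
total_claims = independent_count + dependent_count
ratio = dependent_count / independent_count

print(total_claims, independent_count, round(ratio, 2))
```

The uneven split (7, 7, and 3 dependents) is also visible here: the CRM claim carries fewer than half the fallbacks of the method and system claims.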
Drafting Quality
Drafting Quality Signals
The patent demonstrates strong structural coverage through its tripartite independent claim architecture (Claims 1, 9, 17) and provides adequate written description support in the detailed description for the core worker-based asynchronous training mechanism. However, two weaknesses stand out. First, the near-identical mirroring of dependent claims across all three independent claims (Claims 2–8 ≈ Claims 10–16 ≈ Claims 18–20) reduces the quality of fallback positions. Second, the purely functional claim language, which describes the RL training process without hardware-specific structural anchors, creates a significant §101 exposure risk.
Antecedent Basis
Antecedent basis is generally clean throughout the 20 claims. In Claim 1, "the worker" correctly refers back to "each worker" introduced in the preceding limitation, and "the actor associated with the worker" properly references "a respective actor" established earlier. "The deep neural network" in the gradient-determining step traces back to the preamble's "a deep neural network." No floating "the" references were identified across Claims 2–20 that lack proper antecedent basis.
FIG. 1 and the detailed description at col. 3–4 directly map to the "plurality of workers," "respective actor," and "shared memory" limitations of Claims 1, 9, and 17. FIG. 2 (steps 202–220) provides direct support for Claim 6's accumulated gradient update and conditional shared-memory write limitation. The Q-learning and SARSA embodiments in FIGS. 3–4 support the RL technique application recited in the independent claims, and FIG. 5 supports the policy network variant. No independent claim limitation lacks a corresponding description passage.
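The accumulate-then-conditionally-write pattern that FIG. 2 (steps 210–220) and Claim 6 describe can be sketched as follows. This is a minimal single-worker illustration: the write criterion (every `write_every` steps), the plain gradient-descent update, and all names are assumptions, since the source specifies only that a criterion is checked before writing to shared memory.

```python
from typing import List


class Worker:
    """Accumulates gradients locally and writes to shared parameters
    only when a (hypothetical) step-count criterion is satisfied."""

    def __init__(self, shared_params: List[float],
                 write_every: int, learning_rate: float) -> None:
        self.shared_params = shared_params  # stands in for shared memory
        self.write_every = write_every
        self.learning_rate = learning_rate
        self.accumulated = [0.0] * len(shared_params)
        self.steps = 0

    def step(self, gradient: List[float]) -> bool:
        """Add one gradient; return True if shared memory was written."""
        self.accumulated = [a + g for a, g in zip(self.accumulated, gradient)]
        self.steps += 1
        if self.steps % self.write_every != 0:
            return False  # criterion not met: keep accumulating
        for i, acc in enumerate(self.accumulated):
            self.shared_params[i] -= self.learning_rate * acc
        self.accumulated = [0.0] * len(self.shared_params)
        return True


shared = [1.0, 2.0]
worker = Worker(shared, write_every=2, learning_rate=0.5)
first_write = worker.step([1.0, 0.0])   # accumulates only
second_write = worker.step([1.0, 2.0])  # criterion met: writes and resets
print(first_write, second_write, shared)
```

Because each worker touches shared memory only when its criterion is met, the workers can run without synchronizing on every step.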
All three independent claims use "comprising" — the correct open-ended transition for a software/AI patent that should not exclude implementations with additional components such as a centralized parameter server or experience replay buffer. Claim 17's CRM preamble uses "storing instructions" followed by "comprising" for the operations, which is the accepted format for CRM claims at the USPTO. No restrictive "consisting of" or "consisting essentially of" transitions appear, which is strategically appropriate for a training system architecture where additional operational steps are likely.
No "means for" or "step for" language appears in any of the 20 claims, so the claims benefit from the presumption that §112(f) does not apply. The system claim (Claim 9) is drafted as "one or more computers" and "one or more storage devices" with recited structural components, rather than functional "means" elements. Functional language in the claims (e.g., "configured to operate independently," "configured to generate training data") is anchored to the structural computer/storage device elements. Under Williamson v. Citrix Online (Fed. Cir. 2015, en banc), that presumption can still be rebutted if a claim term fails to recite sufficiently definite structure, so this anchoring reduces §112(f) risk rather than eliminating it.
Claims 1, 9, and 17 present moderate Alice/Mayo exposure because the core inventive concept (asynchronous parallel RL training with diverse per-worker exploration policies) is framed as an abstract mathematical/computational process rather than a concrete hardware improvement. The §101 defense rests primarily on Claim 9's "one or more computers" and Claim 17's CRM anchor; however, in a district court challenge, an adversary could argue the exploration policy diversification is a mathematical concept practiced on generic computers. (Section 101 grounds are not available in inter partes review, which is limited to §102 and §103 challenges based on patents and printed publications.) The specification's reference to FPGA/ASIC implementations (col. 9–10) is helpful but not reflected in the claims as a structural limitation.
The 17 dependent claims are almost entirely duplicated across the three independent claims: Claims 2–8 mirror Claims 10–16, and Claims 18–20 mirror the first three of each set (a truncated copy). This mirroring adds enforcement coverage across claim types but provides no genuinely distinct technical fallback positions beyond what is already in Claims 2–8. Notably, Claims 3, 11, and 19 (epsilon sampling from a distribution) and Claims 6 and 14 (accumulated gradient with conditional shared-memory write) are the strongest fallbacks because they add specific mechanisms. However, there is no dependent claim directed to the target network synchronization frequency described at col. 5–6, nor to the specific RMSProp update rule mentioned in the spec, representing missed fallback opportunities.
The abstract states: "One of the systems includes a plurality of workers, wherein each worker is configured to operate independently of each other worker, and wherein each worker is associated with a respective actor that interacts with a respective replica of the environment during the training of the deep neural network." This accurately describes the system architecture but omits the distinguishing feature — per-worker diversified exploration policy parameterization — which is the key claim element over prior parallel RL approaches. An examiner reading only the abstract might not appreciate why the per-worker exploration policy differentiation is the novel contribution, potentially leading to an examiner search that misses relevant prior art combinations.
FIG. 1 supports all structural system claim limitations in Claims 9 and 17, clearly depicting workers (102A-N), actors (104A-N), environment replicas (106A-N), and shared memory (110). FIGS. 2 and 5 support the gradient accumulation and conditional shared-memory write steps in Claims 6 and 14. FIGS. 3 and 4 support Q-learning (Claim 8/16) and SARSA reinforcement learning technique embodiments referenced in the independent claims. However, no figure depicts the per-worker exploration policy parameter differentiation — the core novel element — which is described only in text, leaving the most important limitation without diagrammatic support.
Scorecard
Strategic Intent Scorecard
Multi-dimensional assessment of this application's patent strategy quality, based on claim structure, specification depth, and prosecution positioning.
Dimension | Score (out of 5.0)
Claim Breadth | 3.5
Prosecution Defensibility | 3.8
Spec–Claim Consistency | 4.2
Dependent Claim Coverage | 2.5
Claim Type Diversity | 4.5
Figure Support Quality | 3.5
Key observation: Claim Type Diversity scores highest (4.5/5.0) because the tripartite structure of method (Claim 1), system (Claim 9), and CRM (Claim 17) — with the CRM claim specifically tied to an industrial plant controller application — provides unusually robust enforcement options across different defendant profiles (software developers, cloud platform operators, and industrial automation vendors). Dependent Claim Coverage scores lowest (2.5/5.0) because the 17 dependent claims are mechanically duplicated across all three independent claims with no unique technical fallbacks — specifically, no claims are directed to the target network synchronization mechanism, the RMSProp optimizer variant, or the asynchronous gradient conflict resolution described in the specification. Practitioners should note that the absence of specific hardware-tied dependent claims creates vulnerability in any §101 challenge under Alice's second step.
A senior-attorney lens on the three highest-priority structural weaknesses — what each exposes in prosecution and litigation, and what a stronger filing would have done differently.
GAP 01 · HIGHEST IMPACT
No Claims Cover Asynchronous Target Network Synchronization Mechanism
Claims 1, 9, and 17 recite the core worker-based training loop but do not claim the asynchronous target network synchronization mechanism described at col. 5–6, wherein each worker periodically synchronizes target network parameters from shared memory less frequently than it writes parameter updates — a critical implementation detail distinguishing this architecture from naive asynchronous Q-learning. This creates a design-around path where a competitor could implement asynchronous parallel RL with a synchronized (rather than periodically-desynchronized) target network, avoiding Claim 8/16's Q-network limitation entirely while capturing the performance benefit. A stronger filing would have included a dependent claim reciting the criterion-based asynchronous target network synchronization with the specific threshold condition described at step 214/514 of FIGS. 2 and 5.
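To show what an unclaimed synchronization schedule of this kind might look like, the sketch below has a single worker write to shared parameters every few steps while refreshing its target-network copy on a longer interval. The intervals, the scalar parameter, and the unit increment standing in for a gradient write are all hypothetical; the analysis above says only that target synchronization happens less frequently than parameter writes.

```python
from typing import List, Tuple


def run_worker(total_steps: int, write_every: int,
               target_sync_every: int) -> List[Tuple[int, float, float]]:
    """Return (step, shared value, target value) after each step."""
    shared = {"w": 0.0}    # stands in for parameters in shared memory
    target = dict(shared)  # worker's target-network copy
    history = []
    for step in range(1, total_steps + 1):
        if step % write_every == 0:
            shared["w"] += 1.0      # placeholder for a gradient write
        if step % target_sync_every == 0:
            target = dict(shared)   # less frequent target refresh
        history.append((step, shared["w"], target["w"]))
    return history


# Writes every 2 steps, target refresh every 6 steps (illustrative values).
history = run_worker(total_steps=12, write_every=2, target_sync_every=6)
for row in history:
    print(row)
```

Between refreshes the target copy lags the shared parameters, which is the behaviour a dependent claim on synchronization frequency would need to capture.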
GAP 02 · HIGH IMPACT
CRM Claim Unnecessarily Narrowed to Industrial Plant Controller
Claim 17's CRM preamble is uniquely restricted to storing instructions "to train an industrial plant controller that controls operation of an industrial plant" — a limitation absent from both the method Claim 1 and system Claim 9, which recite only a generic "agent that interacts with an environment." This asymmetric restriction exposes a significant coverage gap: a competitor deploying asynchronous deep RL for robotics, game AI, autonomous vehicle control, or financial trading systems can operate outside Claim 17 while potentially being within Claims 1 and 9. If Claims 1 and 9 are invalidated, Claim 17 alone would provide no protection for the majority of real-world RL deployment scenarios. A stronger filing would have maintained domain-agnostic language in the CRM independent claim, reserving the industrial plant controller as a dependent claim embodiment.
GAP 03 · HIGH IMPACT
No Apparatus Claim for Distributed Multi-Machine Architecture
US 11,783,182 B2 protects methods, systems, and computer-readable media for asynchronous deep reinforcement learning using a plurality of independently-operating workers, each associated with a distinct actor and environment replica with per-worker exploration policy parameters. The patent specifically claims the mechanism where each worker's exploration policy is parameterized differently from every other worker, enabling diverse parallel exploration of an environment without requiring a centralized replay memory, while workers collectively train a shared deep neural network by writing gradient-based parameter updates to a common shared memory.
US 11,783,182 B2 is owned by DeepMind Technologies Limited, London, GB. The inventors are Volodymyr Mnih (Toronto, CA), Adrià Puigdomènech Badia (London, GB), Alexander Benjamin Graves (London, GB), Timothy James Alexander Harley (London, GB), David Silver (Hitchin, GB), and Koray Kavukcuoglu (London, GB).
Claim 1 is a method claim covering training a deep neural network using plural independent workers, each with a distinct exploration policy, to generate training data, applying a reinforcement learning technique to determine gradients, and updating DNN parameters. Claim 9 is a system claim covering one or more computers and storage devices storing instructions that, when executed, implement the same multi-worker asynchronous DNN training operations. Claim 17 is a CRM claim covering non-transitory computer storage media storing instructions that, when executed, train an industrial plant controller using the same asynchronous multi-worker deep RL method.
This patent covers a way to train AI systems faster by having many software workers learn simultaneously and independently. Instead of one worker learning at a time, multiple workers each observe a copy of the environment, try different strategies (with each worker using a slightly different exploration approach), and periodically share what they've learned with a central memory. This removes the need for a large experience replay database and allows training to scale across many processors, making AI training significantly faster and more efficient.
G06N 3/08 (2023.01) — Learning methods for artificial neural networks. G06N 3/045 (2023.01) — Combinations of neural networks in artificial neural network architectures. G06N 3/04 (2013.01) — Architectures of artificial neural networks.
Disclaimer: This analysis is generated by PatSnap Eureka AI based on publicly available patent data from the USPTO. It does not constitute legal advice and should not be relied upon as such. Patent data may be subject to change as prosecution progresses. Scores and assessments reflect automated analysis and may not capture all relevant legal or technical nuances. Always consult a qualified patent attorney for formal legal opinions on patentability, freedom to operate, or infringement.