Patent Drafting Analysis of Google Inc.’s Video Annotation Using Deep Network Architectures | US 9,330,171 B1
IP Drafting Analysis · US 9,330,171 B1
A structural and strategic analysis of Google's CNN-based video annotation patent, examining claim architecture, drafting quality, critical gaps, and prosecution positioning across method, CRM, and system claim types.
US 9,330,171 B1 · Filed: Jan 22, 2014 · Granted: May 3, 2016 · G06K 9/64 · G06F 17/30 · G06T 3/00
Published by PatSnap Insights Team · 11 min read · Verified by PatSnap Eureka Data
Overview
Structural Overview
The detailed description dominates at approximately 55% of total specification words (~3,200 words), with substantial attention to five CNN fusion models (FIGS. 3A–3E) and the bi-channel decomposition approach (FIG. 4). The claim architecture comprises 18 claims across 3 independent claims — a method (Claim 1), a CRM (Claim 9), and a system (Claim 15) — yielding a 5:1 dependent-to-independent ratio that is lean but adequate for this software/AI IPC class. The 11 figure sheets provide strong visual support for the CNN architectures and process flows but contain no UI/interface depictions, leaving search-result presentation aspects underpowered.
Figure Inventory — 11 Sheets
| Figure | Description | Role |
|---|---|---|
| FIG. 1 | System architecture 100 showing content sharing platform 108 with annotation subsystem 112, connected to data stores 102A–102Z and client devices 110A–110Z via network 106. | System architecture |
| FIG. 2 | Operation of annotation subsystem 112 converting raw video contents 202 (with hyperlinks 204A–204Z) into annotated video content 206 with annotations 210A–210Z. | Claim support |
| FIG. 3A | Single-frame CNN model 310 applied to video content 300, showing filter/pooling layers 302.1–302.6, hidden layers 304/306, and softmax layer 308 processing one frame at time t. | Key embodiment |
| FIG. 3B | Early fusion CNN model processing a stack of consecutive video frames F7–F11 together through shared filter/pooling layers 302.1–302.6 into connected neuron network 314. | Key embodiment |
| FIG. 3C | Late fusion CNN model where non-consecutive frames F1, F7, and F16 are each individually processed through separate filter/pooling layers 302.1–302.6 before joining connected neuron network 314. | Key embodiment |
| FIG. 3D | Hybrid fusion CNN model grouping frames F1–F4, F6–F9, and F13–F16 into three groups, each processed by filter/pooling layers 302.1–302.6 before feeding into connected neuron network 314. | Key embodiment |
| FIG. 3E | Progressive fusion CNN model showing hierarchical filter/pooling layers 302.1–302.6 with increasing temporal extent across lower, middle, and higher levels applied to video content 300. | Key embodiment |
| FIG. 4 | Bi-channel decomposition model showing subsample decomp (lower-resolution 3×3 grid) and fovea decomp (center-cropped region) fed into separate filter/pooling layer stacks 302.1–302.6 entering the same CNN. | Key embodiment |
| FIG. 5 | Flow diagram of basic annotation method: retrieve video content (502), select stack of frames (504), apply CNN to generate annotation (506), make annotated content searchable (508). | Flow diagram |
| FIG. 6 | Flow diagram of bi-channel annotation method: receive video (602), select frame stack (604), subsample frames for first representation (606), select sub-region for second representation (608), apply CNN to both (610). | Flow diagram |
| FIG. 7 | Block diagram of computer system 700 including processor 702, main memory 704, static memory 706, data storage device 718 with computer-readable medium 724, network interface 722, and bus 708. | Claim support |
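The frame groupings described for FIGS. 3B–3D can be made concrete with a small sketch. It only models which frames feed each filter/pooling stack, using the frame numbers quoted in the figure descriptions above; the 16-frame clip length, the `FUSION_FRAME_GROUPS` table, and the `select_frames` helper are illustrative assumptions, not names from the patent.

```python
from typing import Dict, List

# 1-based frame numbers taken from the figure descriptions above.
# Each inner list is the set of frames fed to one filter/pooling stack.
FUSION_FRAME_GROUPS: Dict[str, List[List[int]]] = {
    "early": [list(range(7, 12))],   # FIG. 3B: one stack of consecutive frames F7-F11
    "late": [[1], [7], [16]],        # FIG. 3C: non-consecutive frames, one stack each
    "hybrid": [                      # FIG. 3D: three groups of consecutive frames
        list(range(1, 5)),
        list(range(6, 10)),
        list(range(13, 17)),
    ],
}


def select_frames(clip: List[str], model: str) -> List[List[str]]:
    """Return one list of frames per filter/pooling stack for the given model."""
    return [[clip[i - 1] for i in group] for group in FUSION_FRAME_GROUPS[model]]


# Hypothetical 16-frame clip; frames are stand-in labels rather than images.
clip = [f"F{i}" for i in range(1, 17)]
for model in FUSION_FRAME_GROUPS:
    print(model, select_frames(clip, model))
```

The single-frame model of FIG. 3A and the progressive model of FIG. 3E are left out because the inventory gives no specific frame numbers for them.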
Analysis powered by PatSnap Eureka. Patent text and figures publicly available from USPTO.
Claims
Claim Architecture Analysis
The patent contains 3 independent claims: Claim 1 (method), Claim 9 (non-transitory machine-readable storage medium / CRM), and Claim 15 (system), providing a tripartite structure across all major claim types. The 15 dependent claims yield a 5:1 dependent-to-independent ratio, which sits at the bottom of the 5–10:1 software/AI norm and leaves the claim set with fewer fallback positions than optimal. The tripartite structure provides enforcement coverage across method execution, stored-instruction, and hardware-system formats, but the dependent claims are largely parallel across the three independent claims rather than additive.
Core inventive concept: The claims address the computational cost of processing full-resolution video frames through a CNN by decomposing each frame into two complementary representations: a spatially subsampled first representation at lower resolution and a sub-region (fovea) second representation "at the first resolution" covering a smaller spatial area — then executing the CNN with both representations as separate inputs to generate a video annotation, as recited in Claims 1, 9, and 15.
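The two-representation decomposition can be illustrated with a minimal NumPy sketch. It assumes subsampling by keeping every `factor`-th pixel and a center crop whose size matches the subsampled output; the function name `bichannel_decompose`, the `factor` parameter, and the crop size are hypothetical choices for illustration. The CNN itself is omitted.

```python
import numpy as np


def bichannel_decompose(frame: np.ndarray, factor: int = 2):
    """Split one frame into a subsampled view and a center (fovea) crop."""
    height, width = frame.shape[:2]

    # First representation: keep every `factor`-th pixel, so the whole frame
    # is covered at a lower resolution.
    subsampled = frame[::factor, ::factor]

    # Second representation: crop the center at the original resolution, so it
    # covers a smaller spatial area with no pixels discarded inside the crop.
    crop_h, crop_w = height // factor, width // factor
    top, left = (height - crop_h) // 2, (width - crop_w) // 2
    fovea = frame[top:top + crop_h, left:left + crop_w]

    return subsampled, fovea


# Toy 8x8 RGB frame; a real frame would come from decoded video.
frame = np.arange(8 * 8 * 3).reshape(8, 8, 3)
subsampled, fovea = bichannel_decompose(frame, factor=2)
print(subsampled.shape, fovea.shape)
```

With `factor=2` each representation holds a quarter of the original pixels, so together the two inputs carry half the pixels of the full-resolution frame. This is the kind of saving the claims target, although the patent's actual ratios may differ.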
Independent Claim Dissection
| Claim | Preamble | Transition | Key Body Elements |
|---|---|---|---|
| Claim 1 | A method | comprising | receiving video content by processing device of content sharing platform; selecting at least one video frame at first resolution; subsampling frame to generate first representation at lower second resolution; selecting sub-region at first resolution to generate second representation covering smaller spatial area; executing CNN using first representation as first input and second representation as second input to generate annotation |
| Claim 9 | A non-transitory machine-readable storage medium storing instructions which, when executed, cause a processing device to perform operations | comprising | receiving video content; selecting at least one video frame at first resolution; subsampling frame to generate lower-resolution first representation; selecting sub-region at first resolution to generate smaller-area second representation; executing CNN using first and second representations as separate inputs to generate annotation |
| Claim 15 | A system | comprising | a memory; a processor operatively coupled to memory configured to: receive video content; select at least one video frame at first resolution; subsample frame to generate lower-resolution first representation; select sub-region at first resolution to generate smaller-area second representation; execute CNN using first and second representations as separate inputs to generate annotation |
Claim Dependency Tree
1 Method: receive video, subsample frame (first rep), select sub-region (second rep), execute CNN on both inputs to generate annotation
  2 Adds: second representation is a fovea representation at same spatial sampling rate as the video frame
  4 Adds: at least one video frame includes one of at least two consecutive or at least two non-consecutive frames
  5 Adds: CNN includes at least one convolution layer, at least one pooling layer, and a connected neuron network
    6 Further: first convolution/pooling layer applied to first number of frames, second to different number of frames (depends on Claim 5)
    7 Further: earlier CNN layer applied to higher number of frames than later layer (progressive fusion; depends on Claim 5)
  8 Adds: further comprising making video content searchable according to the annotation
9 CRM: store instructions to receive video, subsample frame (first rep), select sub-region (second rep), execute CNN on both inputs to generate annotation
  10 Adds: second representation is fovea at same spatial sampling rate as video frame
  12 Adds: at least one video frame includes at least two consecutive or at least two non-consecutive frames
  13 Adds: CNN includes at least one convolution layer, at least one pooling layer, and connected neuron network
  14 Further: first convolution/pooling applied to first number of frames, second to different number (depends on Claim 11)
15 System: memory + processor to receive video, subsample frame (first rep), select sub-region (second rep), execute CNN on both inputs to generate annotation
  16 Adds: second representation is fovea at same spatial sampling rate as video frame
  17 Adds: CNN includes at least one convolution layer, at least one pooling layer, and connected neuron network
    18 Further: first convolution/pooling applied to first number of frames, second to different number (depends on Claim 17; titled "user device of claim 17")
| Metric | This Application | Software / Cloud Norm |
|---|---|---|
| Total claims | 18 | 15 – 25 |
| Independent claim count | 3 | 2 – 4 |
| Dependent : Independent ratio | 5.00 : 1 | 5 – 10 : 1 |
| Method claims present? | Yes (Claim 1) | Always |
| System / apparatus claims? | Yes (Claim 15) | Common |
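The ratio in the metrics above follows directly from the claim counts. A short sketch of the arithmetic, assuming the 5–10 : 1 norm band quoted in the table; the function name and the `NORM_LOW`/`NORM_HIGH` constants are illustrative.

```python
def dependent_to_independent_ratio(total_claims: int, independent_claims: int) -> float:
    """Dependent claims per independent claim."""
    dependent_claims = total_claims - independent_claims
    return dependent_claims / independent_claims


# Counts from the metrics table: 18 total claims, 3 independent.
ratio = dependent_to_independent_ratio(total_claims=18, independent_claims=3)

# Norm band as quoted in the table (5 to 10 dependents per independent claim).
NORM_LOW, NORM_HIGH = 5.0, 10.0
within_norm = NORM_LOW <= ratio <= NORM_HIGH

print(f"{ratio:.2f} : 1, within norm: {within_norm}")
```

The result lands exactly on the lower bound of the band, which is why the claim set reads as lean rather than deficient.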
Drafting Quality
Drafting Quality Signals
The filing's strongest dimension is its tripartite independent claim structure (Claims 1, 9, 15) ensuring enforcement coverage across method, CRM, and system formats, with each independent claim tightly anchored to the bi-channel input mechanism. The most significant weakness is the heavy parallelism of dependent claims across the three independent claims, with nearly identical limitations repeated in Claims 2/10/16, 5/13/17, and 6/14/18, rather than introducing genuinely distinct fallback positions that could survive a broader rejection.
✅ Antecedent Basis
Antecedent basis is generally clean across all 18 claims. In Claim 1, "the convolutional neuron network" is properly introduced as "a convolutional neuron network" in the preceding limitation. Similarly, "the first representation" and "the second representation" each have clear antecedents in their respective subsampling and sub-region selection steps. One minor flag: Claim 18 refers to "The user device of claim 17" though Claim 15 is a "system" claim — this preamble inconsistency could invite an examiner objection regarding claim clarity under §112(b).
All independent claim limitations map directly to specification sections and figures. The subsampling-to-first-representation limitation in Claim 1 is supported by FIG. 4 ("SUBSAMPLE DECOMP") and the detailed description at cols. 8–9 describing pixel discarding. The sub-region (fovea) second representation maps to FIG. 4 ("FOVEA DECOMP") and cols. 8–10 describing the central crop. The CNN execution with dual inputs maps to the bi-channel model described in col. 9 and FIG. 4, with the connected neuron network structure supported by FIGS. 3A–3E and cols. 6–8.
All three independent claims use "comprising" as the transition, which is the strategically correct choice for an open-ended software/AI method — it avoids foreclosing embodiments that include additional steps such as the search-availability step in dependent Claim 8. No claim uses "consisting of" or "consisting essentially of," which would inappropriately narrow scope in this AI/software domain. The "comprising" transition properly permits infringement even where an accused system includes additional preprocessing steps not recited in Claims 1, 9, or 15.
No explicit "means for" language appears in the claims, reducing §112(f) invocation risk. However, Claim 15's processor limitations use purely functional language — "to receive," "to select," "to subsample," "to execute" — without reciting specific structural components beyond "a processor, operatively coupled to the memory." Under post-Williamson v. Citrix doctrine, a challenger could argue that these nonce functional recitations invoke §112(f) for the processor element, particularly as the specification does not separately define the processor's structure for each function beyond the generic FIG. 7 computer system.
The claims carry moderate Alice exposure because the core concept — applying a CNN to two representations of a video frame to generate an annotation — could be characterized at step one as an abstract idea of mathematical processing of data. The hardware tie-in in Claims 1 and 9 is thin: Claim 1 recites only "a processing device of a content sharing platform" and Claim 9 is a CRM claim. Claim 15's system recitation of a physical "memory" and "processor" provides the strongest §101 defense. The specific dual-input architecture (first and second representations as separate CNN inputs) constitutes a technical improvement over prior art feature-extraction approaches, as articulated in col. 3, which should support an Enfish-style §101 argument.
The dependent claims provide few genuinely distinct fallback positions because the key limitations are simply replicated across all three independent claims: Claims 2, 10, and 16 all add the fovea same-sampling-rate limitation; Claims 5, 13, and 17 all add the CNN structural components; Claims 6, 14, and 18 all add the multi-level frame-count differentiation. The genuinely valuable fallbacks are Claim 7 (progressive fusion, where an earlier layer covers more frames) and Claim 8 (making content searchable). Both appear only in the method chain and have no CRM or system equivalents, creating unprotected gaps.
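The chain-by-chain comparison can be checked mechanically. The sketch below encodes only the dependent claims listed in the dependency tree (Claims 3 and 11 are not described there and are left out) and reports which limitations each chain lacks; the limitation labels and the `missing_limitations` helper are illustrative names, not claim language.

```python
from typing import Dict, List

# Limitation label -> claim number, per chain, as listed in the dependency tree.
CHAINS: Dict[str, Dict[str, int]] = {
    "method": {
        "fovea_same_sampling_rate": 2,
        "frame_selection": 4,
        "cnn_structure": 5,
        "per_layer_frame_counts": 6,
        "progressive_fusion": 7,
        "searchable": 8,
    },
    "crm": {
        "fovea_same_sampling_rate": 10,
        "frame_selection": 12,
        "cnn_structure": 13,
        "per_layer_frame_counts": 14,
    },
    "system": {
        "fovea_same_sampling_rate": 16,
        "cnn_structure": 17,
        "per_layer_frame_counts": 18,
    },
}


def missing_limitations(chains: Dict[str, Dict[str, int]]) -> Dict[str, List[str]]:
    """For each chain, list limitations that appear in some other chain but not in it."""
    all_limitations = set().union(*(chain.keys() for chain in chains.values()))
    return {
        name: sorted(all_limitations - set(chain))
        for name, chain in chains.items()
    }


gaps = missing_limitations(CHAINS)
for chain_name, missing in gaps.items():
    print(chain_name, missing)
```

On the tree as listed, the CRM and system chains both lack progressive fusion and searchability, and the system chain also has no counterpart to the frame-selection limitation of Claims 4 and 12.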
The abstract ("A method includes receiving... a video content, selecting at least one video frame... subsampling... selecting a sub-region... and applying a convolutional neuron network to the first and second representations") accurately identifies the method steps but omits the novel architectural contribution — that the CNN receives both representations simultaneously as separate first and second inputs, which is the key distinguishing mechanism over prior art single-input CNN approaches. An examiner reading only the abstract would likely characterize this as generic CNN-based video annotation without appreciating the bi-channel dual-input architecture.
Figure support for the independent claim limitations is strong. FIG. 4 directly maps to the dual-input bi-channel architecture of Claims 1/9/15 with labeled SUBSAMPLE DECOMP and FOVEA DECOMP components feeding separate CNN channel stacks. FIGS. 3A–3E support the CNN architecture limitations of Claims 5/13/17 with detailed depictions of filter/pooling layers 302.1–302.6, hidden layers 304/306, and softmax 308. FIG. 6 provides process-level support for the method steps of Claim 1. The one gap is the absence of any figure depicting the annotation output format (keywords, tags, ranked lists) referenced in the detailed description.
Scorecard
Strategic Intent Scorecard
Multi-dimensional assessment of this application's patent strategy quality, based on claim structure, specification depth, and prosecution positioning.
| Dimension | Score (out of 5.0) |
|---|---|
| Claim Breadth | 3.5 |
| Prosecution Defensibility | 3.2 |
| Spec–Claim Consistency | 4.2 |
| Dependent Claim Coverage | 2.8 |
| Claim Type Diversity | 4.5 |
| Figure Support Quality | 4.0 |
Key observation: The highest-scoring dimension is Claim Type Diversity (4.5/5.0) — Google's tripartite structure across method (Claim 1), CRM (Claim 9), and system (Claim 15) ensures enforcement coverage regardless of how an accused infringer implements the annotation process, a well-executed prosecution strategy. The lowest-scoring dimension is Dependent Claim Coverage (2.8/5.0) — the dependent claims largely mirror each other across the three independent claim chains (e.g., Claims 2/10/16, 5/13/17, 6/14/18 are near-identical), with the valuable progressive fusion limitation of Claim 7 and the searchability step of Claim 8 appearing only in the method chain and having no CRM or system equivalents. Practitioners reviewing this patent for continuation or design-around purposes should note that adding equivalent progressive fusion and search-availability dependent claims to the CRM and system chains in a continuation filing would substantially strengthen the fallback positions.
A senior-attorney lens on the three highest-priority structural weaknesses — what each exposes in prosecution and litigation, and what a stronger filing would have done differently.
GAP 01 · HIGHEST IMPACT
Progressive Fusion Limitation Absent from CRM and System Claims
The progressive fusion limitation — where an earlier CNN layer is applied to a higher number of video frames than a later layer — appears only as Claim 7 (dependent on Claim 5, dependent on method Claim 1) and has no equivalent in the CRM chain (Claims 9–14) or the system chain (Claims 15–18). This structural gap means a competitor implementing progressive fusion on a server (system claim) or in stored software (CRM claim) can design around the patent's most technically sophisticated embodiment while the method claim remains enforceable. A stronger filing would have added parallel progressive-fusion dependent claims under both Claims 9 and 15 to close this design-around corridor.
GAP 02 · HIGH IMPACT
No Claim on Training Data Generation or Automated Labeling
The specification at column 4 describes a technically significant and commercially valuable feature: automatically generating large-scale training data by harvesting already-annotated video clips from the Internet (e.g., videos tagged "mountain biking") and using them to train the CNN without manual labeling. This automated training pipeline is described as a key advantage over prior art feature-based approaches that require manual labeling. However, no independent or dependent claim recites any limitation relating to training data generation, automated label harvesting, or the CNN training process itself — leaving this entire layer of the invention unprotected. A stronger filing would have included at least one independent method claim covering the automated training-data generation step.
GAP 03 · HIGH IMPACT
No Apparatus Claim for Content Sharing Platform as Standalone System
US 9,330,171 B1 protects a method, non-transitory machine-readable storage medium, and system for automatically generating annotations for video content by applying a convolutional neuron network (CNN). The specific technical problem solved is the computational cost of processing full-resolution video frames through a CNN; the solution uses a dual-input bi-channel approach where the video frame is simultaneously decomposed into a low-resolution subsampled representation and a full-resolution fovea sub-region representation, both fed as separate inputs to the CNN to generate the annotation.
US 9,330,171 B1 is owned by GOOGLE INC., headquartered in Mountain View, California, USA. The inventors are Sanketh Shetty (Sunnyvale, CA), Andrej Karpathy (Stanford, CA), and George Dan Toderici (Mountain View, CA).
Claim 1 is a method claim covering the steps of receiving video content, selecting a video frame, subsampling it to generate a lower-resolution first representation, selecting a sub-region at original resolution to generate a second fovea representation, and executing a CNN with both representations as separate inputs to generate an annotation. Claim 9 is a CRM (non-transitory machine-readable storage medium) claim reciting the same operations as stored instructions. Claim 15 is a system claim comprising a memory and a processor configured to perform the same video frame decomposition and CNN annotation operations.
This patent covers a technique for automatically labeling (annotating) video clips using artificial intelligence. Instead of having humans watch videos and type in descriptive keywords, the system uses a type of AI called a convolutional neural network (CNN) to analyze video frames and generate descriptive tags automatically. The innovation is that rather than feeding the AI the entire high-resolution video frame — which would be slow and computationally expensive — the system feeds it two smaller pieces simultaneously: a downsampled low-resolution version of the whole frame, and a full-resolution crop of just the center region where the action is typically happening.
G06K 9/64 (2006.01): Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints, using a combination of methods.
G06F 17/30 (2006.01): Information retrieval; Database structures therefor; File system structures therefor.
G06T 3/00 (2006.01): Geometric image transformation in the plane of the image.
Disclaimer: This analysis is generated by PatSnap Eureka AI based on publicly available patent data from the USPTO. It does not constitute legal advice and should not be relied upon as such. Patent data may be subject to change as prosecution progresses. Scores and assessments reflect automated analysis and may not capture all relevant legal or technical nuances. Always consult a qualified patent attorney for formal legal opinions on patentability, freedom to operate, or infringement.