Patent Drafting Analysis of OpenAI OpCo, LLC’s Multimodal Machine Learning Model Interaction System | US 12,039,431 B1
IP Drafting Analysis · US 12,039,431 B1
A structural and strategic analysis of OpenAI's granted patent covering GUI-based contextual prompt interaction with multimodal LLMs, examining claim architecture, drafting quality, critical gaps, and prosecution positioning across method and system claim types.
Published by PatSnap Insights Team · 12 min read · Verified by PatSnap Eureka Data
Overview
Structural Overview
The detailed description dominates at approximately 63% of total words (~5,800 words), reflecting thorough operational scenario coverage across all six figure groups (FIGs. 3A–3F, 4, 5A–5B, 6A–6B, 7, 8). The claim set comprises exactly 20 claims — 2 independent (method Claim 1, system Claim 13) and 18 dependent — yielding a 9:1 dependent-to-independent ratio that provides layered fallback but concentrates risk on just two independent claims. Figure coverage spans 12 drawing sheets addressing UI states, process flows, hardware environments, and ML platform architecture, though no figure explicitly depicts the tokenization concatenation process that is central to Claims 2–5 and 14–16.
Figure Inventory — 12 Sheets
Figure | Description | Role
FIG. 1 | System architecture showing Image Processing System (120), Machine Learning System (130), Network (110), and User Device (140) interconnected. | System architecture
FIG. 2 | Flow diagram of Method 200 showing five sequential steps: Provide GUI (210), Receive Contextual Prompt (220), Generate Input Data (230), Generate Textual Response (240), Provide Textual Response (250). | Flow diagram
FIG. 3A | UI view (300a) showing Image 310 (a garden of herbs) displayed on a user device GUI before any contextual prompt is applied. | UI/interface
FIG. 3B | UI view (300b) showing Image 310 with a small Loupe annotation (320a) applied as a contextual prompt indicating a two-leaf area of emphasis. | UI/interface
FIG. 3C | UI view (300c) showing Image 310 with a Resized Loupe (320b) that now encompasses the whole plant, demonstrating the annotation resize feature. | Claim support
FIG. 3D | UI view (300d) showing Image 310 with a crudely drawn cross Mark (300a) as a contextual prompt indicating an area of emphasis. | UI/interface
FIG. 3E | UI view (300e) showing Image 310 with an unclosed loop Mark (300b) as a differently shaped contextual prompt annotation. | UI/interface
FIG. 3F | UI view (300f) showing Image 310 with a segmented object (340) highlighted as a contextual prompt, demonstrating segmentation-tool annotation. | Claim support
FIG. 4 | Flow diagram of Method 400 showing the prompt suggestion generation loop: Obtain Contextual Prompt (410), Generate Input Data (420), Generate Prompt Suggestion (430), Display GUI with suggestions (440), branch to Receive Updated Prompt (450a) or Receive Selection (450b), Generate Response (460), Provide Response (470). | Flow diagram
FIG. 5A | UI view (500a) showing Image 310 displayed with two prompt suggestions (510a "What Plants are These?", 510b "What Greens Are Good for Salads?") and Textual Prompt Input Interface (520) when no contextual prompt has been applied. | UI/interface
FIG. 5B | UI view (500b) showing Image 310 with Resized Loupe (320b) applied and updated prompt suggestions (510c "What kind of basil is this?", 510d "Basil uses in cooking") conditioned on the loupe annotation. | Claim support
FIG. 6A | Split-screen UI view (600a) showing Image 310 with Mark 610 at one location alongside prompt suggestions (620a) "What kind of basil is this?" and "Basil uses in cooking", with Textual Prompt Input Interface 520. | UI/interface
FIG. 6B | Split-screen UI view (600b) showing Image 310 with Mark 610 moved to a different location generating new prompt suggestions (620b) "Why is cilantro soapy?" and "Recipes using cilantro", illustrating location-dependent prompt suggestions. | Claim support
FIG. 7 | Block diagram of computing environment 700 showing Computing Device 702 with Memory 704, Processor 706, Data Storage 708, Other Hardware 710, User Interface 712, Network Interface 714, connected to I/O Devices 718 and Networks 716, with Configured Medium 720. | System architecture
FIG. 8 | Block diagram of ML platform 800 showing Data Input Engine 810, Featurization Engine 820, ML Modeling Engine 830 (with Model Selector 832, Parameter Engine 834, Model Generation 836), Predictive Output Generation Engine 840, Output Validation Engine 850, Model Refinement Engine 860, Feedback Engine 870, Outcome Metric 880, and ML Algorithms Database 890. | System architecture
Analysis powered by PatSnap Eureka. Patent text and figures publicly available from USPTO.
Claims
Claim Architecture Analysis
The patent contains exactly 2 independent claims (method Claim 1 and system Claim 13), providing dual claim-type coverage but no computer-readable medium (CRM) claim, which is a notable structural gap. The 18 dependent claims yield a 9.0:1 dependent-to-independent ratio, just above the software/AI industry norm of 4–8:1, creating extensive fallback positions. The system dependents (Claims 14–20) parallel the method dependents (Claims 2–12) in condensed form, a prosecution strategy that maximises enforcement options across both claim types while relying on a single inventive concept.
Core inventive concept: The claims solve the problem of tedious text-only descriptions of image regions by enabling a user to provide a GUI-based contextual prompt — such as a click, loupe, marker, or segmentation annotation — that "indicates an area of emphasis in the image," whereupon the multimodal machine learning model is "configured to condition the textual response to the image on the contextual prompt," producing a targeted response without requiring the user to textually describe the image region.
Independent Claim Dissection
Claim | Preamble | Transition | Key Body Elements
Claim 1 | A method of interacting with a pre-trained multimodal machine learning model, the method | comprising | providing a GUI configured to enable user interaction with an image to generate a contextual prompt indicating area of emphasis; receiving the contextual prompt; generating input data using image and contextual prompt; generating a textual response by applying input data to multimodal ML model configured to condition textual response to image on contextual prompt; providing textual response to user; wherein textual response comprises a prompt suggestion and providing response comprises displaying a selectable control in GUI
Claim 13 | A system for interacting with a pre-trained multimodal machine learning model, the system | comprising | at least one processor; at least one non-transitory computer readable medium containing instructions that cause system to: provide GUI configured to enable user interaction with image to generate contextual prompt indicating area of emphasis; receive contextual prompt; generate input data; generate textual response by applying input data to multimodal ML model configured to condition response on contextual prompt; provide textual response; wherein textual response comprises prompt suggestion and providing response comprises displaying selectable control programmed to enable user to select prompt suggestion
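To make the sequence of operative steps in the independent claims easier to follow, here is a minimal Python sketch of the flow: contextual prompt in, input data generated, model applied, prompt suggestions returned as selectable controls. It is an illustration only, not the patentee's implementation. Every name in it (`ContextualPrompt`, `InputData`, `generate_input_data`, `stub_model`, `interact`, the `<loc ...>` token format, and the `herb_garden.png` reference) is hypothetical, and the model is a canned placeholder rather than a real multimodal model. The coordinate token loosely follows the "textual prompt or token" pathway summarised for Claims 4 and 5.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ContextualPrompt:
    """Hypothetical GUI annotation: a click or loupe centred at (x, y)."""
    x: int
    y: int
    radius: int = 0  # 0 for a plain click, > 0 for a loupe-style area


@dataclass(frozen=True)
class InputData:
    """Hypothetical model input: an image reference plus a location token."""
    image_ref: str
    location_token: str


def generate_input_data(image_ref: str, prompt: ContextualPrompt) -> InputData:
    """Combine the image with a token derived from the contextual prompt.

    The token format is an assumption made for this sketch.
    """
    token = f"<loc x={prompt.x} y={prompt.y} r={prompt.radius}>"
    return InputData(image_ref=image_ref, location_token=token)


# A "model" here is any callable mapping input data to textual suggestions.
Model = Callable[[InputData], List[str]]


def stub_model(data: InputData) -> List[str]:
    """Placeholder for the multimodal model; returns canned suggestions."""
    return [
        f"What is shown at {data.location_token}?",
        "Tell me more about this area",
    ]


def interact(image_ref: str, prompt: ContextualPrompt, model: Model) -> List[Dict[str, str]]:
    """Run the sketched flow and wrap each suggestion as a selectable control."""
    data = generate_input_data(image_ref, prompt)   # generate input data
    suggestions = model(data)                       # generate textual response
    return [{"type": "selectable_control", "label": s} for s in suggestions]


controls = interact("herb_garden.png", ContextualPrompt(x=120, y=80, radius=25), stub_model)
for control in controls:
    print(control["label"])
```

Passing the model as a callable keeps the sketch independent of any particular ML library; swapping `stub_model` for a real model wrapper would not change the surrounding flow.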
Claim Dependency Tree
1 Method: GUI-based contextual prompt interaction with pre-trained multimodal ML model; response conditioned on image area of emphasis
  2 Adds: generating input data comprises generating updated image based on contextual prompt and generating input data using updated image
  3 Adds: generating input data comprises generating segmentation mask by providing image and contextual prompt to segmentation model; input data generated using image and segmentation mask
  4 Adds: generating input data comprises generating a textual prompt or token using the contextual prompt; and generating input data using image and textual prompt or token
    5 Further: (dep. on 4) textual prompt or token indicates coordinates of a location in the image
  6 Adds: receiving contextual prompt comprises detecting a user human interface device interaction
  7 Adds: GUI includes an annotation tool; contextual prompt comprises an annotation generated using annotation tool
    8 Further: (dep. on 7) annotation tool includes a loupe, a marker, or a segmentation tool
    9 Further: (dep. on 7) GUI enables user to resize an area of effect of annotation tool
  10 Adds: further includes receiving a textual prompt from user; input data further generated using textual prompt; ML model further conditions response on textual prompt
  11 Adds: in response to selection of control by user, generating second input data using prompt suggestion and image; generating second response; providing second response
  12 Adds: contextual prompt indicates object depicted in image; textual response provides information about depicted object; textual response displayed as virtual button in GUI
13 System: processor + non-transitory CRM; same operative steps as Claim 1 including GUI, contextual prompt, input data generation, ML model response conditioning, selectable prompt suggestion control
  14 Adds: (dep. on 13) generating input data comprises generating updated image based on contextual prompt
  15 Adds: (dep. on 13) generating input data comprises generating segmentation mask via segmentation model
  16 Adds: (dep. on 13) generating input data comprises generating textual prompt or token with coordinates of a location in image
  17 Adds: (dep. on 13) GUI includes annotation tool; contextual prompt comprises annotation generated using annotation tool
    18 Further: (dep. on 17) annotation tool includes loupe/marker/segmentation; GUI enables resize; further operations include receiving textual prompt and further conditioning response
  19 Adds: (dep. on 13) in response to selection of control, generating second input data and second response using prompt suggestion
    20 Further: (dep. on 19) contextual prompt indicates object; textual response provides information about object; displayed as virtual button
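The claim-count figures quoted in this analysis (2 independent claims, 18 dependent claims, a 9.0:1 ratio) can be checked directly from the dependency tree. The short Python sketch below encodes each claim's parent and derives those numbers. One assumption is made: where the tree gives no "dep. on" note for a method dependent (Claims 2, 3, 4, 6, 7, 10, 11, 12), the claim is taken to depend directly on Claim 1. The names `PARENT`, `root_of`, `depth`, and `NORM_RATIO_RANGE` are illustrative only.

```python
from typing import Dict, List, Optional

# Parent of each claim; None marks an independent claim.
# Assumption: method dependents without a "dep. on" note depend on Claim 1.
PARENT: Dict[int, Optional[int]] = {
    1: None, 2: 1, 3: 1, 4: 1, 5: 4, 6: 1, 7: 1, 8: 7, 9: 7,
    10: 1, 11: 1, 12: 1,
    13: None, 14: 13, 15: 13, 16: 13, 17: 13, 18: 17, 19: 13, 20: 19,
}


def root_of(claim: int) -> int:
    """Walk up the tree to the independent claim this claim hangs from."""
    while PARENT[claim] is not None:
        claim = PARENT[claim]
    return claim


def depth(claim: int) -> int:
    """Number of dependency links between a claim and its independent claim."""
    levels = 0
    while PARENT[claim] is not None:
        claim = PARENT[claim]
        levels += 1
    return levels


independent: List[int] = sorted(c for c, p in PARENT.items() if p is None)
dependent: List[int] = sorted(c for c, p in PARENT.items() if p is not None)
ratio = len(dependent) / len(independent)

# Dependents grouped under each independent claim.
by_root = {r: [c for c in dependent if root_of(c) == r] for r in independent}

# Industry norm range for the ratio, as quoted in this analysis (4-8 : 1).
NORM_RATIO_RANGE = (4.0, 8.0)
above_norm = ratio > NORM_RATIO_RANGE[1]

print(f"independent={independent}, dependent={len(dependent)}, ratio={ratio:.1f}:1")
print({root: len(claims) for root, claims in by_root.items()})
print(f"above quoted norm: {above_norm}")
```

Under the stated assumption, the method claim carries 11 dependents and the system claim 7, and no chain is more than two levels deep.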
Metric | This Application | Software / AI Industry Norm
Total claims | 20 | 15 – 25
Independent claim count | 2 | 2 – 4
Dependent : Independent ratio | 9.0 : 1 | 4 – 8 : 1
Method claims present? | Yes (Claim 1) | Common
System / apparatus claims? | Yes (Claim 13) | Common
Drafting Quality
Drafting Quality Signals
The patent demonstrates strong spec–claim consistency through extensive GUI scenario walkthrough in the detailed description, and uses strategically broad "comprising" transitions throughout all independent and dependent claims. The most significant weakness is the absence of a computer-readable medium (CRM) claim, leaving an entire enforcement vector uncovered that competitors could exploit with software-only implementations.
✅
Antecedent Basis
The antecedent basis is clean throughout all 20 claims. Claim 1 introduces "a graphical user interface," "the contextual prompt," "the image," and "the multimodal machine learning model" in proper sequence, and all subsequent references in Claims 2–12 correctly use definite articles referencing these introduced elements. Claim 13 independently re-introduces "a user," "an image," and "the contextual prompt" in the correct order, with dependent Claims 14–20 properly back-referencing. No orphaned "the" references were found across the claim set.
All major claim limitations map to specific figures and paragraphs. The GUI providing an area-of-emphasis contextual prompt (Claim 1, element 1) maps to FIGs. 3A–3F and the detailed description at col. 5–6. The segmentation mask limitation of Claim 3 maps to FIG. 3F and col. 7–8. The annotation tool resize of Claim 9 maps to FIGs. 3B–3C and col. 9. The prompt suggestion selectable control of Claims 1 and 11 maps directly to FIGs. 5A, 5B, 6A, 6B and col. 13–15. The tokenization process underlying Claims 4–5 is described at col. 7–8 but lacks a dedicated figure, which is the single notable consistency gap.
All independent and dependent claims use "comprising" consistently, which is the correct open-ended transition for this technology domain, ensuring that additional system components (e.g., additional pre-processing modules or secondary models) do not break infringement. The use of "comprising" in the system Claim 13 preamble and in each operative step is particularly important given OpenAI's multimodal architecture, where implementations may include components beyond those explicitly claimed. No missed opportunity for "consisting essentially of" narrowing was identified, and no inadvertent limiting transitions were used.
No "means for" or "step for" language appears anywhere in the 20 claims, substantially eliminating §112(f) MPF exposure. The system Claim 13 uses structural recitation ("at least one processor" and "at least one non-transitory computer readable medium containing instructions") rather than functional "means" language, which is the appropriate drafting approach for software-implemented systems post-Williamson v. Citrix. The term "configured to" is used throughout Claim 13 and is well-established as structural language avoiding §112(f) invocation. No latent MPF risks were identified.
The claims carry moderate §101 Alice/Mayo exposure because the core inventive concept — applying a multimodal ML model conditioned on a GUI contextual prompt — could be characterized by an examiner as an abstract idea of "processing information based on user input." The hardware tie-in in Claim 13 ("at least one processor" and "non-transitory computer readable medium") provides some §101 defense, and the GUI annotation tool limitations of Claims 7–9 and 17–18 add specificity. However, the method Claim 1 contains no explicit hardware recitation and relies entirely on functional steps, making it the most vulnerable claim in a §101 challenge — a stronger filing would have included a hardware-tied preamble or referenced the image processing system 120 architecture of FIG. 1.
The dependent claims add genuinely distinct technical limitations: Claim 3 adds segmentation mask generation (a specific implementation pathway), Claims 4–5 add tokenization with coordinate indication (a data-processing mechanism), Claims 7–9 add the annotation tool hierarchy (loupe/marker/segmentation with resize capability), Claim 10 adds secondary textual prompt conditioning, and Claims 11–12 add the second-response-generation loop and virtual button display. These represent meaningfully different fall-back positions that would each require separate invalidation arguments. The primary weakness is that Claims 14–20 structurally mirror Claims 2–12 for the system claim, which adds prosecution breadth but does not introduce new technical concepts.
The abstract accurately describes the method embodiment but omits the system claim. It states that the model is "configured using prompt engineering to identify a location in the image conditioned on the image and the textual prompt," which reasonably captures the key technical mechanism. However, the abstract does not mention the selectable control / prompt suggestion feature that both independent claims recite as a required limitation, so a reviewer searching only the abstract could mischaracterize the scope and miss the interactive prompt suggestion loop as a core feature.
Figure support is strong for UI-level limitations: FIGs. 3A–3F cover all annotation tool types recited in Claims 7–9, FIGs. 5A–5B and 6A–6B cover the prompt suggestion selectable control of Claims 1 and 11–12, and FIG. 4 covers the second-response generation loop. However, the tokenization process underlying Claims 4–5 (generating textual prompts or tokens with coordinate indicators from contextual prompts) lacks a dedicated figure — only the general flow of FIG. 2 (step 230) provides indirect support. Additionally, the ML model conditioning mechanism described in Claims 1 and 13 maps only to the abstract FIG. 8 platform diagram rather than to a figure showing how the multimodal model architecture implements the conditioning on contextual prompt.
Scorecard
Strategic Intent Scorecard
Multi-dimensional assessment of this application's patent strategy quality, based on claim structure, specification depth, and prosecution positioning.
Claim Breadth
3.5
Prosecution Defensibility
3.8
Spec–Claim Consistency
4.2
Dependent Claim Coverage
4
Claim Type Diversity
2.5
Figure Support Quality
3.8
Key observation: Spec–Claim Consistency scores highest (4.2/5) because the detailed description maps every operative UI state to a named figure — FIGs. 3A–3F, 5A–5B, and 6A–6B provide granular visual support for the annotation tool and prompt suggestion limitations of Claims 7–12 and 17–20. Claim Type Diversity scores lowest (2.5/5) because the patent covers only method and system claim types, entirely omitting the CRM/computer-program-product claim type that would capture pure software implementations and close the most obvious design-around pathway for competitors distributing multimodal AI applications. A practitioner reviewing this patent should prioritise filing a continuation with at least one CRM independent claim directed to the non-transitory medium storing instructions for the GUI contextual prompt interaction workflow.
A senior-attorney lens on the three highest-priority structural weaknesses — what each exposes in prosecution and litigation, and what a stronger filing would have done differently.
GAP 01 · HIGHEST IMPACT
No CRM Claim Leaves Software Distribution Uncovered
The patent covers only method (Claim 1) and system (Claim 13) claim types, with no computer-readable medium claim directed to non-transitory storage of the GUI-based contextual prompt interaction instructions. This structural omission means that a competitor distributing a software application (e.g., a mobile app or SDK) implementing the identical multimodal interaction workflow may escape infringement if they can argue the absence of an "at least one processor" system in the distributing entity's hands. A stronger filing would have added a third independent claim reciting "a non-transitory computer-readable medium storing instructions that, when executed by a processor, perform the steps of: providing a GUI configured to enable a user to interact with an image to generate a contextual prompt that indicates an area of emphasis in the image" — the specification already supports this at FIG. 7 (Configured Medium 720) and col. 17–18.
GAP 02
Prompt Suggestion Limits Independent Claims
Both independent Claims 1 and 13 include a mandatory wherein clause requiring that "the textual response comprises a prompt suggestion" and that providing the response comprises "displaying a selectable control" — this structurally narrows the independent claims to embodiments that always generate prompt suggestions, inadvertently excluding the simpler use case where the model provides a direct textual response without a suggestion. The specific prosecution risk is that a competitor can design around both independent claims by implementing a multimodal model that returns only a direct answer (not a suggestion) conditioned on a GUI annotation, bypassing Claims 1 and 13 entirely while copying the core innovation. A stronger filing would have placed the prompt suggestion limitation only in dependent Claims 11 and 19, keeping the independent claims free of this constraint.
GAP 03 · HIGH IMPACT
No Claims Cover Multimodal Video or Real-Time Stream Input
US 12,039,431 B1 protects a method and system for interacting with a pre-trained multimodal machine learning model through a graphical user interface that allows a user to provide a contextual prompt — such as a click, loupe annotation, drawn mark, or segmentation selection — indicating an area of emphasis within a displayed image. The invention solves the problem of users needing to textually describe image regions to AI models by instead using GUI-based spatial indicators, whereupon the multimodal ML model conditions its textual response on both the image and the contextual prompt, and returns a response that may include selectable prompt suggestions.
US 12,039,431 B1 is owned by OpenAI OpCo, LLC, located in San Francisco, CA, US. The inventors are Noah Deutsch (San Francisco, CA, US), Nicholas Turley (San Francisco, CA, US), and Benjamin Zweig (San Francisco, CA, US).
Claim 1 is a method claim covering the steps of providing a GUI for image-based contextual prompt generation indicating an area of emphasis, receiving the contextual prompt, generating input data using the image and prompt, applying the input data to a multimodal ML model configured to condition its textual response on the contextual prompt, and providing the response (which must include a prompt suggestion displayed as a selectable control). Claim 13 is a system claim covering a processor-based system with non-transitory computer-readable medium instructions that perform the same operations as Claim 1, including GUI provision, contextual prompt reception, input data generation, ML model conditioning, and selectable prompt suggestion display.
This patent covers technology that lets users point to or draw on a part of an image to ask an AI assistant a question about that specific area, rather than having to type out a lengthy description of where in the image they want to focus. For example, a user could circle a plant in a photo on their phone, and the AI would automatically understand they are asking about that plant and suggest relevant questions like 'What kind of basil is this?' The AI then provides a targeted answer about just that area of the image.
G06N 3/0455 (2023.01) — Transformers (neural network architectures based on the attention mechanism). G06N 3/08 (2013.01) — Learning methods for artificial neural networks.
Disclaimer: This analysis is generated by PatSnap Eureka AI based on publicly available patent data from the USPTO. It does not constitute legal advice and should not be relied upon as such. Patent data may be subject to change as prosecution progresses. Scores and assessments reflect automated analysis and may not capture all relevant legal or technical nuances. Always consult a qualified patent attorney for formal legal opinions on patentability, freedom to operate, or infringement.