How to evaluate LLM hallucination rates in engineering

Hallucination in large language models is not merely a nuisance in high-stakes engineering environments — it is a liability. A synthesis of approximately 60 patent filings reveals four distinct technical layers for evaluating and suppressing LLM hallucination rate, from pre-generation probability gating to domain-specific per-assertion scoring. This article maps the state of the art so engineering teams can select the right architecture for their risk profile.

PatSnap Insights Team, Innovation Intelligence Analysts · 12 min read
Reviewed by the PatSnap Insights editorial team

Why Hallucination Rate Is an Engineering Risk, Not Just an AI Limitation

In high-stakes engineering decision support, a hallucinated LLM output is not a minor inaccuracy — it can propagate into maintenance schedules, safety assessments, regulatory submissions, or procurement decisions before any human reviewer intercepts it. The urgency of this problem is reflected in a patent dataset encompassing approximately 60 filings and pending applications across US, EP, GB, WO, CN, KR, JP, and other jurisdictions, with dominant assignees including JPMorgan Chase Bank, Google LLC, Vodafone Group Services Limited, Microsoft Technology Licensing LLC, Oracle International Corporation, Adobe Inc., LTI MindTree Ltd., Accenture Global Solutions, BMC Software, ServiceNow, and NEC Laboratories Europe.

~60 patent filings analysed on LLM hallucination
8+ jurisdictions covered (US, EP, GB, WO, CN, KR, JP & more)
4 principal technical themes identified
11+ major corporate assignees filing in this space

The breadth of assignees — spanning financial services, telecommunications, enterprise software, and industrial automation — reflects growing urgency to deploy trustworthy LLMs in mission-critical decision support. According to WIPO, AI-related patent filings have accelerated sharply since 2020, and the hallucination-mitigation sub-cluster is among the fastest-moving segments within applied AI. The filings cluster into four principal technical themes: (1) pre-generation hallucination probability estimation using query perturbation and statistical simulation; (2) runtime detection using encoder and variational-autoencoder architectures applied to RAG-enhanced LLMs; (3) end-to-end LLM evaluation frameworks incorporating composite health and quality scoring; and (4) domain-specific hallucination correction pipelines for regulated engineering and industrial contexts.

What is “hallucination rate” in an LLM?

Hallucination rate refers to the proportion of LLM outputs that contain factually incorrect, fabricated, or contextually misaligned assertions. In engineering decision support, this metric is used as a quality gate: outputs exceeding an acceptable hallucination rate threshold are withheld from downstream systems or routed to human expert review before action is taken.
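
As a minimal sketch of the definition above, the rate can be computed as the share of reviewed outputs flagged as containing a hallucination and then compared with an acceptance threshold. The `ReviewedOutput` record, the sample texts, and the 5% threshold are illustrative assumptions, not values taken from any filing.

```python
from dataclasses import dataclass


@dataclass
class ReviewedOutput:
    """One LLM output plus a reviewer's verdict (hypothetical record type)."""
    text: str
    has_hallucination: bool


def hallucination_rate(outputs: list[ReviewedOutput]) -> float:
    """Fraction of outputs that contain at least one hallucinated assertion."""
    if not outputs:
        raise ValueError("need at least one reviewed output")
    flagged = sum(1 for o in outputs if o.has_hallucination)
    return flagged / len(outputs)


def passes_quality_gate(outputs: list[ReviewedOutput], max_rate: float) -> bool:
    """True if the observed rate is within the acceptable threshold."""
    return hallucination_rate(outputs) <= max_rate


sample = [
    ReviewedOutput("Torque spec is 45 Nm per the manual.", False),
    ReviewedOutput("Valve V-12 was replaced in 2019.", True),
    ReviewedOutput("Inspection interval is 6 months.", False),
    ReviewedOutput("Pump P-3 is rated for 10 bar.", False),
]
rate = hallucination_rate(sample)  # 1 flagged output out of 4
release = passes_quality_gate(sample, max_rate=0.05)  # assumed 5% threshold
```

With one flagged output in four, the sample fails the assumed 5% gate and would be routed to expert review rather than released downstream.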

Understanding which evaluation layer to deploy — and when — requires mapping the four technical themes against the operational constraints of a given engineering workflow: latency tolerance, availability of ground-truth reference data, whether the LLM is accessed via API (black-box) or with internal model access, and whether the deployment context is streaming or batch. The sections below address each theme in sequence.

Pre-Generation Probability Estimation: Gating Queries Before They Reach the LLM

Pre-generation hallucination probability estimation solves the most fundamental problem in LLM reliability for engineering: hallucination is typically discovered only after the model has already committed an output to a user or downstream system. Two patents from JPMorgan Chase Bank directly address this by computing hallucination probability before generation occurs.

The JPMorgan Chase Bank system (2025) describes perturbing an incoming query n times into lexically divergent but semantically equivalent variations, deploying n+1 independent agents to sample outputs for each variant, applying a statistical simulation algorithm across the sampled outputs, and then deriving an empirical expected hallucination rate as ground truth for training an encoder classifier. The classifier ultimately returns a probability-of-hallucination value before the LLM generates any response. This architecture is significant for engineering workflows because it allows a risk threshold to gate whether a query is forwarded to the LLM at all, or whether a human expert review is triggered instead.
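
The perturb-and-sample idea can be illustrated with a small sketch. It is not the patented algorithm: the `perturb` templates stand in for an LLM paraphraser, the agents are stub callables, and "share of answers that disagree with the majority" is an assumed stand-in for the statistical simulation step. The 0.25 gating threshold is also an assumption.

```python
from collections import Counter
from typing import Callable

Agent = Callable[[str], str]


def perturb(query: str, n: int) -> list[str]:
    """Return n reworded variants (toy templates, not a real paraphraser)."""
    templates = ["Please tell me: {q}", "Question: {q}",
                 "{q} Answer briefly.", "In short, {q}"]
    return [templates[i % len(templates)].format(q=query) for i in range(n)]


def expected_hallucination_rate(query: str, agents: list[Agent], n: int) -> float:
    """Share of sampled answers that disagree with the majority answer."""
    variants = [query] + perturb(query, n)  # original plus n variants
    if len(agents) != len(variants):
        raise ValueError("expected one agent per prompt (n + 1 agents)")
    answers = [agent(v) for agent, v in zip(agents, variants)]
    majority_count = Counter(answers).most_common(1)[0][1]
    return 1.0 - majority_count / len(answers)


# Stub agents: four agree, one returns a different value.
n = 4
agents = [lambda p: "45 Nm"] * 4 + [lambda p: "60 Nm"]
rate = expected_hallucination_rate("What is the torque spec for bolt B7?", agents, n)
forward_to_llm = rate <= 0.25  # assumed risk threshold; otherwise escalate to an expert
```

In the filing, estimates of this kind serve as training labels for a classifier, so the sampling cost is paid offline rather than on every production query.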

JPMorgan Chase Bank’s patented pre-generation hallucination probability system perturbs an incoming query n times into semantically equivalent variations, deploys n+1 independent agents to sample outputs, and applies a statistical simulation algorithm to derive an empirical expected hallucination rate — enabling hard-gating of high-risk engineering queries before any LLM response is generated.

The companion JPMorgan Chase Bank training patent (2026) formalizes the supervised learning pipeline: a plurality of LLMs perturb training queries n times, generating perturbed outputs whose consistency is measured via computational statistical simulation to derive empirical probability estimations. These estimations become labels for supervised training of the encoder classifier. The statistical robustness of using Monte Carlo-style sampling across multiple agent outputs — rather than a single confidence score derived from token probabilities — makes this approach defensible in engineering audit trails where probabilistic traceability is required.
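
To make the supervised step concrete, the sketch below trains a classifier on empirical hallucination-rate labels. It swaps in a one-feature logistic regression for the encoder classifier described in the patent, and the feature (a hypothetical "query ambiguity" score) and the label values are invented for illustration.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(w: list[float], b: float, x: list[float]) -> float:
    """Predicted probability of hallucination for feature vector x."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train(features, labels, lr=0.5, epochs=2000):
    """Gradient descent on cross-entropy; labels are empirical rates in [0, 1]."""
    dim = len(features[0])
    w, b = [0.0] * dim, 0.0
    m = len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = predict(w, b, x) - y  # gradient of the loss w.r.t. the logit
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / m for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / m
    return w, b


# Hypothetical training set: one feature per query, label = empirical rate
# obtained from perturbation sampling.
features = [[0.1], [0.2], [0.8], [0.9]]
labels = [0.05, 0.10, 0.70, 0.85]
w, b = train(features, labels)
p_low = predict(w, b, [0.1])
p_high = predict(w, b, [0.9])
```

Once trained, only `predict` runs at query time, which is what allows a probability to be returned before the LLM generates anything.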

A complementary technique from Microsoft Technology Licensing (2025) introduces a forward-backward traversal method: a primary forward prompt yields a primary answer, and then backward traversals — using answer-question pairs with the primary answer embedded but the primary question withheld — generate candidate questions. A vector distance between candidate question embeddings and the primary question embedding serves as a hallucination indicator. This geometric approach is notable for its model-agnostic quality and its tolerance for varying temperature and sampling parameters (top-p, top-k), allowing it to probe LLM response consistency across stochastic conditions that approximate real engineering query variance. Standards bodies such as IEEE have begun addressing reliability requirements for AI systems in safety-critical contexts, and this model-agnostic property is directly relevant to third-party LLM integrations common in engineering platforms.
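
The distance computation at the heart of the forward-backward check can be sketched as follows. The bag-of-words `embed` function is a toy stand-in for a real sentence encoder, the candidate questions are hard-coded rather than produced by backward traversals, and averaging the distances is an assumed aggregation.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; a real system would use a sentence encoder."""
    return Counter(text.lower().replace("?", "").split())


def cosine_distance(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return 1.0 - dot / norm if norm else 1.0


def hallucination_indicator(primary_question: str, candidates: list[str]) -> float:
    """Mean distance between the primary question and back-generated questions."""
    q = embed(primary_question)
    dists = [cosine_distance(q, embed(c)) for c in candidates]
    return sum(dists) / len(dists)


primary = "What is the maximum operating pressure of pump P-3?"
# Questions recovered from a consistent answer stay close to the original.
consistent = ["What is the maximum operating pressure of pump P-3?",
              "What is the operating pressure limit of pump P-3?"]
# Questions recovered from a drifting answer do not.
drifted = ["Who manufactured the valve?", "When was the compressor installed?"]

low = hallucination_indicator(primary, consistent)
high = hallucination_indicator(primary, drifted)
```

A larger indicator means the answer no longer points back to the question that produced it, which is the signal the method treats as a possible hallucination.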

Figure 1 — Pre-Generation Hallucination Probability Estimation: Process Flow
[Diagram: Step 1, Incoming Query → Step 2, Perturb n times → Step 3, n+1 Agents → Step 4, Statistical Simulation → Output, Hallucination Probability]
JPMorgan Chase Bank’s patented pre-generation pipeline perturbs each query n times, routes outputs through n+1 agents, applies statistical simulation, and returns a hallucination probability score before the LLM commits any response — enabling hard-gating of high-risk engineering queries.

Runtime Detection Architectures for RAG-Enhanced Engineering LLMs

Retrieval-Augmented Generation (RAG) systems constrain LLM responses to a closed-domain knowledge base, but RAG does not eliminate hallucination — it shifts the hallucination signature from outright fabrication to subtle misalignment between retrieved evidence and generated assertion. Runtime detection architectures address this by analysing outputs as they are produced, without requiring access to model internals.

Vodafone Group Services Limited’s EP, US, and GB filings (2026) describe a Variational Autoencoder (VAE) approach: an LLM output vector is fed into the encoder portion of the VAE, which maps the output into a dimensionally reduced latent space distribution. The VAE is pre-trained on labeled datasets of normal versus hallucination outputs, enabling it to compute a likelihood metric for whether any new output vector deviates from the characteristic distribution of factually grounded responses. The GB filing explicitly notes that a detected candidate hallucination may cause the output to be discarded, the user to be alerted, or a revised prompt to be generated automatically — all relevant response modes for engineering decision pipelines. The three-jurisdiction filing strategy (EP, US, GB) signals a deliberate global IP position in this detection architecture.
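
The scoring step can be approximated in a few lines, with one important caveat: the sketch below does not implement a VAE. It assumes latent vectors have already been produced by some encoder and fits a diagonal Gaussian to the latents of grounded responses as a simple stand-in for the learned latent distribution. The sample vectors and the log-likelihood threshold are invented.

```python
import math


def fit_diagonal_gaussian(latents: list[list[float]]):
    """Per-dimension mean and variance of latents from grounded responses."""
    n, dim = len(latents), len(latents[0])
    mean = [sum(v[j] for v in latents) / n for j in range(dim)]
    var = [max(sum((v[j] - mean[j]) ** 2 for v in latents) / n, 1e-6)
           for j in range(dim)]
    return mean, var


def log_likelihood(z: list[float], mean: list[float], var: list[float]) -> float:
    """Log density of z under the fitted diagonal Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * s) + (x - m) ** 2 / s)
               for x, m, s in zip(z, mean, var))


def is_candidate_hallucination(z, mean, var, min_log_likelihood: float) -> bool:
    """Flag outputs whose latent vector is unlikely under the grounded distribution."""
    return log_likelihood(z, mean, var) < min_log_likelihood


grounded = [[0.1, 0.2], [0.0, 0.1], [0.2, 0.3], [0.1, 0.0]]  # hypothetical latents
mean, var = fit_diagonal_gaussian(grounded)

typical = [0.1, 0.15]
outlier = [3.0, -2.0]
flag_typical = is_candidate_hallucination(typical, mean, var, min_log_likelihood=-5.0)
flag_outlier = is_candidate_hallucination(outlier, mean, var, min_log_likelihood=-5.0)
```

A flagged output would then be handled by whichever response mode the deployment chooses: discarding it, alerting the user, or generating a revised prompt.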

Vodafone Group Services Limited’s VAE-based hallucination detection architecture maps LLM output vectors into a dimensionally reduced latent space, comparing them against a distribution of factually grounded responses pre-trained on labeled datasets — enabling hallucination detection in RAG-enhanced LLMs without requiring ground truth at inference time. The architecture is protected across EP, US, and GB jurisdictions (2026).

For industrial process engineering and asset maintenance, ABB Switzerland’s CN filing (2025) introduces a verification plan methodology: after an LLM returns an answer about an industrial asset, a set of follow-up verification questions is constructed based on the technical context, and the degree to which the LLM’s answers to verification questions align with expected answers constitutes a confidence metric. The patent draws an analogy to forensic interrogation — consistent fabrication across multiple cross-questions is difficult to maintain, so inconsistency in follow-up responses signals hallucination. This is particularly suited for contexts where domain-grounded expected answers can be pre-established, such as process engineering, asset maintenance, or equipment specification review.
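
A verification plan of this kind reduces to a simple scoring loop. In the sketch below, the plan contents, the stubbed LLM, and the use of exact match after normalization as the alignment measure are all assumptions; a production system would likely use a softer comparison.

```python
from typing import Callable


def normalize(answer: str) -> str:
    """Lower-case and collapse whitespace so trivial differences do not count."""
    return " ".join(answer.lower().split())


def verification_confidence(ask: Callable[[str], str], plan: dict[str, str]) -> float:
    """Fraction of verification questions whose answer matches the expected one."""
    if not plan:
        raise ValueError("verification plan is empty")
    matches = sum(1 for question, expected in plan.items()
                  if normalize(ask(question)) == normalize(expected))
    return matches / len(plan)


# Hypothetical plan for an industrial asset: question -> expected answer.
plan = {
    "What is the rated pressure of pump P-3?": "10 bar",
    "What fluid does pump P-3 handle?": "cooling water",
    "What is the inspection interval for pump P-3?": "6 months",
    "Which motor drives pump P-3?": "M-17",
}

# Stub LLM that gets one follow-up wrong.
canned = dict(plan)
canned["Which motor drives pump P-3?"] = "M-42"
confidence = verification_confidence(lambda q: canned[q], plan)
```

The cross-examination effect comes from the number and variety of follow-up questions: a fabricated primary answer is more likely to produce at least one inconsistent follow-up as the plan grows.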

Key finding: inline uncertainty monitoring

SRI International’s patented system monitors token-by-token generation uncertainty against a predetermined threshold and injects “think tokens” — additional computation prompts — whenever generated tokens exhibit uncertainty exceeding expected bounds. This mechanism operates inline with generation rather than post-hoc, making it applicable to streaming engineering decision interfaces where latency constraints prevent full-output analysis.
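
The inline mechanism can be sketched with Shannon entropy as the uncertainty measure. That choice, the `<think>` marker, the threshold of 0.8 nats, and the hand-written token distributions are assumptions; the filing is described here only as comparing per-token uncertainty with a predetermined threshold.

```python
import math


def entropy(probs: list[float]) -> float:
    """Shannon entropy (nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def inject_think_tokens(steps, threshold: float, think_token: str = "<think>"):
    """steps: (token, distribution) pairs; insert a marker before uncertain tokens."""
    out = []
    for token, dist in steps:
        if entropy(dist) > threshold:
            out.append(think_token)  # request extra computation at this point
        out.append(token)
    return out


# Hypothetical decoder trace: each token with the distribution it was drawn from.
steps = [
    ("Replace", [0.90, 0.05, 0.05]),
    ("the", [0.97, 0.02, 0.01]),
    ("bearing", [0.40, 0.35, 0.25]),  # near-uniform, so high uncertainty
    ("now", [0.85, 0.10, 0.05]),
]
stream = inject_think_tokens(steps, threshold=0.8)
```

Because the check runs per token, it adds work only at uncertain positions, which is what makes the approach compatible with streaming interfaces.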

Figure 2 — Runtime Hallucination Detection Methods: Architectural Comparison by Approach
[Bar chart: suitability score (0–100). VAE Latent Space: 88; Verification Questions: 80; Think Token Injection: 72; Forward-Backward: 75. Annotations: "No ground truth needed at inference"; "Domain-grounded expected answers".]
Qualitative suitability scores (0–100) for four runtime hallucination detection architectures, assessed against the criterion of applicability in engineering LLM deployments. VAE latent space analysis scores highest for RAG-closed-domain deployments; verification-question cross-examination is best suited to industrial asset contexts with pre-established expected answers.

The forward-backward consistency method from Microsoft Technology Licensing (2025) is deployable across different LLM providers without requiring internal model access — a critical property for engineering platforms that integrate third-party LLMs via API. Research published by Nature on AI reliability in scientific contexts has highlighted model-agnostic evaluation as a priority precisely because engineering organisations rarely have white-box access to commercially deployed foundation models.

Composite Health Scoring and Governance Frameworks for Regulated Engineering Environments

Evaluating hallucination rate for high-stakes deployment requires more than binary detection — it requires calibrated, multi-dimensional scoring that can serve as an operational quality gate aligned with formal certification processes. Three distinct frameworks in the patent dataset address this governance requirement.

LTI MindTree Ltd.’s end-to-end LLM evaluation system (2025) evaluates both input prompts and output responses across multiple characteristics encompassing quality and quantity dimensions. Each input characteristic is assigned a normalized score via statistical techniques to derive a composite health score; outputs are evaluated both with and without ground truth references. A scorer module employing threshold-based statistical techniques aggregates input prompt health and output response health into a final LLM health score. This architecture allows organizations to set engineering-specific acceptance thresholds below which an LLM version is not cleared for production use, an essential requirement in regulated engineering environments such as aerospace, energy infrastructure, or pharmaceuticals.
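
A composite score of this shape can be sketched as normalization followed by aggregation. The min-max normalization, the equal weighting of the three groups, the metric names, and the 0.75 acceptance threshold are all assumptions chosen for illustration.

```python
def normalize(value: float, lo: float, hi: float) -> float:
    """Min-max scale a raw metric into [0, 1] (one possible normalization)."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return min(max((value - lo) / (hi - lo), 0.0), 1.0)


def group_health(scores: dict[str, float]) -> float:
    """Mean of the normalized scores within one group."""
    return sum(scores.values()) / len(scores)


def llm_health(input_scores, with_ground_truth, without_ground_truth) -> float:
    """Equal-weight average of the three group health scores (assumed weighting)."""
    groups = [group_health(input_scores),
              group_health(with_ground_truth),
              group_health(without_ground_truth)]
    return sum(groups) / len(groups)


# Hypothetical metrics, already in [0, 1] unless normalized here.
input_scores = {"clarity": 0.9, "length": normalize(350, 0, 500)}
with_ground_truth = {"exact_match": 0.6, "faithfulness": 0.8}
without_ground_truth = {"self_consistency": 0.9}

score = llm_health(input_scores, with_ground_truth, without_ground_truth)
cleared_for_production = score >= 0.75  # assumed acceptance threshold
```

Keeping the group scores separate before the final aggregation makes it possible to report which dimension caused a model version to miss the threshold.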

LTI MindTree Ltd.’s patented end-to-end LLM evaluation system (2025) aggregates normalized scores across input prompt quality, output quality with ground truth, and output quality without ground truth into a composite LLM health score — enabling organizations to set engineering-specific acceptance thresholds below which an LLM is not cleared for production use in regulated environments such as aerospace, energy infrastructure, or pharmaceuticals.

Accenture Global Solutions’ RAIOPS evaluation method (2026) frames hallucination evaluation within a broader Responsible AI Operations paradigm. Prompts and responses are stored as associations, and user-specified evaluation criteria drive the computation of evaluation metrics. Results are visualized as knowledge graph representations or numerical scores indicating whether the LLM needs optimization or tuning — a governance-oriented feedback loop directly applicable to engineering decision-support system certification processes. Regulatory frameworks from bodies such as ISO increasingly require documented AI quality governance, and Accenture’s RAIOPS architecture is designed to produce exactly this kind of auditable evidence trail.

BMC Software’s domain-specific hallucination detection pipeline (2024) demonstrates assertion-level scoring: a domain-specific ML model trained on resolved incident tickets assigns a hallucination score to each resolution statement by cross-referencing it against source worklog data or training data. Hallucinated content is flagged and removed before the resolution is finalized. This pipeline exemplifies how hallucination rate can be estimated on a per-assertion basis within a structured engineering artifact — an incident ticket, a maintenance log, or a specification — rather than at the output level only, providing more actionable reliability signals for engineering teams.
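
Per-assertion scoring can be illustrated with a deliberately simple scorer. The filing describes a domain-specific ML model; the sketch below substitutes token overlap with the worklog, which is only a rough proxy, and the example ticket text and 0.5 threshold are invented.

```python
import re


def split_assertions(text: str) -> list[str]:
    """Split a resolution into sentence-level assertions."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def hallucination_score(assertion: str, worklog: str) -> float:
    """1 minus the share of the assertion's tokens that appear in the worklog."""
    a = tokens(assertion)
    if not a:
        return 0.0
    return 1.0 - len(a & tokens(worklog)) / len(a)


def filter_resolution(resolution: str, worklog: str, threshold: float):
    """Return the resolution without flagged assertions, plus the flagged ones."""
    kept, flagged = [], []
    for assertion in split_assertions(resolution):
        target = flagged if hallucination_score(assertion, worklog) > threshold else kept
        target.append(assertion)
    return " ".join(kept), flagged


worklog = ("Restarted the database service on host db01. Cleared the disk cache. "
           "Service recovered after restart.")
resolution = ("Restarted the database service on host db01. "
              "Replaced the faulty network card.")
clean, flagged = filter_resolution(resolution, worklog, threshold=0.5)
```

Scoring each assertion separately means a single unsupported sentence can be removed without discarding the rest of the resolution.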

BMC Software’s patented domain-specific hallucination detection pipeline (2024) assigns hallucination scores at the assertion level within structured engineering artifacts such as incident tickets, cross-referencing each resolution statement against source worklog data — enabling per-assertion reliability quantification rather than output-level flags alone.

ServiceNow’s Framework for Trustworthy Generative Artificial Intelligence (2025) generalizes this pattern: a validation model configured to detect a specific fault property in an LLM output computes a likelihood metric; if the metric exceeds a fault threshold, the output is labeled untrustworthy. The architecture supports real-time computation of metrics by pre-processing modules, which is essential for maintaining responsive engineering advisory systems without sacrificing trust assurance. Oracle International Corporation’s complementary filings on machine learning traceback-enabled decision rationales (2026) and responding to hallucinations in generative LLMs (2025) further emphasize explainability and traceability — critical requirements for engineering audit and compliance environments.

Key Assignees and the Shape of the LLM Hallucination Innovation Landscape

The patent dataset reveals a clear stratification of innovation roles across the approximately 60 filings: financial services firms lead in pre-generation probabilistic gating; telecoms and cloud providers lead in runtime detection architectures; enterprise software vendors dominate governance and composite scoring frameworks; and industrial automation specialists address domain-specific verification.

Financial Services: Pre-Generation Probability Leadership

JPMorgan Chase Bank leads in pre-generation hallucination probability estimation, filing both the system-level patent (2025) and the encoder training methodology (2026). Their approach of multi-agent query perturbation with statistical simulation represents the most technically rigorous pre-generation framework observed in the dataset. JPMorgan also addresses code generation hallucination via guardrails in a separate 2025 filing on improving code generation quality through code guardrails — indicating a portfolio-level strategy to address hallucination across multiple LLM use cases in regulated financial and engineering contexts.

Cloud and Telecoms: Runtime and Iterative Detection

Google LLC holds two parallel US and WO filings on iterative hallucination detection-and-regeneration: if a first response contains hallucination, a second is generated and checked, with only the verified non-hallucinated response rendered to the client. Google also filed on monitoring generative model quality using an expert system to benchmark LLM output quality against modified model versions, incorporating backstop prompts that a model must answer acceptably before production clearance. Vodafone Group Services Limited pursues a VAE-based runtime detection architecture across three jurisdictions (EP, US, GB), demonstrating a coherent global IP strategy for closed-domain RAG hallucination detection.
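
The detect-and-regenerate loop described for Google's filings can be sketched as a bounded retry. The `generate` and `is_hallucinated` callables are hypothetical stubs, and the configurable `max_attempts` is an assumed generalization of the two-response case described above.

```python
from typing import Callable, Optional


def verified_response(generate: Callable[[str, int], str],
                      is_hallucinated: Callable[[str], bool],
                      prompt: str,
                      max_attempts: int = 2) -> Optional[str]:
    """Return the first response that passes the check, or None if none does."""
    for attempt in range(max_attempts):
        candidate = generate(prompt, attempt)
        if not is_hallucinated(candidate):
            return candidate  # only a verified response reaches the client
    return None  # caller decides how to handle exhaustion (e.g. human review)


# Stubs: the first draft cites a part that does not exist.
drafts = ["Use part X-999 (unknown part).",
          "Use part B-204 per the maintenance manual."]


def generate(prompt: str, attempt: int) -> str:
    return drafts[attempt]


def is_hallucinated(text: str) -> bool:
    return "unknown part" in text


answer = verified_response(generate, is_hallucinated, "Which part replaces the seal?")
```

Bounding the number of attempts keeps latency predictable and gives the caller an explicit failure case instead of an unverified answer.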

Enterprise Software: Governance and Explainability

Adobe Inc. contributes three filings across 2024, 2025, and 2026 on template-based hallucination prevention focused on factual consistency checking against structured templates — a method extensible to engineering specification documents. Oracle International Corporation addresses both machine learning traceback-enabled decision rationales (2026) and responding to hallucinations in generative LLMs (2025), emphasizing explainability and traceability. NEC Laboratories Europe discloses explainer, output verification, and hallucination correction for LLMs (2026) with explicit application to computational biology and medical AI, using attribution links between text spans to identify hallucination candidates — a technique transferable to engineering documentation analysis. Microsoft Technology Licensing contributes the forward-backward hallucination detection technique and a separate 2025 filing on producing calibrated confidence estimates for open-ended answers, where description-based and cause-based confidence scores are calibrated using historical event data in a target domain — directly applicable to engineering root-cause analysis.
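
Calibrating confidence against historical outcomes, as in the Microsoft filing mentioned above, can be illustrated with histogram binning. This is a generic calibration technique swapped in for illustration, not the method claimed in the filing, and the history records are invented.

```python
def fit_bins(history: list[tuple[float, bool]], n_bins: int = 5) -> list[float]:
    """Observed accuracy per confidence bin from (raw confidence, was_correct) pairs."""
    totals = [0] * n_bins
    hits = [0] * n_bins
    for conf, correct in history:
        i = min(int(conf * n_bins), n_bins - 1)
        totals[i] += 1
        hits[i] += int(correct)
    # Empty bins fall back to the bin midpoint (an arbitrary choice for this sketch).
    return [hits[i] / totals[i] if totals[i] else (i + 0.5) / n_bins
            for i in range(n_bins)]


def calibrate(conf: float, bins: list[float]) -> float:
    """Replace a raw confidence with the historical accuracy of its bin."""
    return bins[min(int(conf * len(bins)), len(bins) - 1)]


# Hypothetical history of root-cause answers in a target domain.
history = [(0.95, True), (0.92, False), (0.91, True), (0.90, False),
           (0.30, False), (0.35, False)]
bins = fit_bins(history)
calibrated_high = calibrate(0.93, bins)  # raw 0.93, but only half were correct
calibrated_low = calibrate(0.32, bins)
```

In this toy history the model's "90%+" answers were right only half the time, so the calibrated value is what an engineer should weigh in a root-cause decision.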

Figure 3 — LLM Hallucination Patent Filings: Assignee Activity by Technical Theme
[Stacked bar chart: filings in dataset (axis 0–3) by assignee and technical theme (pre-generation, runtime detection, composite scoring, domain-specific). Segment values: JPMorgan 2, 1; Google 2, 1; Vodafone 3; Adobe 3; Microsoft 1, 1, 1.]
Filing counts by assignee and technical theme across the approximately 60-patent dataset. Vodafone leads in runtime detection (3 filings across EP/US/GB jurisdictions for the same VAE architecture); Adobe leads in composite scoring (3 template-based filings across 2024–2026); JPMorgan leads in pre-generation estimation (2 filings plus 1 domain-specific code guardrails patent).

The multi-jurisdictional filing patterns are themselves an innovation signal. Vodafone’s identical EP/US/GB filings for the VAE architecture, and Google’s parallel US/WO filings for iterative detection-and-regeneration, indicate that these assignees view their hallucination detection techniques as core platform IP warranting global protection, not merely defensive publications. For engineering teams evaluating vendor LLM platforms, this IP concentration is a useful indicator of where the most defensible technical differentiation currently resides. Patent data tracked by the EPO confirm that AI-reliability-related filings have grown substantially since 2022, with hallucination mitigation emerging as a distinct sub-category.

“VAE-based latent space analysis provides scalable runtime hallucination detection for RAG systems without requiring ground truth at inference time — essential for domains lacking comprehensive reference corpora.”

References

  1. System and method for implementing a model that predicts the probability of hallucination for any query imposed to an LLM — JPMorgan Chase Bank, 2025
  2. Method and system of training an encoder classifier model in predicting hallucination of a machine learning (ML) model before a generation of a query — JPMorgan Chase Bank, 2026
  3. Detecting candidate hallucinations in outputs of a retrieval-augmented generation enhanced large language model (EP) — Vodafone Group Services Limited, 2026
  4. Detecting candidate hallucinations in outputs of a retrieval-augmented generation enhanced large language model (US) — Vodafone Group Services Limited, 2026
  5. Detecting candidate hallucinations in outputs of a retrieval-augmented generation enhanced large language model (GB) — Vodafone Group Services Limited, 2026
  6. Language model hallucination detection — Microsoft Technology Licensing, LLC, 2025
  7. Detection of hallucinations in large language model responses (US) — Google LLC, 2025
  8. Detection of hallucinations in large language model responses (WO) — Google LLC, 2025
  9. System and method for preventing hallucinations — SRI International, 2025
  10. Method and system for performing end-to-end evaluation of a large language model (LLM) (US) — LTI MindTree Ltd., 2025
  11. Method and system for performing end-to-end evaluation of a large language model (LLM) (IN) — LTI MindTree Ltd., 2025
  12. Method and system for evaluating integration of responsible AI with LLM operations — Accenture Global Solutions Limited, 2026
  13. Domain-specific hallucination detection and correction for machine learning models — BMC Software, Inc., 2024
  14. Framework for Trustworthy Generative Artificial Intelligence — ServiceNow, Inc., 2025
  15. Explainer, output verification, and hallucination correction for output of large language models — NEC Laboratories Europe GmbH, 2026
  16. Producing calibrated confidence estimates for open-ended answers by generative artificial intelligence models — Microsoft Technology Licensing, LLC, 2025
  17. Responding to hallucinations in generative large language models — Oracle International Corporation, 2025
  18. Machine learning traceback-enabled decision rationales as models for explainability — Oracle International Corporation, 2026
  19. Information retrieval from LLM with reduced hallucination for industrial applications — ABB Switzerland, 2025
  20. Hallucination prevention for natural language insights (2024) — Adobe Inc., 2024
  21. Hallucination prevention for natural language insights (2025) — Adobe Inc., 2025
  22. Hallucination prevention for natural language insights (2026) — Adobe Inc., 2026
  23. Method and system for improving code generation quality of large language model through code guardrails — JPMorgan Chase Bank, 2025
  24. Method and system for dynamic weighted metrics-based evaluation and tokenization of large language models — Tata Consultancy Services Limited, 2025
  25. Towards automated and reliable LLM evaluation: a framework to evaluate LLMs and find suitable automatic metrics to reduce the human in the loop — Dell Products L.P., 2025
  26. Enhanced detection of violation conditions using large language models — Digital Reasoning Systems, Inc., 2025
  27. Large Language Model (LLM) Selection Using Artificial Intelligence (AI) System Networks — Bank of America Corporation, 2026
  28. Evaluating computational reasoning performance of generative artificial intelligence models — Microsoft Technology Licensing, LLC, 2026
  29. WIPO — World Intellectual Property Organization: AI Patent Filing Trends
  30. EPO — European Patent Office: AI and Machine Learning Patent Landscape
  31. IEEE — Institute of Electrical and Electronics Engineers: AI Reliability Standards
  32. Nature — AI Reliability and Hallucination in Scientific Contexts
  33. ISO — International Organization for Standardization: AI Quality Governance Frameworks
  34. PatSnap — R&D Intelligence Platform for Innovation Teams
  35. PatSnap Insights Blog — Innovation Intelligence Research

All data and statistics in this article are sourced from the references above and from PatSnap’s proprietary innovation intelligence platform.
