Overview
The Patsnap PatentBench is a benchmark specifically for novelty search tasks in real-world patent scenarios.
It evaluates the performance of three AI tools: Patsnap’s Novelty Search AI Agent, ChatGPT-o3 (with web search), and DeepSeek-R1 (with web search).
The benchmark is based on 89 test samples, each consisting of a “test question” and a “standard answer”: a curated set of X documents (prior art citations that examiners judge to prejudice novelty or inventive step on their own) from various patent offices, closely representing the ideal references used in actual novelty searches.
Understanding Novelty Search
Novelty search is a key patent task that involves systematically identifying prior art worldwide to determine whether a technical solution is new and inventive under patent law.
It plays a critical role throughout the innovation process, including:
R&D planning: guiding the direction and feasibility of new developments
Pre-filing: verifying that an invention is patentable before submission
Patent examination: helping examiners assess the novelty of applications
Background of the Patsnap PatentBench
The Patsnap PatentBench–Novelty Search benchmark encompasses four key steps:
1) Evaluation Dataset Design
Evaluating novelty search objectively demands standardized metrics grounded in global patent examination practice. To that end, we built a high-quality benchmarking dataset from international patent families undergoing parallel examination across multiple patent offices.
The process begins with a preliminary screening focused on the comparability of claim texts among candidate family members. Leveraging our proprietary claims consistency alignment model, we perform semantic alignment and assess technical similarity across claims. A second round of refined filtering ensures textual consistency, effectively eliminating the “noise” introduced by linguistic variations.
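Patsnap’s claims consistency alignment model is proprietary, but the underlying idea of cross-lingual semantic alignment can be sketched with an off-the-shelf multilingual embedding model. The model choice and the similarity threshold below are illustrative assumptions, not the benchmark’s actual parameters:

```python
# Minimal sketch of claim alignment via embedding similarity. Patsnap's
# model is proprietary; this only illustrates the general idea. The model
# name and the 0.85 threshold are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def claims_are_consistent(claim_a: str, claim_b: str, threshold: float = 0.85) -> bool:
    """True if two claim texts (possibly in different languages) are close
    enough semantically to treat the family members as the same invention."""
    embeddings = model.encode([claim_a, claim_b], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item() >= threshold
```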
Next, we identify the X prior art references—those actually cited by patent examiners when assessing novelty and inventive step—as the gold standard for evaluation. These references are then deduplicated, standardized, and integrated, with harmonized document identifiers, citation formats, and office attributions. The result is a consistent, reusable reference set for benchmarking and comparative analysis.
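The exact harmonization rules are not published; as a rough sketch, identifier harmonization and deduplication might look like the following (the normalization rules here are simplified assumptions):

```python
import re

def normalize_pub_number(raw: str) -> str:
    """Collapse formatting differences so that e.g. 'EP 1 234 567 A1' and
    'EP1234567A1' map to the same identifier. Simplified rules; the
    benchmark's actual harmonization logic is not published."""
    compact = re.sub(r"[\s\-/,.]", "", raw.upper())
    # Drop the kind code (a trailing letter plus optional digit) so different
    # publication stages of the same document deduplicate together.
    return re.sub(r"[A-Z]\d?$", "", compact)

def build_reference_set(citations: list[dict]) -> dict[str, dict]:
    """Merge examiner citations from multiple offices into one deduplicated
    reference set keyed by the harmonized identifier."""
    merged: dict[str, dict] = {}
    for cite in citations:
        key = normalize_pub_number(cite["pub_number"])
        entry = merged.setdefault(key, {"id": key, "offices": set()})
        entry["offices"].add(cite["office"])
    return merged
```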
Throughout the dataset construction process, we adhere to the principle of minimal disclosure while implementing rigorous quality control and consistency checks. This ensures that the resulting sample set is representative, stable, well-distributed, and fair—providing a reliable foundation for evaluating novelty search performance in real-world patent examination contexts.
Illustration of a single data point
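Schematically, each data point pairs a test question with its gold answer set. The sketch below is hypothetical; all field names and identifiers are invented for illustration, and the actual schema is not published:

```python
# Hypothetical shape of one benchmark datapoint (illustrative only).
sample = {
    "question": {               # the "test question"
        "language": "en",
        "ipc_section": "H",
        "disclosure": "Full technical description of the claimed invention...",
    },
    "answer": {                 # the "standard answer"
        "x_documents": ["EP1234567", "US7654321"],  # harmonized document IDs
        "citing_offices": ["EPO", "USPTO"],
    },
}
```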
2) Building the Dataset
The benchmark uses 89 validated test samples, with controlled distribution across:
Languages: 38.2% Chinese, 61.8% English
IPC classifications: evenly spread across sections A–H
Language distribution of patent texts in 89 test samples
Distribution of IPC codes across 89 test samples
3) Defining Evaluation Metrics
Evaluation metrics are used to measure and compare performance. They can be single or combined indicators, carefully designed by experts to reflect the practical needs of patent professionals.
Novelty search aims to identify relevant prior art to determine whether a patent claim is truly new. It follows the general principles of traditional search logic, but with a specialized focus on patent validity.
The Patsnap PatentBench uses the following indicators to measure the quality of search results:
X Hit Rate: proportion of samples where at least one correct answer appears among the top-K results.
For each test sample, if any X document from the “X Document Collection” appears in the top-K results, the sample is marked “1”; otherwise, “0.” The proportion of samples marked “1” is the X Hit Rate. Varying K (e.g., top 100, top 5, top 1) reflects how well AI tools perform prior art searches under different time constraints.
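As a concrete reading of this definition, here is a minimal Python sketch, assuming each sample carries an ordered results list of harmonized document IDs and its gold X-document set (field names are illustrative, not the benchmark’s actual schema):

```python
def hit_rate_at_k(samples: list[dict], k: int) -> float:
    """Fraction of test samples where at least one gold X document appears
    in the tool's top-k results. Field names are illustrative."""
    hits = 0
    for sample in samples:
        top_k = set(sample["results"][:k])
        if top_k & set(sample["x_documents"]):
            hits += 1
    return hits / len(samples)
```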
X Recall Rate: measures how comprehensively an AI tool retrieves X documents, which is crucial during R&D planning and pre-filing. A higher recall helps teams refine technical solutions and draft stronger claims. It is calculated as the proportion of X documents retrieved in the top 100 results, relative to the total number of X documents across all test samples.
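Note that this definition pools X documents across all samples rather than averaging per-sample recall. Under the same assumed schema as before:

```python
def recall_at_k(samples: list[dict], k: int = 100) -> float:
    """Pooled X Recall Rate: X documents retrieved in the top-k results,
    summed over all samples, divided by the total number of X documents
    across all samples (not a per-sample average)."""
    found = total = 0
    for sample in samples:
        gold = set(sample["x_documents"])
        total += len(gold)
        found += len(gold & set(sample["results"][:k]))
    return found / total
```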
This study uses the top 100 search results to calculate both the X Hit Rate and the X Recall Rate. This range is wide enough to catch important prior art yet still manageable for manual review, matching how patent professionals typically search.
4) Comparison Tools
This benchmark compares:
Patsnap’s Novelty Search AI Agent
ChatGPT-o3 (web-enabled)
DeepSeek-R1 (web-enabled)
Patsnap’s AI Agent is purpose-built for patent novelty search, automatically generating reports from technical input. In contrast, ChatGPT-o3 and DeepSeek-R1 are general-purpose language models known for their reasoning capabilities.
Key Findings
Benchmark results show that Patsnap’s Novelty Search AI Agent achieved a 76% X Hit Rate and a 32% X Recall Rate within the top 100 results, significantly outperforming two leading general-purpose AI tools.
1) X Hit Rate
Patsnap’s Novelty Search AI Agent successfully identified at least one relevant X document in 76% of test cases—an essential capability for speeding up decision-making in patent examination and early-stage R&D.
X Hit Rate: the percentage of tests with a relevant hit in the top 100 results
2) X Recall Rate
Patsnap’s Novelty Search AI Agent retrieved 32% of all relevant X documents, enabling more thorough analysis and more informed patent claim drafting. A high X Recall Rate is key during R&D planning and pre-filing: it helps teams, whether in-house researchers, patent professionals, or external agents, find more of the relevant X documents, supporting better technical decisions and stronger patent claims and increasing the chances of patent approval.
X Recall Rate: the share of X documents found in the top 100 results
3) Typical Test Result Sample
In this test, the patent specification (the “problem statement”) was submitted to each AI tool. Their results were then evaluated against a predefined set of X documents (the “model answer”).
Patsnap’s Novelty Search AI Agent successfully identified all four relevant patent families within the top 100 results, achieving an X Hit Rate of 100% and an X Recall Rate of 100%.
By comparison, both ChatGPT-o3 and DeepSeek-R1 also achieved a 100% X Hit Rate. However, ChatGPT retrieved only one relevant patent family, leading to a much lower X Recall Rate of 25%, while DeepSeek failed to retrieve any, resulting in an X Recall Rate of 0%.
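For concreteness, the Patsnap and ChatGPT-o3 scores for this sample drop directly out of the metric sketches above. The family IDs F1–F4 below are placeholders for the four gold patent families:

```python
# Usage example: reproducing this sample's scores with hit_rate_at_k and
# recall_at_k from the sketches above. "other" pads the remaining slots.
gold = ["F1", "F2", "F3", "F4"]
patsnap = {"x_documents": gold, "results": gold + ["other"] * 96}
chatgpt = {"x_documents": gold, "results": ["F2"] + ["other"] * 99}

print(hit_rate_at_k([patsnap], k=100))  # 1.0  -> 100% X Hit Rate
print(recall_at_k([patsnap], k=100))    # 1.0  -> 100% X Recall Rate
print(hit_rate_at_k([chatgpt], k=100))  # 1.0  -> 100% X Hit Rate
print(recall_at_k([chatgpt], k=100))    # 0.25 -> 25% X Recall Rate
```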
These findings highlight that while general-purpose LLMs excel in reasoning, they struggle with highly specialized tasks like patent novelty search. In comparison, domain-specific AI tools like Patsnap’s Novelty Search AI Agent offer superior accuracy and relevance, underscoring their essential role in patent-focused workflows.
A single-sample benchmark test
Future Research
The initial dataset of 89 samples represents the first phase of testing. Future benchmarks will expand this dataset and refine evaluation methods for greater accuracy and coverage.
In real-world patent work, professionals consider more than just retrieval quality—they also weigh factors like efficiency and cost. Key trade-offs include:
Depth vs. speed: How thorough the search needs to be versus how quickly results are needed
In-house vs. outsourced: Whether to conduct searches internally or rely on external experts
Risk vs. resources: Balancing the chance of missing critical prior art against the time and cost of exhaustive searches
Patsnap Novelty Search AI Agent
Use Cases and Impact
Patsnap’s Novelty Search AI Agent stands out in benchmark testing thanks to its domain-specific fine-tuning and advanced Retrieval-Augmented Generation (RAG) technology.
Built on an open-source base model, the agent has been systematically refined with specialized patent knowledge, allowing it to understand the nuances of patent language and search logic. By integrating RAG, it combines real-time data retrieval with generative capabilities, enabling high-quality, low-hallucination search results. This empowers the agent to accurately identify key technical features, apply precise search strategies, and outperform general-purpose models in professional novelty search tasks.
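Patsnap’s pipeline itself is proprietary, but the retrieve-then-generate pattern it builds on can be sketched in a few lines. Here a TF-IDF index over a toy corpus stands in for the real patent search backend, and the final LLM call is left as a stub; everything below is an illustrative assumption, not the agent’s actual implementation:

```python
# Minimal retrieve-then-generate (RAG) sketch. A TF-IDF index over a toy
# corpus stands in for the real patent search backend.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = {  # toy stand-in for a patent index: {doc_id: abstract}
    "EP1234567": "A lithium battery electrode with silicon nanowires for higher capacity.",
    "US7654321": "Method for coating graphite anodes to improve cycling stability.",
    "CN10987654": "Solid-state electrolyte composition for lithium cells.",
}

vectorizer = TfidfVectorizer().fit(corpus.values())
doc_matrix = vectorizer.transform(corpus.values())
doc_ids = list(corpus)

def retrieve(query: str, k: int = 3) -> list[str]:
    """Rank indexed documents against the technical disclosure."""
    scores = cosine_similarity(vectorizer.transform([query]), doc_matrix)[0]
    return [doc_ids[i] for i in scores.argsort()[::-1][:k]]

def novelty_prompt(disclosure: str) -> str:
    """Ground the generative step on retrieved prior art."""
    evidence = "\n".join(f"[{d}] {corpus[d]}" for d in retrieve(disclosure))
    return (
        "Assess the novelty of the disclosure against the prior art below.\n"
        f"Disclosure: {disclosure}\nPrior art:\n{evidence}"
    )

# In the real agent, this prompt would be sent to the fine-tuned LLM, which
# drafts a novelty report grounded in the retrieved documents.
print(novelty_prompt("A silicon-nanowire anode for lithium-ion batteries."))
```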
For IP professionals in corporations and patent firms, this translates into a powerful productivity boost. Tasks like searching, filtering, and ranking—once taking hours—can now be completed in minutes. This shift allows experts to spend less time on repetitive work and more on strategic analysis and decision-making, transforming workflows from “three days of searching” to “three hours of insight.”
R&D teams also benefit significantly during early-stage project evaluations. The agent enables fast and effective novelty searches from the outset, helping teams avoid investing in non-novel ideas and reducing wasted resources. This leads to a more efficient and impactful innovation process.