RLHF vs DPO in LLM fine-tuning: 60+ patent analysis


RLHF and DPO are the two dominant paradigms for aligning large language models with human preferences — but they differ fundamentally in architecture, stability, and deployment cost. Drawing from over 60 patents filed across China, the US, Korea, and WIPO, this analysis maps their tradeoffs and the hybrid approaches now displacing both.

PatSnap Insights Team, Innovation Intelligence Analysts · 11 min read
Reviewed by the PatSnap Insights editorial team

How RLHF works: the three-stage pipeline and four-model architecture

RLHF aligns LLM outputs with human preferences through three canonical stages: supervised fine-tuning (SFT) on human-authored demonstrations; reward model (RM) training, in which human annotators compare pairs of model outputs and a scalar reward model is trained to predict preference rankings; and reinforcement learning optimization — typically using Proximal Policy Optimization (PPO) — in which the reward model acts as an environment signal to push the language model policy toward higher-scoring responses. This three-stage structure is documented in detail by Peking University’s 2026 filing on game-theoretic preference optimization and confirmed by Tsinghua University’s 2024 training architecture patent.
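Stage 2 can be made concrete with a small sketch. The source does not state the exact reward-model loss, so this assumes the commonly used Bradley-Terry pairwise objective: the loss for one annotated comparison is the negative log-sigmoid of the score margin between the chosen and the rejected response. The function name and the example scores are illustrative only.

```python
import math


def pairwise_reward_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(score_chosen - score_rejected))."""
    margin = score_chosen - score_rejected
    # softplus(-margin) equals -log(sigmoid(margin)); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    # Hypothetical scalar scores from a reward model for one comparison.
    print(pairwise_reward_loss(2.0, 0.5))  # ranking agrees with the annotator
    print(pairwise_reward_loss(0.5, 2.0))  # ranking disagrees, so the loss is larger
```

Minimizing this loss over many comparisons pushes the scalar reward model to score preferred responses above rejected ones, which is the ranking behaviour the PPO stage then relies on.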

60+ patents and technical disclosures analysed
4 concurrent model instances in RLHF training
2 model instances required by DPO
2023–26 patent filing years surveyed

The reward model training stage is the central engineering challenge of RLHF. As documented by Beijing Baichuan Intelligent Technology (2024), the RLHF approach deploys four concurrent model components: a reward model, a critic model, an actor model, and the initialized LLM itself. The actor model generates candidate responses; the critic evaluates action quality; and the KL divergence between the initial policy and the actor’s policy is added as a regularization term to prevent the fine-tuned model from drifting too far from the pre-trained distribution. This KL penalty is essential to training stability and is a defining architectural feature of standard RLHF implementations.

KL Divergence Penalty in RLHF

In RLHF, the KL (Kullback–Leibler) divergence between the reference policy and the active policy model is added as a regularization term during PPO training. It prevents the fine-tuned model from drifting excessively from the pre-trained distribution — a safeguard against reward hacking and catastrophic forgetting. Small changes to the KL coefficient can produce dramatic performance swings, making it one of RLHF’s most sensitive hyperparameters.
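As a rough illustration of how the penalty enters training, the sketch below subtracts a KL term from the reward-model score. It assumes a widely used formulation in which the KL divergence is estimated on the sampled tokens as the sum of log-probability differences between the active policy and the reference policy; the name `kl_coef` and its default value are assumptions, not values taken from the patents.

```python
from typing import Sequence


def kl_shaped_reward(
    rm_score: float,
    policy_logprobs: Sequence[float],
    reference_logprobs: Sequence[float],
    kl_coef: float = 0.1,  # assumed default, the sensitive hyperparameter
) -> float:
    """Reward-model score minus a KL penalty estimated on the sampled tokens."""
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    # Single-sample KL estimate: sum of log pi(token) - log pi_ref(token).
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return rm_score - kl_coef * kl_estimate


if __name__ == "__main__":
    policy = [-0.2, -0.1, -0.3]     # hypothetical per-token log-probs
    reference = [-0.9, -0.8, -1.0]
    for coef in (0.05, 0.1, 0.5):
        print(coef, kl_shaped_reward(1.0, policy, reference, coef))
```

With these made-up numbers the same response moves from a clearly positive reward to a negative one as the coefficient grows, which is one way to see why the setting is so sensitive.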

PPO is the dominant RL algorithm within RLHF implementations. As described in a 2025 patent from Suzhou Yuannao Intelligent Technology, PPO samples prompts from a pool, generates responses, computes immediate rewards from the reward model alongside KL-regularized future returns, and iteratively updates the policy model using gradient descent on the advantage function. The computational overhead is substantial: the system must simultaneously maintain and query four distinct model instances. This multi-model requirement is one of RLHF’s most frequently cited engineering drawbacks — China Construction Bank’s 2024 filing explicitly notes that traditional PPO-based RLHF involves numerous hyperparameters and can lead to training instability and collapse.
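The policy update step can be sketched with the standard PPO clipped surrogate objective. The source does not say which PPO variant the patent uses, so this is an assumption; the `clip_eps` default of 0.2 is a common choice rather than a value from the filing, and the advantages are taken as given inputs.

```python
import math
from typing import Sequence


def ppo_clipped_objective(
    new_logprobs: Sequence[float],
    old_logprobs: Sequence[float],
    advantages: Sequence[float],
    clip_eps: float = 0.2,  # assumed clip range
) -> float:
    """Mean clipped surrogate objective (to be maximized) over sampled actions."""
    terms = []
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        ratio = math.exp(new_lp - old_lp)  # pi_new / pi_old for the sampled action
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # The minimum removes any gain from pushing the ratio outside the clip range.
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms)
```

In practice the negative of this value is minimized by gradient descent, alongside a value loss for the critic, which is where the extra model instances come from.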

Figure 1 — RLHF three-stage training pipeline: model instances per stage
[Figure 1 diagram: Stage 1, supervised fine-tuning (SFT), 1 model; Stage 2, reward model training, 2 models; Stage 3, PPO-based RL optimization, 4 concurrent models (actor, critic, reward, reference).]
RLHF’s computational burden peaks at Stage 3, where four concurrent model instances must be maintained and queried simultaneously — the primary engineering drawback cited across the patent corpus.

RLHF requires four concurrent model instances at training time — a reward model, a critic model, an actor (policy) model, and the initialized reference LLM — as documented by Beijing Baichuan Intelligent Technology (2024) and Peking University (2026).

How DPO works: collapsing reward learning into classification

Direct Preference Optimization eliminates the explicit reward model entirely by reformulating preference alignment as a binary classification problem over response pairs. Instead of asking “which response maximizes reward?”, DPO asks “does the chosen response have a higher implicit reward than the rejected one?” and optimizes that objective directly on preference pair data (a chosen response versus a rejected response) through a cross-entropy-type loss function. This reformulation, articulated in Ant Zhixin’s 2025 patent, reduces the training infrastructure from four model instances to two: the policy model being trained and a frozen reference SFT model.

Key finding

Zhejiang University’s 2024 medical LLM fine-tuning patent explicitly recommends DPO over RLHF for domain-specific applications precisely because DPO does not require an explicit reward function or a separately trained reward model — reducing both engineering overhead and annotation cost in data-constrained settings.

The mathematical structure of DPO relies on a reference model to provide a baseline. The policy model is updated to increase the relative log-probability of preferred responses over rejected ones. As formalized in Alipay (Hangzhou)’s 2025 training patent, the DPO loss is a function of the log-probability ratios of the policy model (πθ) relative to the reference model (πref) for both chosen (yw) and rejected (yl) responses, scaled by a temperature hyperparameter β. The reference model is the anchor: it stays frozen while the policy is updated, so the two models remain distinct and the ratios measure how far the policy has moved from its starting point. According to WIPO filing data, this streamlined two-model approach has driven rapid adoption of DPO across Chinese university and industry filings from 2024 onward.
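Written out in its standard form, the loss for one preference pair is −log σ(β · [(log πθ(yw|x) − log πref(yw|x)) − (log πθ(yl|x) − log πref(yl|x))]), where σ is the logistic sigmoid. The sketch below evaluates that expression for a single pair, assuming the summed sequence log-probabilities have already been computed; the argument names and the `beta` default are illustrative.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed default for the temperature
) -> float:
    """DPO loss for one preference pair, given summed sequence log-probabilities."""
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)), written as a numerically stable softplus(-logits).
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


if __name__ == "__main__":
    # Policy identical to the reference: the loss is log(2).
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
    # Policy has shifted probability toward the chosen response: lower loss.
    print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

Only two sets of log-probabilities are needed per pair, one from the policy and one from the frozen reference, which is the two-model footprint described above.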

Despite its architectural simplicity, DPO has documented limitations. Midu Technology’s 2025 patent reports that DPO is susceptible to overfitting on limited human preference datasets, reducing generalization to unseen prompts. Additionally, the reference model itself must be trained and maintained, adding implementation complexity, and the model’s performance ceiling is bounded by the capability of the reference model. BOE Technology’s 2025 patent provides a concise summary: DPO uses an implicit reward function derived from policy and reference model probability ratios to bypass the explicit reward modeling and multi-stage training flow of traditional RLHF.

“DPO reformulates the RLHF problem: instead of asking which response maximizes reward, it converts the task into a binary classification problem — and directly optimizes that objective using preference pair data through a cross-entropy-type loss function.”


DPO (Direct Preference Optimization) eliminates the explicit reward model by converting preference optimization into a binary classification problem over chosen and rejected response pairs, requiring only two model instances — the policy model and a frozen reference SFT model — compared to RLHF’s four.

Head-to-head: stability, reward hacking, and online vs. offline learning

The most consequential differences between RLHF and DPO emerge not in their stated objectives — both aim to align model outputs with human preferences — but in their failure modes, computational requirements, and suitability for online versus offline training regimes.

Training stability and hyperparameter sensitivity

RLHF implementations using PPO are widely noted for training instability. CloudWalk Technology’s 2025 patent confirms that RLHF training consumes large computational resources, is prone to convergence difficulties due to RL’s inherent instability, and is extremely sensitive to hyperparameters — small changes to the KL divergence coefficient can produce dramatic performance swings. Du Xiaoman Technology’s 2025 patent adds that coordinated updates between the actor and critic models are necessary to prevent instability and oscillation, and reward signals must be calibrated at both sentence and token levels. DPO’s loss function is more stable by design, but introduces its own sensitivity: the temperature parameter β controls the influence of the reference policy, and the balance between overfitting to the preference dataset and retaining pre-trained capabilities must be carefully managed.

Reward hacking vs. dataset overfitting

RLHF’s explicit reward model is both its strength and its vulnerability. The reward model learns a rich, continuous scoring function over responses, which in principle provides nuanced feedback. However, as flagged by Miaomu Ltd.’s 2025 patent, reward models in RLHF are sensitive to noise in human annotations, exhibit limited generalization in deployment settings, and are susceptible to reward hacking — where the policy model exploits artifacts in the reward model rather than genuinely improving quality. DPO avoids this failure mode by embedding an implicit reward signal in the cross-entropy loss, but at the cost of relying entirely on the quality and coverage of the preference dataset: if the chosen/rejected pairs are noisy or unrepresentative, DPO overfits or learns incorrect distinctions.
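The implicit reward that DPO embeds can be inspected directly: up to a prompt-dependent constant, it is β times the log-probability ratio between the policy and the reference. The following sketch, with hypothetical helper names, computes that quantity and the fraction of preference pairs in which the chosen response receives the higher implicit reward.

```python
from typing import Iterable, Tuple

# (policy_logp_chosen, ref_logp_chosen, policy_logp_rejected, ref_logp_rejected)
PairLogProbs = Tuple[float, float, float, float]


def implicit_reward(policy_logp: float, ref_logp: float, beta: float = 0.1) -> float:
    """DPO's implicit reward, up to a prompt-dependent constant."""
    return beta * (policy_logp - ref_logp)


def preference_accuracy(pairs: Iterable[PairLogProbs], beta: float = 0.1) -> float:
    """Fraction of pairs whose chosen response gets the higher implicit reward."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one preference pair")
    correct = sum(
        implicit_reward(pc, rc, beta) > implicit_reward(pr, rr, beta)
        for pc, rc, pr, rr in pairs
    )
    return correct / len(pairs)
```

Comparing this accuracy on the training pairs with the accuracy on held-out pairs is one simple way to notice the dataset overfitting described above; a large gap suggests the policy has memorized the pairs rather than learned the underlying preference.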

Figure 2 — RLHF vs. DPO: comparative dimension scores across five key criteria
[Figure 2 bar chart. Scores on a 0 to 4 scale, where a higher score means a greater challenge in that dimension, listed as RLHF / DPO: Infrastructure Complexity 4 / 2; Training Instability 4 / 2; Reward Hacking Risk 4 / 1; Complex Task Generalization 4 / 2; Dataset Overfitting Risk 1 / 3.]
Scores derived from patent corpus analysis. RLHF scores higher on infrastructure complexity, training instability, and reward hacking risk; DPO scores higher on dataset overfitting risk and lower on complex-task generalization.

Online vs. offline learning: a structural divide

Standard RLHF is an online learning process — the policy model generates new responses, receives reward signals, and updates iteratively in a feedback loop. DPO is primarily an offline learning process that optimizes over a fixed dataset of preference pairs. Alipay (Hangzhou)’s 2025 patent highlights that vanilla DPO requires upfront human annotation — design of questions and answers plus manual scoring — before training begins, which limits training efficiency and accuracy. Their proposed multi-round self-iterative DPO approach addresses this by using the model itself to generate and score candidate answers iteratively, partially simulating the online dynamics of RLHF.
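One round of such a self-labelling loop might look like the sketch below. The patent's actual procedure is not reproduced in the source, so this is only an assumed shape: `generate` and `score` are hypothetical callables standing in for the model's sampling and self-scoring steps, and the best-versus-worst pairing rule is an illustrative choice.

```python
from typing import Callable, Dict, List


def build_preference_pairs(
    prompts: List[str],
    generate: Callable[[str, int], List[str]],  # hypothetical: (prompt, n) -> n candidates
    score: Callable[[str, str], float],         # hypothetical: (prompt, answer) -> score
    num_candidates: int = 4,
) -> List[Dict[str, str]]:
    """One self-labelling round: best-scored candidate is 'chosen', worst is 'rejected'."""
    pairs = []
    for prompt in prompts:
        candidates = generate(prompt, num_candidates)
        scored = sorted(
            ((score(prompt, answer), answer) for answer in candidates),
            key=lambda item: item[0],
        )
        if len(scored) < 2 or scored[0][0] == scored[-1][0]:
            continue  # no usable contrast for this prompt
        pairs.append(
            {"prompt": prompt, "chosen": scored[-1][1], "rejected": scored[0][1]}
        )
    return pairs
```

The resulting pairs would feed a DPO update, after which the updated model generates the next round of candidates. That repetition is what brings the offline method closer to RLHF's online feedback loop.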

On complex tasks requiring nuanced reasoning or multi-objective optimization, RLHF with PPO retains structural advantages. Suzhou Yuannao Intelligent Technology’s 2026 patent states explicitly that although DPO provides a simpler solution, RL-based training remains indispensable for generalization and universality on complex tasks, as it guides the model through reward mechanisms to make decisions more consistent with human logic. Research published by Nature on reinforcement learning in sequential decision-making corroborates the generalization advantages of online policy update methods over static offline optimization.

Standard RLHF is an online learning process in which the policy model generates new responses and updates iteratively via PPO, while DPO is an offline learning process that optimizes over a fixed dataset of preference pairs — a structural distinction with significant implications for generalization on complex tasks.

Application domains and the patent landscape

The patent corpus surveyed spans filings from 2023 through 2026 across China, the United States, Korea, and WIPO. Dominant assignees by filing volume include Alipay (Hangzhou), Google LLC, DeepMind Technologies, Microsoft Technology Licensing, Baidu, Tencent, and multiple Chinese universities including Peking University, Zhejiang University, Harbin Institute of Technology, Southeast University, and Tsinghua University. Both RLHF and DPO are deployed across financial services, healthcare, and general-purpose language tasks — but with different architectural preferences by domain.

Financial applications include multiple patents from Alipay and China Construction Bank, where RLHF-based reward model training uses preference rankings from financial domain experts. Actimize Ltd.’s 2026 patent applies the full RLHF pipeline with a reward model and PPO specifically for fraud detection in financial tabular data. Medical LLM fine-tuning is addressed by Zhejiang University’s 2024 patent, which explicitly prefers DPO for reduced engineering overhead in data-constrained clinical settings. Google LLC’s 2026 patent extends RLHF by replacing human annotators with search engine feedback signals, demonstrating that the reward signal source can be automated without fundamentally changing the RLHF architecture — a finding with significant implications for annotation cost. Standards bodies such as IEEE have similarly noted the scalability challenges of human-in-the-loop training for production AI systems.


Microsoft Technology Licensing’s 2026 patent covers in-situ user interaction feedback as a continuous alignment signal — an approach that blurs the boundary between online RLHF and offline DPO by capturing preference signals from live deployment rather than pre-annotated datasets. Chinese university filings contribute theoretical extensions including game-theoretic DPO alignment (Peking University), multi-reward confidence-weighted DPO, bias-intensity-weighted DPO loss functions, and minimax robustness frameworks — indicating that the academic frontier has moved well beyond vanilla implementations of either method.

Google LLC’s 2026 patent on reinforcement learning with search engine feedback demonstrates that human annotation is not a fixed architectural requirement of RLHF — automated feedback sources can replace human annotators without changing the core RLHF pipeline structure.

Beyond RLHF and DPO: hybrid frameworks and next-generation alternatives

A significant patent cluster involves hybrid methods that integrate SFT, DPO, and PPO within unified training pipelines, a trend that reflects the practical limitations of deploying either method in isolation. Baidu’s 2024 patent proposes combining SFT with the DPO and PPO algorithms in a unified alignment learning framework, explicitly stating that current frameworks merging only SFT and DPO lose the advantages of online reinforcement learning, and that methods merging DPO and PPO represent the next generation of alignment training.

Qingdao Ant Robot’s 2025 patent addresses the fundamental conflict between SFT (which encourages imitation of expert behavior) and RL (which encourages policy exploration) by employing a dual-path parallel architecture with a dynamic weight fusion mechanism that smoothly transitions from imitation learning to exploration — a principled engineering solution to the instability that arises when switching abruptly between training objectives.
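The source does not describe the fusion mechanism in detail. As one plausible reading, the sketch below blends an imitation (SFT) loss and an RL loss with a weight that decays smoothly over training. The cosine schedule and both function names are assumptions chosen for illustration, not the patented design.

```python
import math


def imitation_weight(step: int, total_steps: int) -> float:
    """Cosine schedule that decays smoothly from 1.0 (pure SFT) to 0.0 (pure RL)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


def fused_loss(sft_loss: float, rl_loss: float, step: int, total_steps: int) -> float:
    """Blend the imitation and exploration objectives with a step-dependent weight."""
    w = imitation_weight(step, total_steps)
    return w * sft_loss + (1.0 - w) * rl_loss
```

Because the weight changes continuously, the optimizer never sees an abrupt switch of objective, which is the instability the dual-path design is meant to avoid.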

Figure 3 — Alignment training paradigm evolution: from RLHF to hybrid and next-generation methods
[Figure 3 diagram: RLHF (4 models, PPO, online, complex; Baichuan, PKU) → DPO (2 models, BCE, offline, simpler; Zhejiang U, Ant) → Hybrid SFT+DPO+PPO (unified pipeline; Baidu, Qingdao Ant) → Next-gen Bayesian / tied posterior preferences (Google, DeepMind), with federated RLHF (Google) shown alongside.]
The patent corpus shows a clear migration from pure RLHF toward hybrid and DPO-centric pipelines for production deployment, with Google and DeepMind advancing Bayesian and tied preference alternatives at the research frontier.

Novel alternatives to both RLHF and DPO are also emerging from large-lab research. Google LLC’s 2025 Posterior Preference Optimization patent takes a Bayesian approach to preference fine-tuning, training the model to predict posterior token probabilities conditioned on human preferences — preserving pre-trained prediction distributions while adding sequential preference-tuned predictions, and addressing the RLHF weakness that balancing multiple preference models is difficult without degrading individual compliance. DeepMind Technologies’ 2025 Tied Preference Optimization patent extends DPO by explicitly modeling tied preferences — cases where neither response is preferred — which standard DPO’s binary formulation cannot handle. DeepMind’s companion 2025 patent on direct posterior preference fine-tuning provides a supervised variant that predicts posterior token probabilities conditioned on positive preferences without requiring additional per-vocabulary-token inference at decoding time. According to PatSnap’s IP intelligence platform, filings in this next-generation preference optimization space have accelerated significantly since 2024.
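To show what modeling ties means in practice, the sketch below uses Davidson's classical extension of the Bradley-Terry model, which adds a third outcome for "neither preferred". Whether the DeepMind patent uses this particular form is not stated in the source; the function name and the `tie_strength` parameter are illustrative assumptions.

```python
import math
from typing import Tuple


def davidson_probabilities(
    reward_a: float, reward_b: float, tie_strength: float = 1.0
) -> Tuple[float, float, float]:
    """Return (P(a preferred), P(b preferred), P(tie)) under Davidson's tie model."""
    # Subtract the larger reward before exponentiating for numerical stability.
    shift = max(reward_a, reward_b)
    strength_a = math.exp(reward_a - shift)
    strength_b = math.exp(reward_b - shift)
    tie_mass = tie_strength * math.sqrt(strength_a * strength_b)
    total = strength_a + strength_b + tie_mass
    return strength_a / total, strength_b / total, tie_mass / total
```

Setting `tie_strength` to zero recovers the two-outcome Bradley-Terry probabilities that underlie standard DPO, so annotator "tie" labels can carry signal instead of being discarded or forced into a binary choice.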

Google LLC’s federated RLHF patent demonstrates RLHF extended to federated learning settings, where multiple user devices each run local reward models and aggregate scores to a central server for model training — preserving privacy while expanding the diversity of human feedback signals. This architecture is structurally more compatible with RLHF’s explicit reward model than with DPO’s offline preference dataset approach, since federated reward scoring maps naturally to the reward model training stage. The PatSnap Insights blog covers related developments in federated AI and privacy-preserving machine learning.
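The server-side aggregation step can be sketched as follows. The source does not give the patent's aggregation rule, so a weighted mean per response is assumed here; the report format and the function name are hypothetical.

```python
from typing import Dict, List, Tuple


def aggregate_reward_scores(
    device_reports: List[Tuple[Dict[str, float], int]],
) -> Dict[str, float]:
    """Weighted mean of per-response scores; each report is (scores, weight)."""
    totals: Dict[str, float] = {}
    weights: Dict[str, float] = {}
    for scores, weight in device_reports:
        for response_id, score in scores.items():
            totals[response_id] = totals.get(response_id, 0.0) + weight * score
            weights[response_id] = weights.get(response_id, 0.0) + weight
    return {rid: totals[rid] / weights[rid] for rid in totals if weights[rid] > 0}


if __name__ == "__main__":
    # Two hypothetical devices; the second is given three times the weight.
    reports = [({"r1": 1.0, "r2": 0.0}, 1), ({"r1": 0.0}, 3)]
    print(aggregate_reward_scores(reports))
```

In this sketch only scalar scores reach the server, not the underlying user feedback, which illustrates how score aggregation can preserve privacy.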

“Methods merging DPO and PPO represent the next generation of alignment training — combining DPO’s stability with PPO’s online exploration yields superior performance to either alone.”

Baidu’s 2024 patent proposes a unified alignment learning framework combining DPO and PPO, explicitly stating that methods merging DPO and PPO represent the next generation of alignment training, as frameworks merging only SFT and DPO lose the advantages of online reinforcement learning.


References

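  1. 一种针对LLM模型的微调方法及相关设备 [A fine-tuning method for LLM models and related equipment]. Alipay (Hangzhou) Information Technology Co., Ltd., 2025
  2. 大语言模型LLM的偏好对齐训练方法、电子设备及存储介质 [Preference alignment training method for large language models (LLMs), electronic device, and storage medium]. Alipay (Hangzhou) Information Technology Co., Ltd., 2025
  3. Posterior Preference Optimization. Google LLC, 2025
  4. 用于大语言模型的强化学习的方法和装置 [Method and apparatus for reinforcement learning of large language models]. Beijing Baichuan Intelligent Technology Co., Ltd., 2024
  5. 一种医疗LLM模型微调方法及相关设备 [A medical LLM fine-tuning method and related equipment]. Zhejiang University, 2024
  6. 大语言模型的偏好对齐训练方法、系统、介质及电子设备 [Preference alignment training method, system, medium, and electronic device for large language models]. Midu Technology Co., Ltd., 2025
  7. Fine-tuning large language model to predict and analyze tabular data using human preferences. Actimize Ltd., 2026
  8. 用于问答的模型的训练 [Training of models for question answering]. Miaomu Co., Ltd. (Beijing Youzhuju), 2025
  9. 一种大语言模型的微调方法、装置、存储介质及电子设备 [A fine-tuning method, apparatus, storage medium, and electronic device for large language models]. Ant Zhixin (Hangzhou) Information Technology Co., Ltd., 2025
  10. Aligning large language models with in-situ user interactions and feedback. Microsoft Technology Licensing, LLC, 2026
  11. Fine-tuning large language model(s) using reinforcement learning with search engine feedback. Google LLC, 2026
  12. 大语言模型的训练方法及装置 [Training method and apparatus for large language models]. Alipay (Hangzhou) Information Technology Co., Ltd., 2025
  13. 一种大语言模型的强化学习训练方法及相关设备 [A reinforcement learning training method for large language models and related equipment]. Du Xiaoman Technology (Beijing) Co., Ltd., 2025
  14. 基于主从博弈偏好优化的大语言模型对齐方法及系统 [Large language model alignment method and system based on leader-follower (Stackelberg) game preference optimization]. Peking University, 2026
  15. 对齐模型的训练方法、信息处理方法及装置 [Training method for an alignment model, information processing method, and apparatus]. Beijing Baidu Netcom Technology Co., Ltd., 2024
  16. Tied Preference Optimization for Sequence Processing Models. DeepMind Technologies, 2025
  17. Direct posterior preference fine-tuning. DeepMind Technologies, 2025
  18. Federated RLHF via reward score aggregation and targeted improvement of reward models. Google LLC
  19. WIPO, World Intellectual Property Organization: Global Patent Data
  20. Nature: Reinforcement Learning and Sequential Decision-Making Research
  21. IEEE: Standards and Research on AI Training Systems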

All data and statistics in this article are sourced from the references above and from PatSnap’s proprietary innovation intelligence platform.
