
RLHF vs DPO in LLM fine-tuning: 60+ patent analysis


RLHF and DPO represent two competing paradigms for aligning large language models with human preferences — one using an explicit reward model and online reinforcement learning, the other folding the entire objective into a single classification loss. Drawing from over 60 patents filed between 2023 and 2026, this analysis maps the architectural tradeoffs, failure modes, and the hybrid approaches now defining the production frontier.

PatSnap Insights Team, Innovation Intelligence Analysts
Reviewed by the PatSnap Insights editorial team

How RLHF works: the three-stage pipeline and four-model architecture

Reinforcement Learning from Human Feedback aligns LLM outputs with human preferences through a three-stage pipeline: supervised fine-tuning (SFT) on high-quality demonstrations, explicit reward model training on human-annotated preference pairs, and PPO-based reinforcement learning optimization. At each stage, the engineering complexity compounds — and by the time the RL phase begins, the system must simultaneously run four distinct model instances.

60+ patents analysed (2023–2026)
4 concurrent model instances in RLHF training
2 model instances required by DPO
3 RLHF training stages (SFT → RM → PPO)

The RL stage is the central engineering challenge. As documented in a 2024 patent from Beijing Baichuan Intelligent Technology, its four concurrent components are: a reward model (scores responses), a critic model (evaluates action quality), an actor model (generates candidate responses), and the frozen reference LLM. The KL divergence between the actor’s policy and the initial (reference) policy is added as a regularization term to prevent the fine-tuned model from drifting too far from the pre-trained distribution, a defining architectural feature of standard RLHF implementations.

RLHF requires four concurrent model instances during the PPO-based RL phase: a reward model, a critic model, an actor (policy) model, and a frozen reference LLM. The KL divergence between the reference and actor policies is used as a regularization term to prevent training instability.
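As a rough illustration of the KL regularization term, the sketch below subtracts a sampled-token KL estimate from the reward model's score. The function name, the `kl_coef` default, and the per-token log-probability inputs are assumptions made for this example; the patents cited above do not specify them.

```python
import math

def kl_penalized_reward(rm_score, actor_logprobs, ref_logprobs, kl_coef=0.1):
    """Reward-model score minus a KL penalty estimated on the sampled tokens.

    actor_logprobs / ref_logprobs: per-token log-probabilities that the actor
    and the frozen reference model assign to the SAME sampled response.
    kl_coef is a hypothetical default, not a value taken from any patent.
    """
    if len(actor_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must have equal length")
    # Single-sample estimate of KL(actor || reference) over the response tokens.
    kl_estimate = sum(a - r for a, r in zip(actor_logprobs, ref_logprobs))
    return rm_score - kl_coef * kl_estimate

# The actor has drifted: it rates its own sample as more likely than the reference does.
actor = [math.log(0.9), math.log(0.8)]
ref = [math.log(0.5), math.log(0.4)]
print(kl_penalized_reward(1.0, actor, ref))
```

When the actor and reference agree on every token the penalty vanishes and the policy is rewarded with the raw reward-model score; the further the actor drifts, the more of that score is taken back.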

PPO (Proximal Policy Optimization) is the dominant RL algorithm within RLHF implementations. As described in a 2025 patent from Suzhou Yuannao Intelligent Technology, PPO samples prompts from a pool, generates responses, computes immediate rewards from the reward model alongside KL-regularized future returns, and iteratively updates the policy with gradient steps weighted by the estimated advantage of each sampled response. The multi-model requirement is one of RLHF’s most frequently cited engineering drawbacks, and it is compounded by tuning difficulty: a 2024 patent from China Construction Bank explicitly notes that traditional PPO-based RLHF involves numerous hyperparameters and can lead to training instability and collapse.
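The advantage-weighted update can be sketched with the standard PPO clipped surrogate objective. This is the textbook form of PPO rather than the formulation of any particular patent; `clip_eps` and the flat per-sample inputs are assumptions for illustration.

```python
import math

def ppo_clipped_objective(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over sampled actions (to be maximised).

    new_logprobs: log-probs under the policy being updated.
    old_logprobs: log-probs under the policy that generated the samples.
    clip_eps: hypothetical clipping range; 0.2 is a common default.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        ratio = math.exp(new_lp - old_lp)
        clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Taking the minimum removes any incentive to move the ratio
        # outside [1 - clip_eps, 1 + clip_eps] in the favourable direction.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)

print(ppo_clipped_objective([math.log(2.0)], [0.0], [1.0]))
```

The clipping range is one of the hyperparameters that has to be tuned alongside the KL coefficient and learning rates, which is part of why PPO-based pipelines are described as sensitive.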

Tsinghua University’s 2024 filing describes the full pipeline: pre-training, then reward model training on human-labeled comparison data, followed by PPO-based RL fine-tuning. Annotators rank responses rather than assign absolute scores, a deliberate design choice that reduces the variance introduced by differences in individual annotator calibration.
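One common way to train a reward model on ranked comparisons is a pairwise logistic (Bradley-Terry style) loss. The sketch below assumes that formulation; the filing itself is only described here as using rankings, so the exact loss is an assumption.

```python
import math

def pairwise_ranking_loss(score_chosen, score_rejected):
    """-log(sigmoid(score_chosen - score_rejected)), computed stably.

    Scores are scalar reward-model outputs for the preferred and the
    dispreferred response to the same prompt.
    """
    margin = score_chosen - score_rejected
    # softplus(-margin) equals -log(sigmoid(margin)); branch to avoid overflow.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

print(pairwise_ranking_loss(2.0, 0.0))   # ranking respected: small loss
print(pairwise_ranking_loss(0.0, 2.0))   # ranking violated: large loss
```

Because the loss depends only on the difference between the two scores, adding a constant offset to both leaves it unchanged. That is the sense in which rankings sidestep disagreements about where an absolute scale should sit.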

Figure 1 — RLHF three-stage training pipeline: model count per stage
[Bar chart: concurrent model instances per training stage. Stage 1 (SFT): 1; Stage 2 (reward model training): 2; Stage 3 (PPO RL fine-tuning): 4.]
RLHF’s computational footprint escalates across its three stages: SFT requires a single model, reward model training requires two, and the PPO RL phase requires four concurrent instances — the primary driver of its infrastructure overhead.

How DPO works: reward-free preference optimization

Direct Preference Optimization eliminates the explicit reward model entirely by reformulating preference alignment as a binary classification problem: instead of asking “which response maximizes a learned reward?”, DPO asks “which of these two responses was preferred?” and optimizes that objective directly over preference pair data using a cross-entropy-type loss function. This single reformulation removes the need for the RL training loop and reduces the required model instances from four to two.

What is the DPO loss function?

The DPO loss is expressed as a function of the log-probability ratios of the policy model (π_θ) relative to the reference model (π_ref) for both chosen (y_w) and rejected (y_l) responses, weighted by a temperature hyperparameter β. The reference model provides the anchor, and the two models must maintain a degree of independence to ensure effective optimization. This formulation was described in a 2025 Alipay (Hangzhou) patent filing.
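Written out, the standard DPO objective has the form below. The prompt x, the preference dataset D, and the logistic function σ are notation added here for completeness; the policy, reference, chosen and rejected symbols follow the description above.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        \;-\;
        \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

The term inside σ is the difference between the two log-probability ratios scaled by β, so the loss falls as the policy raises the chosen response's likelihood relative to the reference faster than it raises the rejected one's.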

The standard DPO pipeline is two-stage: SFT followed by DPO preference optimization over chosen/rejected pairs. Zhejiang University’s 2024 patent on medical LLM fine-tuning explicitly recommends DPO over RLHF precisely because it does not require an explicit reward function or a separately trained reward model — a significant advantage in data-constrained or domain-specific deployment scenarios where the engineering overhead of RLHF is prohibitive.

DPO (Direct Preference Optimization) converts LLM preference alignment into a binary classification problem over chosen and rejected response pairs, using a cross-entropy-type loss function weighted by a temperature hyperparameter β. It requires only two models — the policy model and a frozen reference SFT model — compared to RLHF’s four.
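A minimal per-pair sketch of this loss, assuming the sequence log-probabilities have already been computed by the policy and the frozen reference model. The function name and the `beta=0.1` default are illustrative choices, not values from the cited filings.

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response under
    either the trainable policy or the frozen reference model.
    """
    chosen_logratio = policy_chosen_lp - ref_chosen_lp
    rejected_logratio = policy_rejected_lp - ref_rejected_lp
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)), written as a numerically stable softplus.
    if logits > 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))

# Policy identical to the reference: the loss sits at log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy has shifted probability toward the chosen response: lower loss.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

Only two sets of log-probabilities are needed, which is where the two-model footprint comes from: one forward pass through the policy and one through the reference for each response.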

Despite its simplicity, DPO has documented limitations. A 2025 patent from Midu Technology reports that DPO is susceptible to overfitting on limited human preference datasets, reducing generalization to unseen prompts. Additionally, the reference model itself must be trained and maintained, and the model’s performance ceiling is bounded by the capability of that reference model. The temperature parameter β introduces its own sensitivity: it controls the influence of the reference policy, and the balance between overfitting to the preference dataset and retaining pre-trained capabilities must be carefully managed.

A structurally important limitation is DPO’s offline nature. A 2025 Alipay patent notes that vanilla DPO requires upfront human annotation — design of questions and answers plus manual scoring before training begins — which limits training efficiency. Their proposed multi-round self-iterative DPO approach addresses this by using the model itself to generate and score candidate answers iteratively, partially simulating the online dynamics of RLHF without the full four-model infrastructure.
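The multi-round idea can be sketched as a loop that rebuilds the preference pairs from the model's own outputs before each DPO pass. Everything named below (`generate`, `score`, `dpo_update`, the candidate count, and the toy stand-ins) is hypothetical scaffolding for illustration and is not taken from the Alipay filing.

```python
import random

def build_preference_pairs(prompts, generate, score, num_candidates=4):
    """Return (prompt, chosen, rejected) triples from self-generated candidates."""
    pairs = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(num_candidates)]
        ranked = sorted(candidates, key=lambda c: score(prompt, c))
        worst, best = ranked[0], ranked[-1]
        # Skip prompts where the model's own scoring cannot separate candidates.
        if score(prompt, best) > score(prompt, worst):
            pairs.append((prompt, best, worst))
    return pairs

def self_iterative_dpo(prompts, generate, score, dpo_update, rounds=3):
    """Each round: sample, self-score, then run one offline DPO pass."""
    for _ in range(rounds):
        pairs = build_preference_pairs(prompts, generate, score)
        dpo_update(pairs)

# Toy stand-ins so the loop runs end to end; a real system would call the LLM.
rng = random.Random(0)

def toy_generate(prompt):
    return prompt + " " + "x" * rng.randint(1, 10)

def toy_score(prompt, answer):
    return len(answer)

history = []
self_iterative_dpo(["q1", "q2"], toy_generate, toy_score, history.append)
print(len(history))
```

In a real pipeline `generate` and `score` would both be served by the current policy, so each round's pairs reflect the latest checkpoint. That is what lets the procedure approximate an online loop without a separate reward model or critic.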

“Although DPO provides a simpler solution, RL-based training remains indispensable for generalization and universality on complex tasks, as it guides the model through reward mechanisms to make decisions more consistent with human logic.”


Head-to-head: architecture, stability, and failure modes compared

The core tradeoffs between RLHF and DPO span four dimensions: training architecture complexity, reward signal fidelity, stability under training, and the online versus offline learning distinction. Each dimension reveals a different set of engineering risks that practitioners must weigh against their deployment context.

Training architecture complexity

RLHF requires constructing, training, and running inference on multiple concurrent models during the RL phase. As documented in a 2025 patent from Du Xiaoman Technology, coordinated updates between the actor and critic models are necessary to prevent instability and oscillation, and reward signals must be calibrated at both sentence and token levels. A 2025 patent from Beijing Zhongke Wenge Technology explicitly states that RLHF requires large-scale labeled data with human expert ranking followed by complex RL pipeline execution, whereas DPO uses offline preference pair data to directly optimize model parameters — a contrast that makes DPO significantly more accessible for teams without large-scale annotation infrastructure.

Reward signal fidelity and reward hacking

RLHF’s explicit reward model is both its strength and its vulnerability. The reward model learns a rich, continuous scoring function over responses, which in principle provides nuanced feedback. However, as flagged in a 2025 patent from Miaomu Ltd., reward models in RLHF are sensitive to noise in human annotations, exhibit limited generalization in deployment settings, and are susceptible to reward hacking — where the policy model exploits artifacts in the reward model rather than genuinely improving quality. This is a well-documented failure mode also tracked by researchers at DeepMind and Anthropic.

Reward hacking in RLHF occurs when the policy model learns to exploit artifacts in the reward model rather than genuinely improving response quality. DPO has no separately trained reward model for the policy to exploit, because the reward signal is implicit in its cross-entropy loss, but it is instead prone to overfitting on limited or noisy preference datasets.

Stability and hyperparameter sensitivity

RLHF implementations using PPO are widely noted for training instability. A 2025 patent from CloudWalk Technology confirms that RLHF training consumes large computational resources, is prone to convergence difficulties due to RL’s inherent instability, and is extremely sensitive to hyperparameters such as the KL divergence coefficient: small changes can produce dramatic performance swings. DPO’s loss function is more stable, but introduces its own sensitivity through the β temperature parameter. According to WIPO’s patent filings database, the volume of patents addressing RLHF instability has increased substantially in the 2023–2026 period, reflecting the scale of the engineering challenge.

Figure 2 — RLHF vs. DPO: comparative attribute scores across five dimensions
RLHF vs. DPO comparison across key LLM fine-tuning dimensions (score 1–9; higher = more of that attribute):
Infrastructure complexity: RLHF 9, DPO 3
Reward signal richness: RLHF 8, DPO 4
Training stability: RLHF 3, DPO 7
Online learning capability: RLHF 9, DPO 2
Domain-specific practicality: RLHF 4, DPO 8
Scores are qualitative summaries derived from patent claims and technical disclosures in the corpus. RLHF leads on reward richness and online learning; DPO leads on stability and domain-specific practicality. Infrastructure complexity strongly favours DPO.

Online vs. offline learning

Standard RLHF is an online learning process — the policy model generates new responses, receives reward signals, and updates iteratively in a feedback loop. DPO is primarily an offline learning process that optimizes over a fixed dataset of preference pairs. This distinction has significant implications for generalization: RLHF’s online loop allows continuous policy improvement beyond the static training set, while DPO’s performance ceiling is bounded by the coverage of the preference dataset. For complex tasks requiring nuanced reasoning or multi-objective optimization, a 2026 patent from Suzhou Yuannao Intelligent Technology states explicitly that RL-based training remains indispensable for generalization and universality.

Key finding: the online/offline gap

Vanilla DPO requires upfront human annotation before training begins, limiting training efficiency and accuracy. Multi-round self-iterative DPO approaches — where the model generates and scores its own candidate answers — partially close this gap, but do not fully replicate RLHF’s online feedback dynamics, according to a 2025 Alipay (Hangzhou) patent filing.

Application domains, hybrid approaches, and the next generation

Both RLHF and DPO are deployed across diverse domains in the patent corpus, but the choice between them is rarely arbitrary — it reflects specific constraints around data availability, engineering capacity, and task complexity. Meanwhile, a third category of approaches is emerging that moves beyond both paradigms.

Domain-specific deployments

Financial applications include work from Alipay and China Construction Bank, where RLHF-based reward model training uses preference rankings from domain experts. Actimize Ltd.’s 2026 patent applies the full RLHF pipeline with a reward model and PPO for fraud detection in financial tabular data. Medical LLM fine-tuning, as addressed in Zhejiang University’s 2024 patent, explicitly prefers DPO for reduced engineering overhead. Google LLC’s 2026 patent extends RLHF by replacing human annotators with search engine feedback signals, demonstrating that the reward signal source can be automated without fundamentally changing the RLHF architecture — a finding consistent with alignment research published by Nature on automated feedback mechanisms.

Hybrid SFT+DPO+PPO unified frameworks

A significant patent cluster involves hybrid methods that integrate SFT, DPO, and PPO within unified training pipelines. Baidu’s 2024 patent proposes combining SFT with the DPO and PPO algorithms in a single alignment learning framework. It explicitly notes that current frameworks merging only SFT and DPO lose the advantages of online reinforcement learning, and that methods merging DPO and PPO represent the next generation of alignment training. Qingdao Ant Robot’s 2025 patent addresses the conflict between SFT (which encourages imitation of expert behavior) and RL (which encourages policy exploration) by employing a dual-path parallel architecture with a dynamic weight fusion mechanism that smoothly transitions from imitation learning to exploration.
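A dynamic weight fusion of this kind can be pictured as a time-varying blend of the two losses. The cosine schedule and the function names below are assumptions chosen for illustration; the patent's actual fusion rule is not described here.

```python
import math

def imitation_weight(step, total_steps):
    """Cosine schedule: 1.0 at step 0 (pure SFT) down to 0.0 at the end (pure RL)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

def fused_loss(sft_loss, rl_loss, step, total_steps):
    """Blend the imitation (SFT) loss and the exploration (RL) loss."""
    w = imitation_weight(step, total_steps)
    return w * sft_loss + (1.0 - w) * rl_loss

for step in (0, 250, 500, 750, 1000):
    print(step, round(imitation_weight(step, 1000), 3))
```

A smooth schedule avoids the abrupt objective change of running SFT to completion and then switching to RL, which is the conflict the dual-path design is meant to soften.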


Next-generation alternatives: Bayesian and tied preference methods

Novel alternatives to both RLHF and DPO are emerging from large-lab research divisions. Google LLC’s Posterior Preference Optimization (2025) takes a Bayesian approach, training the model to predict posterior token probabilities conditioned on human preferences, thereby preserving pre-trained prediction distributions while adding sequential preference-tuned predictions. This addresses the RLHF weakness that balancing multiple preference models is difficult without degrading individual compliance. DeepMind Technologies’ Tied Preference Optimization (2025) extends DPO by explicitly modeling tied preferences — cases where neither response is preferred — which standard DPO’s binary formulation cannot handle. DeepMind also filed Direct Posterior Preference Fine-Tuning (2025), a supervised variant that directly predicts posterior token probabilities conditioned on positive preferences without requiring additional per-vocabulary-token inference at decoding time.

Google LLC’s Posterior Preference Optimization (2025 patent) trains LLMs to predict posterior token probabilities conditioned on human preferences, preserving pre-trained distributions while adding preference-tuned predictions. DeepMind’s Tied Preference Optimization (2025 patent) extends DPO to handle tied preferences — cases where neither response is preferred — which standard DPO cannot model.
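To see why ties need their own treatment, one classical option is Davidson's extension of the Bradley-Terry model, which reserves probability mass for a "no preference" outcome. This is offered only as an illustration of the modeling problem: it is not claimed to be the formulation in the DeepMind filing, and `tie_strength` is a hypothetical parameter.

```python
import math

def davidson_probabilities(reward_a, reward_b, tie_strength=0.5):
    """Return (P(a preferred), P(b preferred), P(tie)) under Davidson's model."""
    strength_a = math.exp(reward_a)
    strength_b = math.exp(reward_b)
    # Tie mass peaks when the two responses are equally strong.
    tie_mass = tie_strength * math.sqrt(strength_a * strength_b)
    total = strength_a + strength_b + tie_mass
    return strength_a / total, strength_b / total, tie_mass / total

print(davidson_probabilities(1.0, 1.0))   # evenly matched: ties are likely
print(davidson_probabilities(3.0, 0.0))   # clear winner: ties are unlikely
```

Setting `tie_strength` to zero recovers the two-outcome Bradley-Terry probabilities that underlie standard DPO, which makes the binary formulation a special case with no room for "equally good" labels.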

Federated and privacy-preserving RLHF

Google LLC’s Federated RLHF patent demonstrates RLHF extended to federated learning settings, where multiple user devices each run local reward models and aggregate scores to a central server for model training — preserving privacy while expanding the diversity of human feedback signals. This architecture is structurally more compatible with RLHF’s explicit reward model than with DPO’s offline preference dataset approach, since federated reward scoring maps naturally to the reward model training stage. The implications for privacy-sensitive domains such as healthcare and finance are significant, aligning with standards tracked by ISO on federated and privacy-preserving machine learning.
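The server-side aggregation step might look like the sketch below, which assumes each device reports a scalar reward per response and that the server takes a simple mean. The data layout, the function name, and the choice of averaging are assumptions; the filing is only described here as aggregating scores.

```python
from statistics import mean

def aggregate_reward_scores(device_scores):
    """Average locally computed rewards per response across devices.

    device_scores: {device_id: {response_id: local_reward}}
    Only scalar scores leave each device; raw user feedback stays local.
    """
    collected = {}
    for scores in device_scores.values():
        for response_id, reward in scores.items():
            collected.setdefault(response_id, []).append(reward)
    return {response_id: mean(values) for response_id, values in collected.items()}

reports = {
    "device_a": {"resp_1": 0.8, "resp_2": 0.1},
    "device_b": {"resp_1": 0.6},
}
print(aggregate_reward_scores(reports))
```

Because the aggregated quantity is a reward score, it can feed the RL phase directly. An offline preference dataset would instead require collecting the chosen and rejected texts centrally.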

Who is filing what: key players and innovation trends across the patent corpus

The patent landscape surveyed spans filings from 2023 through 2026 across China, the United States, Korea, and WIPO. The dominant assignees by filing volume include Alipay (Hangzhou), Google LLC, DeepMind Technologies, Microsoft Technology Licensing, Baidu, Tencent, and multiple Chinese universities. Their strategic focus areas are distinct and reveal the competitive shape of the field.

Alipay (Hangzhou) / Ant Group’s multiple CN filings focus on unified multi-stage fine-tuning pipelines integrating RLHF and DPO, iterative self-training DPO, and token-level reward granularity improvements for RLHF. Google LLC’s US and WO filings expand RLHF beyond human feedback to automated feedback sources (search engines), federated reward aggregation, and Bayesian posterior preference optimization. DeepMind Technologies is advancing the theoretical frontier with tied preference handling and direct posterior fine-tuning. Microsoft Technology Licensing covers in-situ user interaction feedback as a continuous alignment signal. Baidu, Beijing Baichuan, Tencent, and Zhipu AI are focused on multi-model ensemble reward learning, actor-critic stabilization, and integrated alignment frameworks combining SFT, DPO, and PPO.

Chinese universities — Peking University, Zhejiang University, Harbin Institute of Technology, Southeast University, Tsinghua University, and Northeast University — are contributing theoretical extensions including game-theoretic DPO alignment, multi-reward confidence-weighted DPO, bias-intensity-weighted DPO loss functions, and minimax robustness frameworks. These academic filings, tracked through PatSnap’s patent search platform, represent a distinct innovation layer from the production-focused industry filings.

A clear trend across the corpus is the migration from pure RLHF implementations toward hybrid or DPO-centric pipelines for production deployment, while RLHF retains primacy for complex multi-objective generalization tasks. Simultaneously, the academic and large-lab patent filings show rapid innovation in replacing both approaches with principled alternatives grounded in Bayesian inference, game theory, and multi-objective optimization. Practitioners seeking to navigate this landscape can access the full corpus through PatSnap Eureka’s AI-native innovation intelligence tools.

Across 60+ patents filed between 2023 and 2026, the dominant trend in LLM alignment is migration from pure RLHF toward hybrid SFT+DPO+PPO unified frameworks for production deployment, while RLHF with PPO retains advantages for complex multi-objective generalization tasks. Assignees include Alipay, Google LLC, DeepMind, Microsoft, Baidu, and multiple Chinese universities.


References

  1. Alipay (Hangzhou) — Fine-tuning method for LLM models and related equipment (2025)
  2. Alipay (Hangzhou) — Preference alignment training method for LLM, electronic device and storage medium (2025)
  3. Google LLC — Posterior Preference Optimization (2025)
  4. Beijing Baichuan Intelligent Technology — Methods and apparatus for reinforcement learning for large language models (2024)
  5. Zhejiang University — Medical LLM model fine-tuning method and related equipment (2024)
  6. Midu Technology — Preference alignment training method for large language models, system, medium and electronic device (2025)
  7. Actimize Ltd. — Fine-tuning large language model to predict and analyze tabular data using human preferences (2026)
  8. Miaomu Co., Ltd. (Beijing Youzhuju) — Training of models for question answering (2025)
  9. Ant Zhixin (Hangzhou) — Fine-tuning method for large language models, apparatus, storage medium and electronic device (2025)
  10. Microsoft Technology Licensing — Aligning large language models with in-situ user interactions and feedback (2026)
  11. Google LLC — Fine-tuning large language model(s) using reinforcement learning with search engine feedback (2026)
  12. Alipay (Hangzhou) — Training method and apparatus for large language models (2025)
  13. Du Xiaoman Technology — Reinforcement learning training method for large language models and related equipment (2025)
  14. Peking University — LLM alignment method and system based on master-slave game preference optimization (2026)
  15. Beijing Baidu Netcom Technology — Training method for alignment model, information processing method and apparatus (2024)
  16. DeepMind Technologies — Tied Preference Optimization for Sequence Processing Models (2025)
  17. DeepMind Technologies — Direct posterior preference fine-tuning (2025)
  18. Google LLC — Federated RLHF via reward score aggregation and targeted improvement of reward models
  19. WIPO — World Intellectual Property Organization patent database
  20. Nature — Research on automated feedback mechanisms in machine learning alignment
  21. ISO — Standards on federated and privacy-preserving machine learning

All data and statistics in this article are sourced from the references above and from PatSnap’s proprietary innovation intelligence platform.
