How RLHF works: the three-stage pipeline and four-model architecture
RLHF aligns LLM outputs with human preferences through three canonical stages: supervised fine-tuning (SFT) on human-authored demonstrations; reward model (RM) training, in which human annotators compare pairs of model outputs and a scalar reward model is trained to predict preference rankings; and reinforcement learning optimization — typically using Proximal Policy Optimization (PPO) — in which the reward model acts as an environment signal to push the language model policy toward higher-scoring responses. This three-stage structure is documented in detail by Peking University’s 2026 filing on game-theoretic preference optimization and confirmed by Tsinghua University’s 2024 training architecture patent.
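The reward-model stage can be made concrete with a small sketch. The source says only that a scalar reward model is trained to predict preference rankings. The Bradley-Terry pairwise loss used here is a common choice and is assumed, not taken from the cited filings, and the function names are illustrative.

```python
import math
from typing import Iterable, Tuple


def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood for one annotated comparison.

    Equals -log(sigmoid(score_chosen - score_rejected)): small when the
    reward model scores the human-preferred response higher, large otherwise.
    """
    margin = score_chosen - score_rejected
    # softplus(-margin) is the same quantity, written in an overflow-safe form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def mean_preference_loss(score_pairs: Iterable[Tuple[float, float]]) -> float:
    """Average the pairwise loss over (chosen_score, rejected_score) pairs."""
    losses = [pairwise_preference_loss(c, r) for c, r in score_pairs]
    if not losses:
        raise ValueError("at least one comparison is required")
    return sum(losses) / len(losses)
```

In a real pipeline the two scores would come from the same reward network applied to both responses for a prompt, and this loss would be minimized by gradient descent over the annotated comparisons.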
The reward model training stage is the central engineering challenge of RLHF. As documented by Beijing Baichuan Intelligent Technology (2024), the RLHF approach deploys four concurrent model components: a reward model, a critic model, an actor model, and the initialized LLM itself. The actor model generates candidate responses; the critic evaluates action quality; and the KL divergence between the initial policy and the actor’s policy is added as a regularization term to prevent the fine-tuned model from drifting too far from the pre-trained distribution. This KL penalty is essential to training stability and is a defining architectural feature of standard RLHF implementations.
In RLHF, the KL (Kullback–Leibler) divergence between the reference policy and the active policy model is added as a regularization term during PPO training. It prevents the fine-tuned model from drifting excessively from the pre-trained distribution — a safeguard against reward hacking and catastrophic forgetting. Small changes to the KL coefficient can produce dramatic performance swings, making it one of RLHF’s most sensitive hyperparameters.
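One common way to apply the KL term is to fold it into the per-token reward seen by PPO, as sketched below. This is a sketch under stated assumptions: the per-token log-ratio estimate of the KL divergence, the placement of the reward-model score on the final token, and the names `kl_shaped_rewards` and `kl_coef` are illustrative conventions, not details from the cited patents.

```python
from typing import List, Sequence


def kl_shaped_rewards(
    rm_score: float,
    policy_logprobs: Sequence[float],
    ref_logprobs: Sequence[float],
    kl_coef: float = 0.1,
) -> List[float]:
    """Per-token rewards: a KL penalty on every token, RM score on the last.

    policy_logprobs[t] and ref_logprobs[t] are the log-probabilities that the
    actor and the frozen reference model assign to the t-th sampled token.
    Their difference is a single-sample estimate of the KL divergence.
    """
    if len(policy_logprobs) != len(ref_logprobs) or not policy_logprobs:
        raise ValueError("log-prob sequences must be non-empty and equal length")
    rewards = [
        -kl_coef * (lp - lr) for lp, lr in zip(policy_logprobs, ref_logprobs)
    ]
    # The scalar reward-model score is credited once, at the end of the response.
    rewards[-1] += rm_score
    return rewards
```

Raising `kl_coef` keeps the actor closer to the reference distribution at the cost of a weaker reward signal. This trade-off is what makes the coefficient so sensitive in practice.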
PPO is the dominant RL algorithm within RLHF implementations. As described in a 2025 patent from Suzhou Yuannao Intelligent Technology, PPO samples prompts from a pool, generates responses, computes immediate rewards from the reward model alongside KL-regularized future returns, and iteratively updates the policy model with gradient steps on an advantage-weighted objective. The computational overhead is substantial: the system must simultaneously maintain and query four distinct model instances. This multi-model requirement is one of RLHF’s most frequently cited engineering drawbacks. China Construction Bank’s 2024 filing explicitly notes that traditional PPO-based RLHF involves numerous hyperparameters and can lead to training instability and collapse.
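The update described above can be sketched in a few lines. This is a simplified illustration, not the patented procedure. It assumes advantages are computed as discounted returns minus the critic's value estimates (production systems often use generalized advantage estimation instead) and uses the standard PPO clipped surrogate. All function names are illustrative.

```python
import math
from typing import List, Sequence


def discounted_returns(rewards: Sequence[float], gamma: float = 1.0) -> List[float]:
    """Return-to-go for each step: r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} ..."""
    returns: List[float] = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def advantages(
    rewards: Sequence[float], values: Sequence[float], gamma: float = 1.0
) -> List[float]:
    """Advantage = observed return minus the critic's predicted value."""
    return [g - v for g, v in zip(discounted_returns(rewards, gamma), values)]


def ppo_clip_loss(
    new_logprobs: Sequence[float],
    old_logprobs: Sequence[float],
    advs: Sequence[float],
    clip_eps: float = 0.2,
) -> float:
    """Clipped surrogate loss (to be minimized) averaged over tokens."""
    terms = []
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advs):
        ratio = math.exp(new_lp - old_lp)  # pi_new(a|s) / pi_old(a|s)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Taking the smaller term removes the incentive for oversized updates.
        terms.append(min(ratio * adv, clipped * adv))
    if not terms:
        raise ValueError("at least one token is required")
    return -sum(terms) / len(terms)
```

The four-model cost is visible in the inputs: `rewards` depends on the reward model and the reference model, `values` comes from the critic, and the log-probabilities come from the actor.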
RLHF requires four concurrent model instances at training time — a reward model, a critic model, an actor (policy) model, and the initialized reference LLM — as documented by Beijing Baichuan Intelligent Technology (2024) and Peking University (2026).
How DPO works: collapsing reward learning into classification
Direct Preference Optimization eliminates the explicit reward model entirely by reformulating preference alignment as a binary classification problem. Instead of asking “which response maximizes reward?”, DPO asks “is this response’s implicit reward high or low?” — and optimizes that objective directly using preference pair data (a chosen response versus a rejected response) through a cross-entropy-type loss function. This reformulation, articulated in Ant Zhixin’s 2025 patent, reduces the training infrastructure from four model instances to two: the policy model being trained and a frozen reference SFT model.
Zhejiang University’s 2024 medical LLM fine-tuning patent explicitly recommends DPO over RLHF for domain-specific applications precisely because DPO does not require an explicit reward function or a separately trained reward model — reducing both engineering overhead and annotation cost in data-constrained settings.
The mathematical structure of DPO relies on a reference model to provide a baseline. The policy model is updated to increase the relative log-probability of preferred responses over rejected ones. As formalized in Alipay (Hangzhou)’s 2025 training patent, the DPO loss function is expressed as a function of the log-probability ratios of the policy model (πθ) relative to the reference model (πref) for both chosen (yw) and rejected (yl) responses, weighted by a temperature hyperparameter β. The reference model provides the anchor, and the two models must maintain a degree of independence to ensure effective optimization. According to WIPO filing data, this streamlined two-model approach has driven rapid adoption of DPO across Chinese university and industry filings from 2024 onward.
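The loss described above can be written out directly. The sketch below follows the standard published DPO objective for a single preference pair and is not copied from the patent text. It assumes each argument is the summed token log-probability of a full response, and the function name is illustrative.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (chosen, rejected) pair.

    loss = -log sigmoid( beta * [ (log pi_theta(y_w|x) - log pi_ref(y_w|x))
                                - (log pi_theta(y_l|x) - log pi_ref(y_l|x)) ] )
    """
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)) in an overflow-safe softplus form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))
```

When the policy equals the reference model, both log-ratios are zero and the loss is log 2. The loss falls as the policy raises the chosen response's probability relative to the rejected one. A larger β makes the same log-ratio gap count for more.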
Despite its architectural simplicity, DPO has documented limitations. Midu Technology’s 2025 patent reports that DPO is susceptible to overfitting on limited human preference datasets, reducing generalization to unseen prompts. Additionally, the reference model itself must be trained and maintained, adding implementation complexity, and the model’s performance ceiling is bounded by the capability of the reference model. BOE Technology’s 2025 patent provides a concise summary: DPO uses an implicit reward function derived from policy and reference model probability ratios to bypass the explicit reward modeling and multi-stage training flow of traditional RLHF.
“DPO reformulates the RLHF problem: instead of asking which response maximizes reward, it converts the task into a binary classification problem — and directly optimizes that objective using preference pair data through a cross-entropy-type loss function.”
DPO (Direct Preference Optimization) eliminates the explicit reward model by converting preference optimization into a binary classification problem over chosen and rejected response pairs, requiring only two model instances (the policy model and a frozen reference SFT model) compared to RLHF’s four.
Head-to-head: stability, reward hacking, and online vs. offline learning
The most consequential differences between RLHF and DPO emerge not in their stated objectives — both aim to align model outputs with human preferences — but in their failure modes, computational requirements, and suitability for online versus offline training regimes.
Training stability and hyperparameter sensitivity
RLHF implementations using PPO are widely noted for training instability. CloudWalk Technology’s 2025 patent confirms that RLHF training consumes large computational resources, is prone to convergence difficulties due to RL’s inherent instability, and is extremely sensitive to hyperparameters — small changes to the KL divergence coefficient can produce dramatic performance swings. Du Xiaoman Technology’s 2025 patent adds that coordinated updates between the actor and critic models are necessary to prevent instability and oscillation, and reward signals must be calibrated at both sentence and token levels. DPO’s loss function is more stable by design, but introduces its own sensitivity: the temperature parameter β controls the influence of the reference policy, and the balance between overfitting to the preference dataset and retaining pre-trained capabilities must be carefully managed.
Reward hacking vs. dataset overfitting
RLHF’s explicit reward model is both its strength and its vulnerability. The reward model learns a rich, continuous scoring function over responses, which in principle provides nuanced feedback. However, as flagged by Miaomu Ltd.’s 2025 patent, reward models in RLHF are sensitive to noise in human annotations, exhibit limited generalization in deployment settings, and are susceptible to reward hacking — where the policy model exploits artifacts in the reward model rather than genuinely improving quality. DPO avoids this failure mode by embedding an implicit reward signal in the cross-entropy loss, but at the cost of relying entirely on the quality and coverage of the preference dataset: if the chosen/rejected pairs are noisy or unrepresentative, DPO overfits or learns incorrect distinctions.
Online vs. offline learning: a structural divide
Standard RLHF is an online learning process — the policy model generates new responses, receives reward signals, and updates iteratively in a feedback loop. DPO is primarily an offline learning process that optimizes over a fixed dataset of preference pairs. Alipay (Hangzhou)’s 2025 patent highlights that vanilla DPO requires upfront human annotation — design of questions and answers plus manual scoring — before training begins, which limits training efficiency and accuracy. Their proposed multi-round self-iterative DPO approach addresses this by using the model itself to generate and score candidate answers iteratively, partially simulating the online dynamics of RLHF.
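A rough sketch of what a multi-round, self-generated preference loop might look like follows. It is not the procedure from the Alipay filing. The `generate`, `score`, and `train_on_pairs` callables are hypothetical placeholders supplied by the caller, and choosing the best- and worst-scored candidates as the pair is an assumption.

```python
from typing import Callable, List, Sequence, Tuple

PreferencePair = Tuple[str, str, str]  # (prompt, chosen, rejected)


def build_preference_pairs(
    prompts: Sequence[str],
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    num_candidates: int = 4,
) -> List[PreferencePair]:
    """Sample candidates per prompt and pair the best-scored with the worst."""
    pairs: List[PreferencePair] = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(num_candidates)]
        scored = [(score(prompt, c), c) for c in candidates]
        best = max(scored, key=lambda item: item[0])
        worst = min(scored, key=lambda item: item[0])
        if best[0] > worst[0]:  # a tie carries no preference signal
            pairs.append((prompt, best[1], worst[1]))
    return pairs


def self_iterative_dpo(
    prompts: Sequence[str],
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    train_on_pairs: Callable[[List[PreferencePair]], None],
    rounds: int = 3,
    num_candidates: int = 4,
) -> List[int]:
    """Alternate pair construction and DPO training; return pairs per round."""
    pairs_per_round: List[int] = []
    for _ in range(rounds):
        pairs = build_preference_pairs(prompts, generate, score, num_candidates)
        train_on_pairs(pairs)  # expected to update the model behind `generate`
        pairs_per_round.append(len(pairs))
    return pairs_per_round
```

Because `generate` is re-queried each round after training, later rounds draw preference pairs from the updated policy. This is the sense in which such a loop approximates online behavior.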
On complex tasks requiring nuanced reasoning or multi-objective optimization, RLHF with PPO retains structural advantages. Suzhou Yuannao Intelligent Technology’s 2026 patent states explicitly that although DPO provides a simpler solution, RL-based training remains indispensable for generalization and universality on complex tasks, as it guides the model through reward mechanisms to make decisions more consistent with human logic. Research published by Nature on reinforcement learning in sequential decision-making corroborates the generalization advantages of online policy update methods over static offline optimization.
Standard RLHF is an online learning process in which the policy model generates new responses and updates iteratively via PPO, while DPO is an offline learning process that optimizes over a fixed dataset of preference pairs — a structural distinction with significant implications for generalization on complex tasks.
Application domains and the patent landscape
The patent corpus surveyed spans filings from 2023 through 2026 across China, the United States, Korea, and WIPO. Dominant assignees by filing volume include Alipay (Hangzhou), Google LLC, DeepMind Technologies, Microsoft Technology Licensing, Baidu, Tencent, and multiple Chinese universities including Peking University, Zhejiang University, Harbin Institute of Technology, Southeast University, and Tsinghua University. Both RLHF and DPO are deployed across financial services, healthcare, and general-purpose language tasks — but with different architectural preferences by domain.
Financial applications include multiple patents from Alipay and China Construction Bank, where RLHF-based reward model training uses preference rankings from financial domain experts. Actimize Ltd.’s 2026 patent applies the full RLHF pipeline with a reward model and PPO specifically for fraud detection in financial tabular data. Medical LLM fine-tuning is addressed by Zhejiang University’s 2024 patent, which explicitly prefers DPO for reduced engineering overhead in data-constrained clinical settings. Google LLC’s 2026 patent extends RLHF by replacing human annotators with search engine feedback signals, demonstrating that the reward signal source can be automated without fundamentally changing the RLHF architecture — a finding with significant implications for annotation cost. Standards bodies such as IEEE have similarly noted the scalability challenges of human-in-the-loop training for production AI systems.
Microsoft Technology Licensing’s 2026 patent covers in-situ user interaction feedback as a continuous alignment signal, an approach that blurs the boundary between online RLHF and offline DPO by capturing preference signals from live deployment rather than pre-annotated datasets. Chinese university filings contribute theoretical extensions including game-theoretic DPO alignment (Peking University), multi-reward confidence-weighted DPO, bias-intensity-weighted DPO loss functions, and minimax robustness frameworks, indicating that the academic frontier has moved well beyond vanilla implementations of either method.
Google LLC’s 2026 patent on reinforcement learning with search engine feedback demonstrates that human annotation is not a fixed architectural requirement of RLHF — automated feedback sources can replace human annotators without changing the core RLHF pipeline structure.
Beyond RLHF and DPO: hybrid frameworks and next-generation alternatives
A significant patent cluster involves hybrid methods that integrate SFT, DPO, and PPO within unified training pipelines — a trend that reflects the practical limitations of deploying either method in isolation. Baidu’s 2024 patent proposes combining DPO and PPO algorithms for SFT training into a unified alignment learning framework, explicitly stating that current frameworks merging only SFT and DPO lose the advantages of online reinforcement learning, and that methods merging DPO and PPO represent the next generation of alignment training.
Qingdao Ant Robot’s 2025 patent addresses the fundamental conflict between SFT (which encourages imitation of expert behavior) and RL (which encourages policy exploration) by employing a dual-path parallel architecture with a dynamic weight fusion mechanism that smoothly transitions from imitation learning to exploration — a principled engineering solution to the instability that arises when switching abruptly between training objectives.
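The idea of a smooth hand-off between objectives can be illustrated with a scheduled loss blend. The source does not specify the fusion rule, so the cosine schedule and the simple convex combination below are assumptions chosen for illustration, as are the function names.

```python
import math


def imitation_weight(step: int, total_steps: int) -> float:
    """Weight on the imitation (SFT) loss, decaying smoothly from 1 to 0."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


def fused_loss(sft_loss: float, rl_loss: float, step: int, total_steps: int) -> float:
    """Blend both paths' losses so neither objective switches on abruptly."""
    weight = imitation_weight(step, total_steps)
    return weight * sft_loss + (1.0 - weight) * rl_loss
```

Early in training the blended loss is dominated by imitation, and by the final step it is entirely the RL objective. No step has a discontinuous change in what is being optimized.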
Novel alternatives to both RLHF and DPO are also emerging from large-lab research. Google LLC’s 2025 Posterior Preference Optimization patent takes a Bayesian approach to preference fine-tuning, training the model to predict posterior token probabilities conditioned on human preferences — preserving pre-trained prediction distributions while adding sequential preference-tuned predictions, and addressing the RLHF weakness that balancing multiple preference models is difficult without degrading individual compliance. DeepMind Technologies’ 2025 Tied Preference Optimization patent extends DPO by explicitly modeling tied preferences — cases where neither response is preferred — which standard DPO’s binary formulation cannot handle. DeepMind’s companion 2025 patent on direct posterior preference fine-tuning provides a supervised variant that predicts posterior token probabilities conditioned on positive preferences without requiring additional per-vocabulary-token inference at decoding time. According to PatSnap’s IP intelligence platform, filings in this next-generation preference optimization space have accelerated significantly since 2024.
Google LLC’s federated RLHF patent demonstrates RLHF extended to federated learning settings, where multiple user devices each run local reward models and aggregate scores to a central server for model training, preserving privacy while expanding the diversity of human feedback signals. This architecture is structurally more compatible with RLHF’s explicit reward model than with DPO’s offline preference dataset approach, since federated reward scoring maps naturally to the reward model training stage.
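The server-side aggregation step might look like the following sketch. The filing's actual aggregation rule is not given in the source, so the unweighted mean and the report format (a mapping from a response identifier to a locally computed score) are assumptions.

```python
from collections import defaultdict
from statistics import fmean
from typing import Dict, Iterable, List, Mapping


def aggregate_device_scores(
    device_reports: Iterable[Mapping[str, float]],
) -> Dict[str, float]:
    """Average locally computed reward scores per response across devices.

    Each report maps a response identifier to the score assigned by that
    device's local reward model. Only scores reach the server, not the
    underlying user feedback.
    """
    scores_by_response: Dict[str, List[float]] = defaultdict(list)
    for report in device_reports:
        for response_id, score in report.items():
            scores_by_response[response_id].append(score)
    return {rid: fmean(scores) for rid, scores in scores_by_response.items()}
```

The aggregated scores could then stand in for the single reward-model score in a standard RLHF training loop.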
“Methods merging DPO and PPO represent the next generation of alignment training — combining DPO’s stability with PPO’s online exploration yields superior performance to either alone.”
Baidu’s 2024 patent proposes a unified alignment learning framework combining DPO and PPO, explicitly stating that methods merging DPO and PPO represent the next generation of alignment training, as frameworks merging only SFT and DPO lose the advantages of online reinforcement learning.