*Figure: Token-level credit assignment on a GSM8K example (prompt: "A bag has a 5% discount. If it is marked $140, how much will you pay after the discount?"), comparing our method Q-RM with DPO-RM on chosen and rejected trajectories.*
- [2025.05.02] Q-RM has been accepted to ICML 2025.
Process reward models (PRMs) provide more nuanced supervision compared to outcome reward models (ORMs) for optimizing policy models, positioning them as a promising approach to enhancing the capabilities of LLMs in complex reasoning tasks.
In this study, we revisit token-level reward modeling through the framework of the Bradley-Terry model and maximum entropy RL. We decouple token reward modeling from language generation and derive a token-level reward model by optimizing a discriminative policy, which we term the Q-function Reward Model (Q-RM). Q-RM defines token-level credits using the logits of the discriminative model and can be trained from preference data without requiring fine-grained annotations.
Furthermore, we provide theoretical insights suggesting that computing advantage functions from the logits of the discriminative policy can align with leveraging optimal Q-functions. This property enables Q-RM to substitute for Q-functions and directly deliver token-level supervision signals for policy optimization. Consequently, integrating Q-RM with PPO may simplify advantage estimation and reduce the reliance on generalized advantage estimation (GAE).
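To make this concrete, below is a minimal sketch of the idea, not the authors' implementation: a scalar head over a causal backbone produces one logit per token, and preference pairs supervise those logits through a Bradley-Terry loss on aggregated sequence scores. The names `QRMSketch`, `q_head`, and `bradley_terry_loss` are our own illustrative choices.

```python
# Hedged sketch (assumptions ours, not the official code): per-token logits
# from a scalar head act as token-level credits; a Bradley-Terry objective
# over summed response-token scores trains them from preference pairs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QRMSketch(nn.Module):
    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone                  # any HF-style causal encoder
        self.q_head = nn.Linear(hidden_size, 1)   # one scalar logit per token

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        # (batch, seq_len): per-token logits used as token-level credits
        return self.q_head(out.last_hidden_state).squeeze(-1)

def bradley_terry_loss(q_chosen, q_rejected, mask_chosen, mask_rejected):
    """Pairwise preference loss over sequence scores obtained by summing
    per-token credits on the response tokens (the masks select them)."""
    score_c = (q_chosen * mask_chosen).sum(dim=-1)
    score_r = (q_rejected * mask_rejected).sum(dim=-1)
    return -F.logsigmoid(score_c - score_r).mean()
```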
| Library | Recommended Version |
| --- | --- |
| python | >=3.10 |
| torch | >=2.0.0 |
| transformers | >=4.51.0 |
Create a conda environment and install the dependencies:

```bash
conda create -n qrm python=3.10
conda activate qrm
pip install -r requirements.txt
```
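Optionally, a quick check that the installed versions match the table above:

```python
# Optional sanity check against the recommended versions.
import sys
import torch
import transformers

print(sys.version)               # expect Python >= 3.10
print(torch.__version__)         # expect >= 2.0.0
print(transformers.__version__)  # expect >= 4.51.0
```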
Unzip the training data:
```bash
unzip data/gsm8k/train/dpo/results.zip -d data/gsm8k/train/dpo/
unzip data/gsm8k/train/pairwise/results.zip -d data/gsm8k/train/pairwise/
unzip data/gsm8k/train/ppo/results.zip -d data/gsm8k/train/ppo/
unzip data/gsm8k/train/sft/results.zip -d data/gsm8k/train/sft/
unzip data/math/train/dpo/results.zip -d data/math/train/dpo/
unzip data/math/train/pairwise/results.zip -d data/math/train/pairwise/
unzip data/math/train/ppo/results.zip -d data/math/train/ppo/
unzip data/math/train/sft/results.zip -d data/math/train/sft/
```
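Equivalently, the archives can be extracted with a short Python loop (a convenience sketch, not part of the repo):

```python
# Extract every results.zip in place; mirrors the unzip commands above.
import zipfile
from pathlib import Path

for task in ("gsm8k", "math"):
    for stage in ("dpo", "pairwise", "ppo", "sft"):
        split_dir = Path("data") / task / "train" / stage
        with zipfile.ZipFile(split_dir / "results.zip") as zf:
            zf.extractall(split_dir)
```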
Before training starts, follow the instructions in the training scripts under the `scripts` directory to prepare your data and model checkpoints. Then launch Q-RM training:

```bash
cd scripts
sh train-q-rm.sh
```
If you want to evaluate your reward model, run the following commands:

```bash
cd scripts
sh evaluate-q-rm.sh
```
This will produce two PDF files (`chosen_1.pdf` and `rejected_1.pdf`) that visualize the token-level reward assignment for the chosen and rejected trajectories, respectively.
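For intuition, the plots resemble a per-token bar chart of credits. The sketch below uses made-up tokens and values from the GSM8K example above purely for illustration; the evaluation script's actual plotting code may differ.

```python
# Illustrative only: render made-up per-token credits as a bar chart
# and save it to a PDF, similar in spirit to chosen_1.pdf.
import matplotlib.pyplot as plt

tokens = ["140", "*", "0.05", "=", "7", ",", "140", "-", "7", "=", "133"]
credits = [0.6, 0.2, 0.9, 0.3, 1.1, 0.0, 0.5, 0.4, 1.0, 0.3, 1.4]

fig, ax = plt.subplots(figsize=(8, 2.5))
ax.bar(range(len(tokens)), credits)
ax.set_xticks(range(len(tokens)), tokens, rotation=45, ha="right")
ax.set_ylabel("token credit")
fig.tight_layout()
fig.savefig("chosen_1.pdf")
```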
To optimize a policy with REINFORCE using Q-RM for token-level rewards, run:

```bash
cd scripts
sh train-reinforce-with-q-rm.sh
```
To optimize a policy with PPO using Q-RM, run:

```bash
cd scripts
sh train-ppo-with-q-rm.sh
```
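As noted above, Q-RM's per-token logits can serve directly as advantage estimates, sidestepping GAE. Below is a hedged sketch of a PPO clipped objective wired this way; the function and tensor names are ours, not the repo's.

```python
# Sketch: PPO clipped surrogate where Q-RM token credits replace
# GAE-computed advantages; all names here are illustrative.
import torch

def ppo_loss_with_qrm(logp_new, logp_old, token_credits, response_mask,
                      clip_eps=0.2):
    # Normalize per-token credits as a simple advantage estimate.
    adv = (token_credits - token_credits.mean()) / (token_credits.std() + 1e-8)
    ratio = torch.exp(logp_new - logp_old)  # importance weights
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    # Maximize the clipped surrogate over response tokens only.
    surrogate = torch.minimum(unclipped, clipped) * response_mask
    return -surrogate.sum() / response_mask.sum()
```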
If you find this work useful, please cite:

```bibtex
@misc{chen2025discriminativepolicyoptimizationtokenlevel,
  title={Discriminative Policy Optimization for Token-Level Reward Models},
  author={Hongzhan Chen and Tao Yang and Shiping Gao and Ruijun Chen and Xiaojun Quan and Hongtao Tian and Ting Yao},
  year={2025},
  eprint={2505.23363},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.23363},
}
```