[Figure] Token credit assignment visualization of an example from GSM8K (prompt: "bag has a 5% discount. If it is marked $140, how much will you pay after the discount?"). We compare our method Q-RM with DPO-RM on trajectories $\tau^l$ and $\tau^w$.

Discriminative Policy Optimization for Token-Level Reward Models

Hongzhan Chen1, Tao Yang2*, Shiping Gao1, Ruijun Chen1, Xiaojun Quan1*, Hongtao Tian2, Ting Yao2
chenhzh59@mail2.sysu.edu.cn, luckytyang@tencent.com, quanxj3@mail.sysu.edu.cn
1Sun Yat-sen University  2WeChat Search, Tencent Inc.
*Corresponding authors

News

  • [2025.05.02] Q-RM has been accepted to ICML 2025.

Introduction

Process reward models (PRMs) provide more nuanced supervision compared to outcome reward models (ORMs) for optimizing policy models, positioning them as a promising approach to enhancing the capabilities of LLMs in complex reasoning tasks.

In this study, we revisit token-level reward modeling through the framework of the Bradley-Terry model and maximum entropy RL. We decouple token reward modeling from language generation and derive a token-level reward model by optimizing a discriminative policy, which we term the Q-function Reward Model (Q-RM). Q-RM defines token-level credits using the logits of the discriminative model and can be trained from preference data without requiring fine-grained annotations.
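
To make this concrete, below is a minimal, hypothetical sketch of how a discriminative token-level reward model could be trained from pairwise preference data with a Bradley-Terry objective. The function names, the sum aggregation of per-token logits into a sequence score, and the masking scheme are illustrative assumptions, not the repository's actual implementation.

import torch
import torch.nn.functional as F

def sequence_score(token_logits: torch.Tensor, response_mask: torch.Tensor) -> torch.Tensor:
    # Aggregate per-token logits of shape (batch, seq_len) into one scalar per sequence,
    # counting only response tokens (mask = 1). The sum aggregation is an assumption;
    # at inference time the per-token logits themselves serve as token-level credits.
    return (token_logits * response_mask).sum(dim=-1)

def pairwise_bt_loss(logits_w, mask_w, logits_l, mask_l):
    # Bradley-Terry loss: the chosen trajectory (w) should outscore the rejected one (l).
    return -F.logsigmoid(sequence_score(logits_w, mask_w) - sequence_score(logits_l, mask_l)).mean()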

Furthermore, we provide theoretical insights showing that advantage functions computed from the logits of the discriminative policy align with those induced by optimal Q-functions. This property allows Q-RM to stand in for the Q-function and deliver token-level supervision signals directly for policy optimization. Consequently, integrating Q-RM with PPO simplifies advantage estimation and reduces the reliance on generalized advantage estimation (GAE).
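
As a hedged illustration of that last point, the sketch below plugs per-token credits from a token-level reward model directly into a clipped PPO-style policy loss in place of GAE-based advantages. Tensor names and shapes are assumptions for illustration, not the repository's training code.

import torch

def ppo_policy_loss(logprobs, old_logprobs, token_credits, response_mask, clip_eps=0.2):
    # All tensors have shape (batch, seq_len); token_credits act as per-token advantages,
    # so no value network or GAE recursion is needed in this sketch.
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * token_credits
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * token_credits
    per_token = -torch.min(unclipped, clipped) * response_mask
    return per_token.sum() / response_mask.sum().clamp(min=1)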

Requirements

Library        Recommended version
python         >= 3.10
torch          >= 2.0.0
transformers   >= 4.51.0

Environment Setup

conda create -n qrm python=3.10
conda activate qrm

pip install -r requirements.txt

Data

Unzip the training data:

unzip data/gsm8k/train/dpo/results.zip -d data/gsm8k/train/dpo/
unzip data/gsm8k/train/pairwise/results.zip -d data/gsm8k/train/pairwise/
unzip data/gsm8k/train/ppo/results.zip -d data/gsm8k/train/ppo/
unzip data/gsm8k/train/sft/results.zip -d data/gsm8k/train/sft/

unzip data/math/train/dpo/results.zip -d data/math/train/dpo/
unzip data/math/train/pairwise/results.zip -d data/math/train/pairwise/
unzip data/math/train/ppo/results.zip -d data/math/train/ppo/
unzip data/math/train/sft/results.zip -d data/math/train/sft/

Reward Model Training

Before training starts, follow the instructions in the training scripts under the scripts directory to prepare your data and model checkpoints.

cd scripts

sh train-q-rm.sh

Reward Model Evaluation

If you want to evaluate your reward model, run the following command:

cd scripts

sh evaluate-q-rm.sh

This will produce two PDF files (chosen_1.pdf and rejected_1.pdf) that visualize the token-level reward assignment for the chosen and rejected trajectories, respectively.
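
For reference, a minimal, hypothetical matplotlib sketch of such a per-token visualization is given below; the repository's evaluation script likely differs in layout and styling.

import matplotlib.pyplot as plt

def plot_token_credits(tokens, credits, out_path="tokens.pdf"):
    # tokens: list of token strings; credits: list of floats of the same length.
    fig, ax = plt.subplots(figsize=(max(6, 0.4 * len(tokens)), 3))
    ax.bar(range(len(tokens)), credits)
    ax.set_xticks(range(len(tokens)))
    ax.set_xticklabels(tokens, rotation=90, fontsize=7)
    ax.set_ylabel("token credit")
    fig.tight_layout()
    fig.savefig(out_path)
    plt.close(fig)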

REINFORCE Training

cd scripts

sh train-reinforce-with-q-rm.sh

PPO Training

cd scripts

sh train-ppo-with-q-rm.sh

Citation

@misc{chen2025discriminativepolicyoptimizationtokenlevel,
      title={Discriminative Policy Optimization for Token-Level Reward Models}, 
      author={Hongzhan Chen and Tao Yang and Shiping Gao and Ruijun Chen and Xiaojun Quan and Hongtao Tian and Ting Yao},
      year={2025},
      eprint={2505.23363},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.23363}, 
}
