[ICML 2025] Beyond Bradley-Terry Models: A General Preference Model for Language Model Alignment

Introducing preference embedding, a novel approach for capturing complex, intransitive human preferences to align language models with human values.

Yifan Zhang, Ge Zhang, Yue Wu, Kangping Xu, Quanquan Gu

Abstract

Modeling human preferences is crucial for aligning foundation models with human values. Traditional reward modeling methods, such as the Bradley-Terry (BT) reward model, fall short in expressiveness, particularly in addressing intransitive preferences. In this paper, we introduce preference embedding, an approach that embeds responses into a latent space to capture intricate preference structures efficiently, achieving linear query complexity. Additionally, we propose preference score-based General Preference Optimization (GPO), which generalizes reward-based reinforcement learning from human feedback (RLHF).

Our General Preference embedding Model (GPM) consistently outperforms the BT reward model on the RewardBench benchmark and effectively models cyclic preferences. Evaluations on downstream tasks, following language model post-training with GPO, reveal performance improvements over BT models. These findings indicate that our method may enhance the alignment of foundation models with nuanced human values.

The General Preference Model (GPM)

GPM bridges the gap between the efficiency of Bradley-Terry models and the expressiveness of pairwise comparison models.


Figure 1: (a) The BT model assigns a scalar reward to each response. (b) Pairwise models (PairRM/PairPM) evaluate every pair, leading to $\mathcal{O}(K^2)$ complexity. (c) Our GPM embeds each response once and computes preference scores via vector interactions, achieving $\mathcal{O}(K)$ complexity.

The core idea is to represent each response $y$ for a prompt $x$ as a multi-dimensional preference embedding vector $v_{y|x} \in \mathbb{R}^{2k}$. The preference score between two responses, $y_i$ and $y_j$, is then calculated using a skew-symmetric operator $R^{>}$:

$s(y_i > y_j | x) = \langle R^{>} v_{y_i|x}, v_{y_j|x} \rangle$

This formulation allows GPM to capture complex relationships, including intransitive (e.g., cyclic) preferences, which cannot be represented by simple scalar rewards. With a sufficiently large embedding dimension, the model is fully expressive: it can realize any real skew-symmetric preference score matrix.
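As a concrete illustration, the minimal NumPy sketch below (function names and the unscaled unit blocks are illustrative simplifications, not the released implementation) builds a block-diagonal skew-symmetric operator from $k$ copies of $[[0, -1], [1, 0]]$, scores a pair of embeddings via $\langle R^{>} v_{y_i|x}, v_{y_j|x} \rangle$, and checks the skew-symmetry $s(y_i > y_j | x) = -s(y_j > y_i | x)$; applying a sigmoid to the score then gives a preference probability.

```python
import numpy as np

def skew_block_operator(k: int) -> np.ndarray:
    """Block-diagonal skew-symmetric operator built from k copies of [[0, -1], [1, 0]]."""
    R = np.zeros((2 * k, 2 * k))
    for i in range(k):
        R[2 * i, 2 * i + 1] = -1.0
        R[2 * i + 1, 2 * i] = 1.0
    return R

def preference_score(v_i: np.ndarray, v_j: np.ndarray, R: np.ndarray) -> float:
    """s(y_i > y_j | x) = <R v_i, v_j>."""
    return float((R @ v_i) @ v_j)

k = 3                                   # number of 2-D blocks, so embedding dim = 2k
R = skew_block_operator(k)
rng = np.random.default_rng(0)
v_a, v_b = rng.normal(size=2 * k), rng.normal(size=2 * k)   # stand-ins for learned embeddings

s_ab = preference_score(v_a, v_b, R)
s_ba = preference_score(v_b, v_a, R)
assert np.isclose(s_ab, -s_ba)          # skew-symmetry: s(a > b) = -s(b > a)
p_ab = 1.0 / (1.0 + np.exp(-s_ab))      # sigmoid of the score as a preference probability
print(f"s(a > b) = {s_ab:+.3f}, P(a > b) = {p_ab:.3f}")
```

Because each response is embedded once and any pair can be scored from the stored vectors, comparing $K$ candidate responses costs $K$ embedding passes plus cheap vector products, rather than $\mathcal{O}(K^2)$ forward passes through a pairwise model.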

Experimental Results

Modeling Cyclic Preferences

We tested GPM's ability to model intransitive preferences. On the Cyclic Preference datasets we constructed for this purpose, GPM achieves near-perfect accuracy, while the traditional Bradley-Terry (BT) model performs close to random guessing, highlighting the fundamental limitation of scalar rewards; a small numerical sketch of such a cycle follows the table below.

| Model | Dataset | Accuracy (%) |
|---|---|---|
| Random Guess | - | 50.0 |
| BT RM | Cyclic No. 1 | 62.4 |
| GPM | Cyclic No. 1 | 100.0 (+37.6) |
| BT RM | Cyclic No. 3 | 50.0 |
| GPM | Cyclic No. 3 | 100.0 (+50.0) |
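The cyclic case can also be checked by hand. The sketch below is illustrative (not the evaluation code behind the table): three responses embedded $120^{\circ}$ apart on the unit circle, scored with a single 2-D skew-symmetric block, yield strictly positive scores for A > B, B > C, and C > A simultaneously, whereas a Bradley-Terry model would need scalar rewards satisfying $r_A > r_B > r_C > r_A$, which is impossible.

```python
import numpy as np

R = np.array([[0.0, -1.0],
              [1.0,  0.0]])             # single 2-D skew-symmetric block

def score(v_i: np.ndarray, v_j: np.ndarray) -> float:
    """s(i > j) = <R v_i, v_j>."""
    return float((R @ v_i) @ v_j)

# Place three responses 120 degrees apart on the unit circle.
angles = {"A": 0.0, "B": 2 * np.pi / 3, "C": 4 * np.pi / 3}
emb = {name: np.array([np.cos(a), np.sin(a)]) for name, a in angles.items()}

for winner, loser in [("A", "B"), ("B", "C"), ("C", "A")]:
    print(f"s({winner} > {loser}) = {score(emb[winner], emb[loser]):+.3f}")  # all positive: a cycle

# A Bradley-Terry model would need scalar rewards with
# r_A > r_B, r_B > r_C, and r_C > r_A, which no assignment can satisfy.
```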

RewardBench Performance

GPM outperforms the BT reward model in average score on the RewardBench benchmark across different base models, with the largest gains in the more nuanced Chat and Chat-Hard categories.

| Base Model | Model | Chat | Chat-Hard | Safety | Reasoning | Average |
|---|---|---|---|---|---|---|
| Gemma-2B-it | BT RM | 67.32 | 63.37 | 85.68 | 83.04 | 74.85 |
| Gemma-2B-it | GPM (d=6) | 79.61 | 75.66 | 85.27 | 88.61 | 82.29 (+7.44) |
| Llama-3.1-8B-Instruct | BT RM | 88.55 | 85.75 | 91.49 | 96.47 | 90.56 |
| Llama-3.1-8B-Instruct | GPM (d=8) | 93.58 | 87.50 | 91.08 | 95.44 | 91.90 (+1.34) |

General Preference Optimization (GPO)

To align language models using our GPM, we propose General Preference Optimization (GPO), an iterative algorithm that uses preference scores from the GPM to directly optimize the language model's policy. The objective is to maximize the expected preference score against an opponent policy $\pi'$, regularized by a KL divergence toward a reference policy $\pi_{\text{ref}}$:

$ \max_{\theta} \mathbb{E}_{x \sim \mathcal{X}, y \sim \pi_{\theta}, y' \sim \pi'} [s(y > y' | x)] - \beta \mathbb{E}_{x \sim \mathcal{X}}[\text{KL}(\pi_{\theta}(\cdot|x) || \pi_{\text{ref}}(\cdot|x))] $
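As a rough sketch of this objective (a simplified REINFORCE-style surrogate, not the exact iterative update described in the paper; all tensor names are hypothetical placeholders), one can treat GPM preference scores as rewards for responses sampled from $\pi_{\theta}$ and penalize a sample-based estimate of the KL term:

```python
import torch

def gpo_surrogate(logp_theta: torch.Tensor,
                  logp_ref: torch.Tensor,
                  pref_scores: torch.Tensor,
                  beta: float = 0.1) -> torch.Tensor:
    """
    REINFORCE-style surrogate of the GPO objective for one batch of prompts.

    logp_theta:  log pi_theta(y | x) for responses y sampled from pi_theta        [B]
    logp_ref:    log pi_ref(y | x) for the same responses                         [B]
    pref_scores: s(y > y' | x) from the GPM, with y' drawn from the opponent pi'  [B]
    """
    advantage = pref_scores - pref_scores.mean()          # mean baseline for variance reduction
    surrogate = (advantage.detach() * logp_theta).mean()  # gradient flows through log pi_theta
    kl_estimate = (logp_theta - logp_ref).mean()          # sample-based KL(pi_theta || pi_ref)
    return surrogate - beta * kl_estimate                 # maximize (or minimize the negation)
```

In each GPO iteration, responses would be re-sampled, re-scored by the GPM against the current opponent policy, and the policy updated by gradient ascent on this quantity.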

Evaluations on AlpacaEval 2.0 show that models aligned with GPO consistently achieve higher win rates compared to those aligned with traditional methods, demonstrating the practical benefits of our more expressive preference model.

Citation

If you find our work useful, please cite our paper:

@inproceedings{zhang2025beyond,
  title={Beyond Bradley-Terry Models: A General Preference Model for Language Model Alignment},
  author={Zhang, Yifan and Zhang, Ge and Wu, Yue and Xu, Kangping and Gu, Quanquan},
  booktitle={Proceedings of the 42nd International Conference on Machine Learning},
  year={2025},
  publisher={PMLR}
}