[Research Preview] Thinking Preference Optimization


  • authors: Wang Yang, Song Jiang, Hongye Jin, Xiaotian Han

TL;DR

  • We introduce a novel approach to enhance model reasoning by using long/short CoT as preferred/rejected examples in DPO.
  • Our method demonstrates that carefully curated long/short thinking data pairs can significantly boost mathematical reasoning capabilities.

This approach offers a practical and efficient way to improve a model’s ability to perform complex reasoning tasks. By pairing curated fast-thinking and slow-thinking data in DPO training, performance gains can be achieved even with a modest dataset size.

Overview

Developing models with robust reasoning capabilities remains a key challenge in AI. While supervised fine-tuning (SFT) on extensive slow-thinking datasets is effective, we discovered that a targeted approach using a smaller dataset (5,000 examples) contrasting fast and slow thinking can yield substantial improvements.

This blog presents our methodology, including dataset curation and training details, along with empirical results demonstrating the effectiveness of our approach.

Method

Data Curation

Our approach utilizes two distinct dataset types:

  1. SFT Data:
    • OpenO1-SFT:
      • Based on the O1-OPEN/OpenO1-SFT dataset, we remove special tokens (e.g., <Thought>) and reformat final answers into the \boxed{} format using GPT-4o-mini. We also filter out non-English entries and sequences longer than 8,192 tokens. The final dataset contains approximately 80,000 examples.
    • NuminaMath-CoT:
      • We randomly select examples from AI-MO/NuminaMath-CoT to match the size of the OpenO1-SFT dataset; this set is used for comparison.
    • Sky-T1_data_17k:
      • Based on NovaSky-AI/Sky-T1_data_17k, we remove special characters and reformat the final answers into QwQ’s output style.
  2. DPO Data:
    • Rejected Answers: We randomly select 5,000 examples from AI-MO/NuminaMath-CoT and treat the dataset’s original answers as the “rejected” entries.
    • Chosen Answers: New answers for each problem are generated with the Qwen/QwQ-32B-Preview model.
    • Filtering: We keep only entries where both the “chosen” and “rejected” answers reach the correct final answer, ensuring high-quality pairs (a minimal construction sketch follows this list).
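
The post does not include the exact pairing and checking code, so the following is a minimal sketch under stated assumptions: NuminaMath-CoT’s "problem"/"solution" columns, QwQ-32B-Preview generations precomputed offline into a (hypothetical) `qwq_generations.json` file, and a simple \boxed{} string match against the dataset solution as the correctness check.

```python
# Sketch: build chosen/rejected DPO pairs from NuminaMath-CoT + QwQ generations.
# Field names, file names, and the answer-matching rule are assumptions.
import json
import re

from datasets import load_dataset


def extract_boxed(text):
    """Return the last \\boxed{...} content in a solution, or None if absent."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None


numina = load_dataset("AI-MO/NuminaMath-CoT", split="train")
subset = numina.shuffle(seed=42).select(range(5000))

# Long, slow-thinking answers generated offline with Qwen/QwQ-32B-Preview
# (hypothetical file; the post does not describe its generation setup).
with open("qwq_generations.json") as f:
    qwq_answers = json.load(f)  # {problem_text: generated_answer}

pairs = []
for row in subset:
    rejected = row["solution"]                # original, shorter CoT answer
    chosen = qwq_answers.get(row["problem"])  # long CoT answer from QwQ
    if chosen is None:
        continue
    gold = extract_boxed(rejected)
    # Keep only pairs where both answers reach the same final boxed answer,
    # treating the NuminaMath solution as ground truth.
    if gold is not None and extract_boxed(chosen) == gold:
        pairs.append({"prompt": row["problem"], "chosen": chosen, "rejected": rejected})

# Write JSON Lines so the pairs can be loaded directly for DPO training.
with open("dpo_pairs.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")
```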

Training

  1. SFT Training:
    • Each SFT dataset is used to train a LLaMA-3.1-8B model for one epoch with a learning rate of 3e-5 and a batch size of 32.
  2. DPO Training:
    • Using the 5,000 curated examples, we perform DPO training on the SFT-trained models.
    • As a baseline, we also conduct DPO training directly on the LLaMA-3.1-8B-Instruct model.
    • All DPO training uses a beta of 0.01 and a batch size of 32 (a minimal training sketch follows this list).
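
The post does not name its training framework, so here is a minimal sketch of the DPO stage using Hugging Face TRL; the checkpoint path, the per-device batch size / gradient-accumulation split, and the single training epoch are assumptions, while beta, learning rate, and the effective batch size of 32 follow the settings above.

```python
# Sketch: DPO stage on an SFT-trained LLaMA-3.1-8B checkpoint with TRL.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "path/to/llama-3.1-8b-sft-checkpoint"  # SFT model from step 1 (placeholder path)
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Expects "prompt", "chosen", "rejected" columns, as built in the curation sketch.
train_dataset = load_dataset("json", data_files="dpo_pairs.jsonl", split="train")

config = DPOConfig(
    output_dir="llama-3.1-8b-dpo",
    beta=0.01,                      # DPO beta from the post
    learning_rate=3e-7,             # one of the two rates swept in the post
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,  # 4 x 8 = effective batch size 32
    num_train_epochs=1,             # epoch count not stated in the post
    bf16=True,
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,     # `tokenizer=` in older trl releases
)
trainer.train()
```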

Results

  • We report test-set results for models first trained with SFT on each dataset and then further trained with DPO.

Performance on AIME, MATH500, and MATH Overall

| SFT Dataset/Model Family | Variant | AIME | MATH500 | MATH (Overall) |
|---|---|---|---|---|
| LLaMA 8B-Instruct | base model | 0.02 | 0.288 | 0.33 |
| | dpo (lr=3e-7) | 0.04 | 0.286 | 0.34 |
| NuminaMath-CoT | sft (lr=3e-5) | 0.0 | 0.162 | 0.25 |
| | dpo (lr=1e-7) | 0.0 | 0.176 | 0.26 |
| OpenO1-SFT | sft (lr=3e-5) | 0.0 | 0.230 | 0.33 |
| | dpo (lr=1e-7) | 0.01 | 0.234 | 0.35 |
| | dpo (lr=3e-7) | 0.01 | 0.284 | 0.40 |
| Sky-T1_data_17k | sft (lr=3e-5) | 0.01 | 0.236 | 0.33 |
| | dpo (lr=1e-7) | 0.0 | 0.252 | 0.36 |

  • Key Observations:
    • DPO training consistently improves overall performance across all datasets.
    • OpenO1-SFT shows the most significant gains (a 23.48% relative improvement on MATH500, from 0.230 to 0.284).
    • Improvements are maintained across different difficulty levels.

Performance on MATH across different levels

  • To further probe mathematical reasoning, we sample 100 problems from each difficulty level of the MATH dataset and evaluate every model on each level separately (a sampling sketch follows the table).

| SFT Dataset/Model Family | Variant | Level 1 | Level 2 | Level 3 | Level 4 | Level 5 |
|---|---|---|---|---|---|---|
| LLaMA 8B-Instruct | base model | 0.59 | 0.39 | 0.37 | 0.20 | 0.10 |
| | dpo (lr=3e-7) | 0.66 | 0.33 | 0.36 | 0.25 | 0.13 |
| NuminaMath-CoT | sft (lr=3e-5) | 0.45 | 0.31 | 0.19 | 0.09 | 0.05 |
| | dpo (lr=1e-7) | 0.51 | 0.27 | 0.24 | 0.11 | 0.01 |
| OpenO1-SFT | sft (lr=3e-5) | 0.58 | 0.40 | 0.31 | 0.20 | 0.10 |
| | dpo (lr=1e-7) | 0.67 | 0.38 | 0.33 | 0.20 | 0.09 |
| | dpo (lr=3e-7) | 0.73 | 0.45 | 0.40 | 0.28 | 0.09 |
| Sky-T1_data_17k | sft (lr=3e-5) | 0.61 | 0.36 | 0.33 | 0.22 | 0.08 |
| | dpo (lr=1e-7) | 0.68 | 0.40 | 0.26 | 0.22 | 0.14 |
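
Our per-level evaluation code is not included in this post; the sketch below shows how the 100-problem subsets could be drawn, assuming a Hugging Face copy of the Hendrycks MATH test split with "problem", "solution", and "level" fields and the same \boxed{} answer matching as before.

```python
# Sketch: stratified per-level MATH evaluation subsets (dataset path and
# field names are assumptions).
import random
import re
from collections import defaultdict

from datasets import load_dataset


def extract_boxed(text):
    """Return the last \\boxed{...} content in a solution, or None if absent."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None


math_test = load_dataset("hendrycks/competition_math", split="test")

by_level = defaultdict(list)
for row in math_test:
    by_level[row["level"]].append(row)  # levels look like "Level 1" ... "Level 5"

rng = random.Random(0)
eval_sets = {
    level: rng.sample(rows, min(100, len(rows)))  # 100 problems per level
    for level, rows in by_level.items()
}


def level_accuracy(predictions, rows):
    """Fraction of problems whose predicted boxed answer matches the reference."""
    correct = sum(
        extract_boxed(pred) == extract_boxed(row["solution"])
        for pred, row in zip(predictions, rows)
    )
    return correct / len(rows)
```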

Notes

The Kimi 1.5 model explored a similar approach but focused on using short CoT as the preferred examples to improve inference efficiency.