Beyond the Trade-off: Self-Supervised Reinforcement Learning for Reasoning Models' Instruction Following

6 August 2025

Daily Paper Cast

Duration: 18:39

🤗 Upvotes: 25 | cs.AI

Authors:
Qingyu Ren, Qianyu He, Bowei Zhang, Jie Zeng, Jiaqing Liang, Yanghua Xiao, Weikang Zhou, Zeye Sun, Fei Yu

Title:
Beyond the Trade-off: Self-Supervised Reinforcement Learning for Reasoning Models' Instruction Following

Arxiv:
http://arxiv.org/abs/2508.02150v1

Abstract:
Reasoning models excel in complex problem solving but exhibit a concerning trade-off between reasoning capabilities and instruction following abilities. Existing approaches for improving instruction following rely on stronger external models, creating methodological bottlenecks and practical limitations including increased costs and accessibility constraints. We propose a self-supervised RL framework that leverages reasoning models' own internal signals to improve instruction following capabilities without external supervision. Extensive experiments demonstrate that our framework significantly improves instruction following capabilities while maintaining reasoning performance, offering a scalable and cost-effective approach to enhance instruction following in reasoning models. The data and code are publicly available at https://github.com/Rainier-rq/verl-if.

