Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint
CoRR (2024)
Abstract
Reinforcement learning (RL) has been widely used in training large language
models (LLMs) to prevent unexpected outputs, e.g., to reduce harmfulness and
errors. However, existing RL methods mostly adopt instance-level rewards,
which cannot provide fine-grained supervision for complex reasoning tasks or
focus on the few key tokens that lead to incorrectness. To address this, we
propose a new RL method named RLMEC, which incorporates a generative model as
the reward model. The reward model is trained on an erroneous-solution
rewriting task under a minimum editing constraint, and can produce token-level
rewards for RL training. Based on the generative reward model, we design a
token-level RL objective for training and an imitation-based regularization
for stabilizing the RL process. Both objectives focus on learning the key
tokens of an erroneous solution, reducing the effect of other unimportant
tokens. Experimental results on mathematical tasks and question-answering
tasks demonstrate the effectiveness of our approach. Our code and data are
available at .
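The abstract gives no implementation details, so the following is only a minimal PyTorch sketch of the core idea it describes: a token-level RL objective in which per-token rewards weight each token's log-probability, so updates concentrate on the few key tokens rather than treating the whole sequence uniformly, plus an imitation-style regularizer toward a corrected solution. The function names, tensor shapes, and the form of `token_rewards` (assumed to come from a generative reward model) are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def token_level_rl_loss(logits, target_ids, token_rewards, pad_id=0):
    """Reward-weighted negative log-likelihood: each token's NLL is
    scaled by its own reward, so only tokens the reward model marks
    as important drive the gradient update.

    logits:        (batch, seq_len, vocab) policy outputs
    target_ids:    (batch, seq_len) sampled token ids
    token_rewards: (batch, seq_len) per-token rewards (assumption:
                   produced by a generative reward model)
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # Log-probability of each realized token.
    tok_logp = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    mask = (target_ids != pad_id).float()
    # REINFORCE-style objective: maximize reward-weighted log-likelihood.
    return -(token_rewards * tok_logp * mask).sum() / mask.sum().clamp(min=1.0)

def imitation_regularizer(logits, rewritten_ids, pad_id=0):
    """Sketch of an imitation-based regularizer: cross-entropy toward a
    corrected (rewritten) solution, pulling the policy back toward a
    known-good sequence to stabilize RL training."""
    return F.cross_entropy(
        logits.flatten(0, 1), rewritten_ids.flatten(), ignore_index=pad_id,
    )
```

In a training loop one would combine the two terms, e.g. `loss = token_level_rl_loss(...) + beta * imitation_regularizer(...)`; the mixing coefficient `beta` is a hypothetical parameter here, since the abstract only states that both objectives focus on the key tokens of an erroneous solution.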