
Main idea

The key point is to understand the pictures below.

Iteration steps

[Figure: GRPO Iteration]

Objective function

[Figure: Objective function]
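
For reference, the objective can be written out as a formula. This is a sketch that assumes the objective meant here is the one from the DeepSeekMath formulation of GRPO. The symbols ($G$ for group size, $\varepsilon$ for the clip range, $\beta$ for the KL weight, $\hat{A}_{i,t}$ for the advantage) follow that paper and are not taken from this page.

```latex
\mathcal{J}_{GRPO}(\theta) =
\mathbb{E}_{q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O \mid q)}
\left[
  \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|}
  \left(
    \min\!\left(
      r_{i,t}(\theta)\, \hat{A}_{i,t},\
      \operatorname{clip}\!\left(r_{i,t}(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right) \hat{A}_{i,t}
    \right)
    - \beta\, \mathbb{D}_{KL}\!\left[\pi_\theta \,\|\, \pi_{ref}\right]
  \right)
\right],
\qquad
r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t} \mid q, o_{i,<t})}
```

The ratio $r_{i,t}(\theta)$ scales each token's advantage up or down, and the clip keeps that scaling within $[1-\varepsilon,\, 1+\varepsilon]$.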

KL value

[Figure: KL value]
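
A minimal sketch of the KL term, assuming it is the per-token estimator from the DeepSeekMath formulation of GRPO, $\frac{\pi_{ref}}{\pi_\theta} - \log\frac{\pi_{ref}}{\pi_\theta} - 1$. The function name `kl_estimate` and its arguments are hypothetical and chosen only for illustration.

```python
import math


def kl_estimate(logp_theta: float, logp_ref: float) -> float:
    """Per-token KL estimate r - log(r) - 1, where r = pi_ref / pi_theta.

    Both arguments are log-probabilities of the same token under the
    current policy (logp_theta) and the reference policy (logp_ref).
    """
    log_ratio = logp_ref - logp_theta  # log(pi_ref / pi_theta)
    return math.exp(log_ratio) - log_ratio - 1.0
```

The estimate is never negative, and it is zero exactly when both policies assign the token the same probability.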

Hyper parameters

Name in huggingface-trl

FAQ

Q: How to cold start?

A: In the first step, the advantage of each output is already known, so the parameter update can push the objective value as high as possible.
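
As a hedged illustration of why the advantages are available from the very first step: assuming the group-relative definition from the GRPO paper, each output's advantage is its reward normalized by the mean and standard deviation of its own group, which needs no learned value model. The helper `group_advantages`, the `eps` guard, and the use of the population standard deviation are assumptions made for this sketch.

```python
import statistics


def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against its own group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    # eps avoids division by zero when all rewards in the group are equal
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled outputs for one question, rewarded 1 if correct and 0 otherwise.
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
```

Outputs above the group mean get a positive advantage and those below get a negative one. If every output in a group receives the same reward, all advantages are zero and that group contributes no learning signal.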

Q: How can the zoom up/down in the objective function be simplified?