The key point is to understand the pictures below.
$G$ outputs
- $G$ is the number of outputs in each group, for each input
- $O_i$ is the $i$-th output in the current group
- $t$ is the index of tokens in $O_i$
- $q$ is the input
- $O_{i,t}$ is the $t$-th token of the $i$-th output
- $\pi$ is the model (the policy, together with its parameters)
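For reference, the symbols above can be put together into the GRPO objective as it is commonly written. This is a sketch from the standard formulation, not a transcription of the pictures: $\theta$ (model parameters), $\pi_{\theta_{old}}$ (the policy that sampled the group), $\pi_{ref}$ (the reference model), $R_i$ (the reward of output $O_i$), and $\hat{A}_{i,t}$ (the advantage) are names assumed here.

$$
\mathcal{J}_{GRPO}(\theta) = \mathbb{E}_{q,\ \{O_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(\cdot \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|O_i|} \sum_{t=1}^{|O_i|} \Big( \min\big( r_{i,t}(\theta)\, \hat{A}_{i,t},\ \operatorname{clip}( r_{i,t}(\theta),\, 1-\varepsilon,\, 1+\varepsilon )\, \hat{A}_{i,t} \big) - \beta\, \mathbb{D}_{KL}\left[ \pi_\theta \,\|\, \pi_{ref} \right] \Big) \right]
$$

$$
r_{i,t}(\theta) = \frac{\pi_\theta(O_{i,t} \mid q, O_{i,<t})}{\pi_{\theta_{old}}(O_{i,t} \mid q, O_{i,<t})},
\qquad
\hat{A}_{i,t} = \frac{R_i - \operatorname{mean}(R_1, \dots, R_G)}{\operatorname{std}(R_1, \dots, R_G)}
$$

With a single reward per output, every token of $O_i$ shares the same advantage, which is why the index $t$ only matters inside the ratio $r_{i,t}(\theta)$.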
Names in huggingface-trl:

- `beta`: weight of the KL term between the current model and the reference model; increase it to avoid over-fitting
- `num_iterations`: number of iterations per batch, i.e. the number of GRPO iterations $\mu$ in the Algorithm 1 picture; similar to LR
- `epsilon`: $\varepsilon$, used for both the clip lower bound and the clip upper bound
- `epsilon_high`: replaces `epsilon` for the clip upper bound when it is set

Q: How to cold start?
A: In the first step we already know the advantage of each output, and these advantages push the parameter update in the direction that makes the objective value as large as possible.
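The cold-start answer can be checked numerically. The sketch below is an illustration, not the huggingface-trl implementation: the function names are hypothetical, the advantage uses the population standard deviation plus a small constant (an assumption), and the KL term is left out. It computes group-relative advantages, applies the clipped term with `epsilon` and an optional `epsilon_high`, and estimates the slope of the objective at the first step, where the new and old log-probabilities are equal and the ratio is 1.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Group-relative advantage: (r_i - mean) / (std + eps) within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(ratio, advantage, epsilon=0.2, epsilon_high=None):
    """Per-token term: min(ratio * A, clip(ratio, lower, upper) * A)."""
    lower = 1.0 - epsilon
    # epsilon_high, when given, replaces epsilon for the upper bound only.
    upper = 1.0 + (epsilon if epsilon_high is None else epsilon_high)
    clipped = min(max(ratio, lower), upper)
    return min(ratio * advantage, clipped * advantage)


def surrogate(logp_new, logp_old, advantage, **clip_kwargs):
    """Clipped term as a function of the token log-probabilities."""
    ratio = math.exp(logp_new - logp_old)
    return clipped_term(ratio, advantage, **clip_kwargs)


def slope_at_first_step(advantage, logp_old=-1.0, h=1e-6):
    """Central finite difference w.r.t. the new log-prob, taken at ratio == 1."""
    up = surrogate(logp_old + h, logp_old, advantage)
    down = surrogate(logp_old - h, logp_old, advantage)
    return (up - down) / (2 * h)


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 1.0]  # made-up rewards for one group of G = 4 outputs
    for r, a in zip(rewards, group_advantages(rewards)):
        print(f"reward={r} advantage={a:.4f} slope={slope_at_first_step(a):.4f}")
```

At ratio 1 the slope equals the advantage itself, so outputs that scored above the group mean get their token probabilities pushed up and the others pushed down, even though the current and old models are still identical.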
Q: How to simplify the zoom up/down in the objective function?