Source in repository, original on site

Main idea

The key point is to understand the pictures below.

Figure: GRPO Iteration

For each input, generate G outputs.
For each output, compute the log-probability of every token under the current, old, and reference models.
Compute the objective value and use it (negated, since the objective is maximized) as the loss.
Update the old model at every step.
Update the reference model at every epoch.
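
The update schedule above can be sketched as a nested loop. This is a minimal sketch that only records the order of events and does no real training; `grpo_schedule` and the event names are hypothetical, and "epoch" is assumed to mean the outer loop in which the reference model is refreshed.

```python
def grpo_schedule(num_epochs, steps_per_epoch, num_iterations):
    """Return the ordered events of a GRPO-style loop (no real training happens)."""
    events = []
    for _ in range(num_epochs):
        # Reference model is refreshed once per epoch: pi_ref <- pi_theta
        events.append("sync_reference")
        for _ in range(steps_per_epoch):
            # Old model is refreshed at every step: pi_old <- pi_theta
            events.append("sync_old")
            # Sample G outputs per input and score tokens under all three models
            events.append("sample_and_score")
            for _ in range(num_iterations):
                # One gradient step on the current model using the objective
                events.append("update_current")
    return events


events = grpo_schedule(num_epochs=2, steps_per_epoch=3, num_iterations=2)
print(events.count("sync_reference"),
      events.count("sync_old"),
      events.count("update_current"))
```

Only the current model receives gradient updates; the old and reference models are snapshots refreshed on their own schedules.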

Figure: Objective function
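
A per-token version of this objective can be sketched as follows, assuming the usual PPO-style clipped ratio with a weighted KL penalty subtracted. The function names, argument names, and default values (`epsilon=0.2`, `beta=0.04`) are illustrative assumptions, not values read from the picture; `kl` stands for the per-token KL value between the current and reference models.

```python
import math


def token_objective(logp_new, logp_old, advantage, kl=0.0,
                    epsilon=0.2, epsilon_high=None, beta=0.04):
    """Clipped surrogate for one token minus the weighted KL penalty."""
    # Probability ratio between the current and the old model
    ratio = math.exp(logp_new - logp_old)
    lower = 1.0 - epsilon
    # epsilon_high, when given, replaces epsilon for the upper bound only
    upper = 1.0 + (epsilon if epsilon_high is None else epsilon_high)
    clipped = min(max(ratio, lower), upper)
    # Take the more pessimistic of the unclipped and clipped terms
    surrogate = min(ratio * advantage, clipped * advantage)
    return surrogate - beta * kl


def sequence_objective(token_values):
    """Average the per-token objective over one output (to be maximized)."""
    return sum(token_values) / len(token_values)
```

With a positive advantage the gain is capped once the ratio exceeds the upper bound; with a negative advantage the `min` keeps the unclipped, more negative term.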

Figure: KL value
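
A common per-token form of this KL value is the non-negative estimator `ratio - log(ratio) - 1` with `ratio = π_ref / π_θ`. The sketch below assumes the picture uses that form; `kl_estimate` and its argument names are hypothetical.

```python
import math


def kl_estimate(logp_current, logp_ref):
    """Per-token KL estimate between the current and the reference model."""
    # log(pi_ref / pi_theta) for the sampled token
    log_ratio = logp_ref - logp_current
    # exp(x) - x - 1 is zero at x = 0 and positive everywhere else
    return math.exp(log_ratio) - log_ratio - 1.0


print(kl_estimate(-1.0, -1.0))        # identical models give 0.0
print(kl_estimate(-1.0, -2.0) > 0.0)  # any disagreement is penalized
```

Because the estimate is zero only when the two models assign the same probability, a larger `beta` pulls the current model harder toward the reference model.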

beta: weight of the KL value between the current model and the reference model; increase it to keep the policy closer to the reference model and avoid overfitting.
num_iterations: number of iterations per batch, i.e. the number of GRPO iterations in the Algorithm 1 picture; its effect is similar to that of the learning rate.
epsilon: clipping range, used for both the lower bound and the upper bound.
epsilon_high: when set, replaces epsilon for the upper clipping bound.
sync_ref_model: bool; whether to synchronize the reference model with the active model every ref_model_sync_steps steps, using the ref_model_mixup_alpha parameter.
ref_model_mixup_alpha: float, default 0.6; π_ref = α * π_θ + (1 - α) * π_ref_prev.
ref_model_sync_steps: int, default 512; to use this parameter, you must set sync_ref_model=True.
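
The reference-model synchronization controlled by `sync_ref_model`, `ref_model_mixup_alpha`, and `ref_model_sync_steps` can be sketched on plain parameter dictionaries. This is a minimal sketch, not a library implementation: `mix_reference` and `maybe_sync` are hypothetical helpers, parameters are modelled as floats, and syncing exactly when the step count is a positive multiple of `ref_model_sync_steps` is an assumption.

```python
def mix_reference(current, reference, alpha=0.6):
    """Return alpha * current + (1 - alpha) * reference for every parameter."""
    return {name: alpha * current[name] + (1.0 - alpha) * reference[name]
            for name in reference}


def maybe_sync(step, current, reference, sync_ref_model=True,
               ref_model_sync_steps=512, ref_model_mixup_alpha=0.6):
    """Mix the reference toward the current model every ref_model_sync_steps steps."""
    if sync_ref_model and step > 0 and step % ref_model_sync_steps == 0:
        return mix_reference(current, reference, ref_model_mixup_alpha)
    # Otherwise the reference model is left untouched
    return reference


current = {"w": 1.0}
reference = {"w": 0.0}
print(maybe_sync(512, current, reference))  # "w" moves 60% of the way to the current value
```

With `alpha = 1` the reference model would simply be replaced by the current model; smaller values make it drift more slowly.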

Q: How to cold start?

A: In the first step, the advantage of each output is already known, and it drives the parameter update toward making the objective value as large as possible.
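
To make the cold-start argument concrete: the advantages depend only on the rewards inside each group, so they are available before any update. The sketch below assumes the common group normalization (reward minus group mean, divided by group standard deviation); the use of the population standard deviation, the small `eps`, and the function names are assumptions. It also checks numerically that when the current and old models coincide (ratio = 1), the slope of the surrogate with respect to a token's log-probability equals the advantage.

```python
import math
import statistics


def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of outputs for the same input."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def slope_at_first_step(advantage, logp=-1.0, h=1e-6):
    """Finite-difference slope of ratio * advantage at ratio = 1 (clipping inactive)."""
    def surrogate(logp_new):
        return math.exp(logp_new - logp) * advantage
    return (surrogate(logp + h) - surrogate(logp - h)) / (2.0 * h)


advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
slopes = [slope_at_first_step(a) for a in advantages]
```

Outputs with above-average reward get a positive slope, so their token probabilities are pushed up, while below-average outputs are pushed down. If every output in a group has the same reward, all advantages are zero and that group contributes no update.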

Q: How to simplify the zoom up/down in the objective function?