The key point is to understand the figures below.
*(figure: the model samples G outputs per input; image omitted)*
Notation:

- `G`: number of outputs in each group for each input
- `O_i`: the i-th output in the current group
- `t`: index of tokens within `O_i`
- `q`: the input (query)
- `O_{i,t}`: the t-th token of the i-th output
- `π`: the model's parameters (the policy)
*(figure: the GRPO objective; image omitted)*
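Since the figure may not render, the GRPO objective it shows can be written out with the notation above (a transcription of the standard GRPO formula, matching the TRL parameters `beta` and `epsilon` below):

```latex
\mathcal{J}_{\mathrm{GRPO}}(\theta)
  = \mathbb{E}\left[
    \frac{1}{G}\sum_{i=1}^{G}\frac{1}{|O_i|}\sum_{t=1}^{|O_i|}
      \min\Big(
        r_{i,t}(\theta)\,\hat{A}_{i,t},\;
        \operatorname{clip}\big(r_{i,t}(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\,\hat{A}_{i,t}
      \Big)
    \;-\; \beta\,\mathbb{D}_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big)
  \right]
\qquad
r_{i,t}(\theta) = \frac{\pi_\theta(O_{i,t}\mid q,\,O_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(O_{i,t}\mid q,\,O_{i,<t})}
```

Here `ε` is the clip range (`epsilon` / `epsilon_high` below), `β` weights the KL penalty against the reference model, and `Â_{i,t}` is the group-relative advantage of output `O_i`.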
Parameter names in huggingface `trl`:
| Name | Meaning |
| --- | --- |
| `beta` | Weight of the KL penalty between the current model and the reference model; increase it to avoid over-fitting. |
| `num_iterations` | Number of iterations per batch — the "GRPO iterations" in the Algorithm 1 figure; similar in effect to the learning rate. |
| `epsilon` | Clip range, used for both the lower bound and the upper bound. |
| `epsilon_high` | Replaces `epsilon` for the clip upper bound when set. |
| `sync_ref_model` | bool. Whether to synchronize the reference model with the active model every `ref_model_sync_steps` steps, using the `ref_model_mixup_alpha` parameter. |
| `ref_model_mixup_alpha` | float, default 0.6. π_ref = α · π_θ + (1 − α) · π_ref_prev. |
| `ref_model_sync_steps` | int, default 512. To use this parameter, you must set `sync_ref_model=True`. |

Q: How to cold start?
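The reference-model sync described by `sync_ref_model`, `ref_model_mixup_alpha`, and `ref_model_sync_steps` can be sketched in plain Python. This is a toy illustration operating on flat lists of parameter values (the hypothetical `maybe_sync_reference` helper is mine, not TRL's API; TRL applies the same mixup to actual model weights):

```python
def maybe_sync_reference(step, policy, ref, *, alpha=0.6, sync_steps=512):
    """Every `sync_steps` optimizer steps, mix the policy weights into the
    reference model: pi_ref = alpha * pi_theta + (1 - alpha) * pi_ref_prev.
    `policy` and `ref` are flat lists of parameter values in this sketch."""
    if step > 0 and step % sync_steps == 0:
        return [alpha * p + (1 - alpha) * r for p, r in zip(policy, ref)]
    return ref  # no sync this step; reference stays frozen

# usage: at step 512 the reference drifts toward the policy
ref = maybe_sync_reference(512, policy=[1.0, 1.0], ref=[0.0, 1.0])
print(ref)  # -> [0.6, 1.0]
```

Keeping `alpha < 1` means the reference trails the policy slowly, so the KL penalty still anchors training while allowing the anchor itself to move.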
A: At the first step we already know the advantage of each output, and these advantages drive the parameter update toward increasing the objective value as much as possible.
Q: How to simplify the scale-up/scale-down (clipped ratio) term in the objective function?