title: "Key Points from the GPT Fine-tuning Guide"
categories:
Mainly based on Fine-tuning - OpenAI API; supplementary material is collected at the end.
This can probably double as a fine-tuning guide for other LLMs as well.
<!-- more -->
Before deciding to fine-tune, it is recommended to try the alternatives first: prompt engineering, prompt chaining (breaking complex tasks into multiple prompts), and function calling.
For prompt engineering specifically, see GPT best practices.
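As a concrete illustration of prompt chaining, here is a minimal sketch using the openai Python package (v1-style client); the model choice, file name, and prompts are placeholders, not recommendations:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt: str) -> str:
    """One chat-completion call; returns the assistant's text."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

article = open("article.txt").read()  # placeholder input document

# Chain step 1: extract intermediate structure from the document.
facts = ask(f"List the key facts in this article, one per line:\n\n{article}")

# Chain step 2: a second, simpler prompt consumes the intermediate result.
summary = ask(f"Write a three-sentence summary based only on these facts:\n\n{facts}")
print(summary)
```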
Besides improving output quality, another use case for fine-tuning is cutting cost (replacing GPT-4 with a fine-tuned cheaper model) or cutting latency (each call needs a much shorter prompt).
If you can achieve good results with GPT-4, you can often reach similar quality with a fine-tuned `gpt-3.5-turbo` model by fine-tuning on the GPT-4 completions, possibly with a shortened instruction prompt.
You should create a diverse set of demonstration conversations that are similar to the conversations you will ask the model to respond to at inference time in production.
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}
We generally recommend taking the set of instructions and prompts that you found worked best for the model prior to fine-tuning, and including them in every training example.
If you would like to shorten the instructions or prompts that are repeated in every example to save costs, keep in mind that the model will likely behave as if those instructions were included, and it may be hard to get the model to ignore those "baked-in" instructions at inference time.
The minimum is 10 examples; clear improvements typically show up with 50-100 examples.
A sensible start is to fine-tune on 50 carefully constructed examples and check whether the model improves. If it doesn't, rethink the task setup and rebuild the data before adding volume; scale up the dataset only once you see gains.
The API provides endpoints for managing fine-tuning jobs and calling the resulting models (details omitted here; see the sketch below).
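For reference, a minimal sketch of that workflow with the openai Python package (v1-style client); the training file name is a placeholder:

```python
import time

from openai import OpenAI

client = OpenAI()

# Upload the JSONL training file.
train_file = client.files.create(file=open("marv_train.jsonl", "rb"), purpose="fine-tune")

# Create the fine-tuning job.
job = client.fine_tuning.jobs.create(training_file=train_file.id, model="gpt-3.5-turbo")

# Poll until the job finishes; the finished job carries the new model's name.
while job.status not in ("succeeded", "failed", "cancelled"):
    time.sleep(60)
    job = client.fine_tuning.jobs.retrieve(job.id)
print(job.status, job.fine_tuned_model)

# Call the fine-tuned model like any other chat model.
resp = client.chat.completions.create(
    model=job.fine_tuned_model,
    messages=[{"role": "user", "content": "What's the capital of France?"}],
)
print(resp.choices[0].message.content)
```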
Fine-tuning on top of an already fine-tuned model is not supported yet; support is planned.
We provide the following training metrics computed over the course of training: training loss, training token accuracy, test loss, and test token accuracy. These statistics are meant to provide a sanity check that training went smoothly (loss should decrease, token accuracy should increase).
```json
{
    "object": "fine_tuning.job.event",
    "id": "ftevent-abc-123",
    "created_at": 1693582679,
    "level": "info",
    "message": "Step 100/100: training loss=0.00",
    "data": {
        "step": 100,
        "train_loss": 1.805623287509661e-5,
        "train_mean_token_accuracy": 1.0
    },
    "type": "metrics"
}
```
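These events can be pulled over the API; a sketch, assuming the v1 Python client and a placeholder job id ("metrics" events carry the per-step numbers shown above, other events are plain logs):

```python
from openai import OpenAI

client = OpenAI()

for event in client.fine_tuning.jobs.list_events(fine_tuning_job_id="ftjob-abc123", limit=50):
    if event.type == "metrics":
        # event.data is the dict shown in the example event above.
        print(event.data["step"], event.data["train_loss"], event.data["train_mean_token_accuracy"])
    else:
        print(event.level, event.message)
```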
Besides checking results on held-out test examples, the OpenAI evals framework is worth considering for evaluation.
If the results are unsatisfactory, consider the following:
We expect a similar amount of improvement every time you double the number of training examples. You can loosely estimate the expected quality gain from increasing the training data size by fine-tuning on your current dataset, fine-tuning on half of your current dataset, and observing the quality gap between the two.
In general, if you have to make a trade-off, a smaller amount of high-quality data is generally more effective than a larger amount of low-quality data. In short: a little high-quality data > a lot of low-quality data.
We recommend initially training without specifying the number of epochs, allowing us to pick a default for you based on dataset size, then adjusting if you observe the following: if the model follows the training data less than expected, increase the number of epochs by 1 or 2 (more common for tasks with a single ideal completion); if the model becomes less diverse than expected, decrease it by 1 or 2 (more common for tasks with a wide range of acceptable completions).
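If you do end up pinning the epoch count, it can be passed when creating the job via the `hyperparameters` field of the fine-tuning jobs API; a sketch reusing the `client` and `train_file` objects from the earlier example:

```python
job = client.fine_tuning.jobs.create(
    training_file=train_file.id,
    model="gpt-3.5-turbo",
    # Decrease by 1-2 if outputs became less diverse than expected;
    # increase by 1-2 if the model under-follows the training data.
    hyperparameters={"n_epochs": 2},
)
```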
Fine-tunable models: gpt-3.5-turbo-0613 (recommended), babbage-002, and davinci-002. GPT-4 fine-tuning is expected to open up later this year.
Token limits: Each training example is limited to 4096 tokens. The maximum number of total tokens trained per job is 50 million tokens (tokens_in_dataset * n_epochs).
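A rough way to check the 4096-token limit per example is the tiktoken package; note this ignores the few tokens of per-message chat-format overhead, so treat it as an approximation (file name is a placeholder):

```python
import json

import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

with open("marv_train.jsonl") as fh:
    for i, line in enumerate(fh):
        messages = json.loads(line)["messages"]
        n_tokens = sum(len(enc.encode(m["content"])) for m in messages)
        if n_tokens > 4096:
            print(f"example {i}: ~{n_tokens} tokens, will be truncated at 4096")
```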
Estimate costs
Pricing, per 1K tokens:
| Model | Training | Input | Output |
| --- | --- | --- | --- |
| GPT-4 (8K context) | - | $0.03 | $0.06 |
| GPT-3.5 Turbo (4K context) | - | $0.0015 | $0.0020 |
| GPT-3.5 Turbo (fine-tuning) | $0.0080 | $0.0120 | $0.0160 |
After training, calling the fine-tuned model costs 8x the base gpt-3.5-turbo price ($0.012 vs $0.0015 per 1K input tokens, $0.016 vs $0.002 per 1K output tokens).
Training cost: base cost per 1K tokens * number of tokens in the input file * number of epochs trained.
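For example (the worked number from the OpenAI docs): a gpt-3.5-turbo job with a 100,000-token training file run for 3 epochs costs about $0.008 × 100 × 3 = $2.40.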
Finetuning ChatGPT-3.5 brought it up from 22% → 47% on the Gorilla Hugging Face API evaluation dataset, cool!
Full details and code here: Does Finetuning ChatGPT-3.5 on Gorilla improve api and tool performance? (see the charts there; the fine-tuned model beats GPT-4 on this dataset)
Still not indicative that finetuning can make it as useful as GPT-4's `funcs` for tool use, but it's promising!
ML_Bear from Japan (Kaggle Master / ML engineer) gave fine-tuning a try but got poor results. See the blog post ChatGPT の Fine-tuning を試したけど上手くいかなかった話 (roughly, "I tried fine-tuning ChatGPT but it didn't work out").
The author took the Wikipedia article on the then-recent 2023 FIFA Women's World Cup, had ChatGPT generate QA pairs from it, and used them to fine-tune a question-answering bot.
60 examples were generated in total. Japanese prompts (for generating the QA pairs) did not work well on GPT-4, so the training data ended up entirely in English.
In actual use the fine-tuned model performed poorly; see the figures in the original post.
In an earlier post, the author had invented a fictional sport and generated Japanese QA pairs from its made-up definition for training. That model also performed poorly.
My own take: the failures come down to data construction. The positive case reused the dataset that Gorilla LLM itself was trained on. The two negative cases share the same shape: trying to teach GPT unknown knowledge through bare QA pairs, without the original source (the Wikipedia article in the first case, the definition of the fictional sport in the second). Quizzing GPT directly on unknown knowledge was never going to go well; constructing data this way gives the model too little information, and it also ignores the guide's advice to do prompt engineering first and then train with the best prompt found. Would including the full source document in the training set help? For example: Q: What are the rules of sport XX? A: the complete fictional rules. Or: paste the rules first, then ask questions about the sport (sketched below).
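To make that concrete, hypothetical training examples for the fictional-sport case might look like the following (format only; the contents are placeholders):

```jsonl
{"messages": [{"role": "user", "content": "What are the rules of sport XX?"}, {"role": "assistant", "content": "<the complete fictional rules>"}]}
{"messages": [{"role": "user", "content": "Rules of sport XX: <the complete fictional rules>\n\nHow is a point scored in sport XX?"}, {"role": "assistant", "content": "<an answer grounded in the pasted rules>"}]}
```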
I have a JSONL file with 3974 records containing 3974 short stories from 212 different well-known English-speaking authors, anonymized (several records per author):

```jsonl
{"prompt": "Write a text in the style of author_207", "completion": "A complete short story by this author"}
```

I submitted it to OpenAI's fine-tuning process, using two base models: first Curie and then Ada. Then, with the new models, I asked things like "Write a text in the style of author_207". With both, the results were terrible.
In short: the task setup was inappropriate.
Fine-tuning is an effective way to improve the specificity of a certain skill that the model is capable of performing but has not yet mastered. It can be used to teach a model highly specific terms and instructions and improve its capabilities. A good way to figure out if fine-tuning is going to work is by experimenting with prompt engineering. As a rule of thumb, if prompt engineering shows promising results, then fine-tuning will likely be effective for the given task. If prompt engineering isn't helping, fine-tuning likely isn't right for your use-case.
Conversely, fine-tuning is not a good way to add new data or knowledge, such as the database schema or even detailed explanations of columns and their relationships to the model. Instead, this type of context information is best infused into a model using Retrieval Augmented Generation or RAG (see this recent blog post for a deep dive on RAG). Hence, a real-world solution will likely have to include both fine-tuning and RAG to achieve acceptable results.
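To round this off, a minimal RAG sketch in that vein (the embedding model, toy document store, and prompt wording are all illustrative; a fine-tuned model can be substituted for the chat model):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

# Toy "document store": in practice this would hold schema docs, column
# descriptions, etc., likely behind a vector database.
docs = [
    "Table orders: columns id, user_id, total_cents, created_at.",
    "Table users: columns id, email, signup_date.",
]

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data])

doc_vecs = embed(docs)

def answer(question: str) -> str:
    q = embed([question])[0]
    # Cosine similarity between the question and each stored document.
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    context = docs[int(sims.argmax())]
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # or a fine-tuned ft:gpt-3.5-turbo model
        messages=[
            {"role": "system", "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

print(answer("Which column stores when an order was placed?"))
```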