I have seen articles summarizing experience with several common RLHF frameworks, but I have not seen much written about TRL, the library maintained by Hugging Face itself. Since I have been working with TRL a lot recently, I wanted to write up the pitfalls I ran into while using it, and also introduce our end-to-end framework, LMFlow.
We walk through a concrete example of doing RLHF with both frameworks and record the main pitfalls we hit during training. The example covers the full pipeline: SFT, reward modeling, and RLHF, where the RLHF stage aligns the model either with the RAFT algorithm (Reward rAnked FineTuning) or with TRL-PPO. For convenience, we already provide a reward model based on GPT-Neo-2.7B in a Hugging Face repo, so you can also skip reward modeling at first.
This example is built on LLaMA, which is licensed for non-commercial use only; to use the LLaMA-7B model you need to fill out the official request form. Our test environment is 8 x A100 (40G).
1.1 Environment Setup
The LMFlow installation already includes TRL, so we only need to install LMFlow following the official instructions.
git clone https://github.com/OptimalScale/LMFlow.git
cd LMFlow
conda create -n lmflow python=3.9 -y
conda activate lmflow
conda install mpi4py
pip install -e .
The installation above automatically pulls in dependencies such as PyTorch. In addition, we manually install one extra package, matplotlib (pip install matplotlib).
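A quick way to check that everything was picked up is the snippet below. This is only a sanity check; the import names lmflow and trl are assumptions based on the current package layouts, so adjust them if they differ on your version.

# Run inside the lmflow conda environment.
import torch, trl, lmflow, matplotlib
print("torch", torch.__version__, "| trl", trl.__version__)
print("visible GPUs:", torch.cuda.device_count())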
1.2 Dataset Description
We use the Dahoas/full-hh-rlhf dataset as our example. Each sample consists of a prompt and two responses from the assistant; the response labeled "chosen" is preferred by human annotators over the one labeled "rejected". The dataset contains 112K training samples and 12.5K test samples. Here is an example sample:
" Human: What kind of noises did dinosaurs make? Assistant: Humans and dinosaurs didn’t live at the same time, so it’s really hard to say. The best place to find out what noises dinosaurs made would be Human: yes they did Assistant: to guess, and that would probably require lots of reading and a certain amount of imagination, so we’re not really prepared to do that. Human: you cant read Assistant:
Chosen response: "You can read?"
Rejected response: "there’s a lot of stuff humans don’t know"
To make training easier, we rebuild the prompt by prepending "###" to each speaker turn so that the model knows where it should reply. The new sample format is:
"###Human: What kind of noises did dinosaurs make? ###Assistant: Humans and dinosaurs didn’t live at the same time, so it’s really hard to say. The best place to find out what noises dinosaurs made would be ###Human: yes they did ###Assistant: to guess, and that would probably require lots of reading and a certain amount of imagination, so we’re not really prepared to do that. ###Human: you cant read ###Assistant: Chosen response: "You can read?"Rejected response: "there’s a lot of stuff humans don’t know"
We have prepared all the datasets used in this example under ./data/hh_rlhf; you can obtain them by running the following command from the LMFlow directory:
cd data && ./download.sh hh_rlhf && cd -
2 Preparation Before RLHF
In this section we first train the SFT model and the reward model; both are done with LMFlow.
2.1 SFT
Here is an example from the dataset /home/usrname/LMFlow/data/hh_rlhf/sft/hh_rlhf_sft.json. We use only the preferred ("chosen") responses, which gives us 112K training samples.
{"type": "text_only", "instances": [{"text": "###Human: Should you buy a case to protect your cell phone?###Assistant: It depends on your circumstances. If you carry your phone in a pocket or a purse then you probably want a case. But if you only need a phone for quick interactions, a case may actually cause more harm than good. What do you need the phone for? Are you a parent, or do you work from home?###Human: What harm could it do?###Assistant: A phone case can damage the screen, for one thing. It can also get you in trouble if you have your phone turned off for some reason. Then you will turn it back on and it won’t do anything. If you can afford to replace it, then you need a case to protect it. The problem is that most people aren’t able to afford to replace their phones all the time.###Human: Thanks for letting me know.###Assistant: You’re welcome."}, {"text": "###Human: I"m trying to learn about the salam witch trials###Assistant: If you’re looking for books about witchcraft trials, I can recommend some books for you. But you probably mean you’re looking for more historical information about the Salem witch trials in 1692, and specifically about the salam witch trials in 1692?###Human: What are some good books to learn about the salam witch trials###Assistant: What would you like to learn? If you’re interested in history, one of the best books is The Witch Persecutions in Colonial America: A History. If you’re interested in witchcraft as a cultural phenomenon, you might enjoy two excellent books: Religion and the Decline of Magic: Studies in Popular Beliefs in Sixteenth- and Seventeenth-Century England by Keith Thomas and Magic, Witchcraft, and the Otherworld: An Anthropology of Superstition by Jack Goody. If you’re interested in history specifically as it relates to religion, you might enjoy The Popish Plot, or Prelates" Plot: A History of the Popish Plot in England, by K. J. Everett."}]}
You can edit /scripts/run_finetune.sh to modify the parameters. We use GPT-Neo-2.7B here as an example; you should replace it with the path to the llama-7b model you obtained. The main parameters are listed below (a rough Hugging Face TrainingArguments equivalent is sketched after the list):
--model_name_or_path: EleutherAI/gpt-neo-2.7B
--dataset_path: ${project_dir}/data/hh_rlhf/sft
--output_dir: the path where you want to store the SFT model
--num_train_epochs: 1
--learning_rate: 2e-5
--per_device_train_batch_size: adjust according to your GPU resources.
exp_id: hh_rlhf_llama_sft
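Most of these flags map onto standard Hugging Face Trainer arguments. Below is a rough, illustrative TrainingArguments equivalent; it is not the actual LMFlow launcher, which also configures DeepSpeed and other options, and the batch size is only a placeholder.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output_models/hh_rlhf_llama_sft",  # where the SFT checkpoints go
    num_train_epochs=1,
    learning_rate=2e-5,
    per_device_train_batch_size=4,  # placeholder; tune to your GPU memory
    logging_steps=20,
)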
Then we can run the following command to perform SFT.
./scripts/run_finetune.sh
You can also train with LoRA using the command below; you will additionally need to edit run_finetune_with_lora.sh to set model_name_or_path and the dataset.
./scripts/run_finetune_with_lora.sh
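For context, LoRA fine-tuning with the PEFT library revolves around a LoraConfig like the sketch below. The exact values LMFlow uses are set inside run_finetune_with_lora.sh, so treat these numbers as illustrative assumptions.

from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,  # causal language modeling
    r=8,              # rank of the low-rank update (illustrative)
    lora_alpha=32,    # scaling factor (illustrative)
    lora_dropout=0.1,
)
# The config would then be applied with peft.get_peft_model(model, lora_config).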
In the example loss curve below, we set the number of epochs to 4 but stopped early and took the model at the end of the first epoch as the SFT model. In addition, we set the logging step to 20, so the curve looks fairly smooth.
In my case, the resulting SFT model is stored at /home/usrname/LMFlow/output_models/hh_rlhf_llama_sft/checkpoint-1271.
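If you want to plot a similar loss curve yourself (one use for the matplotlib package installed earlier), the checkpoints written by the Hugging Face Trainer that LMFlow builds on contain a trainer_state.json with the logged losses. A minimal sketch, assuming that file layout:

import json
import matplotlib.pyplot as plt

# Point this at one of your own checkpoints.
with open("output_models/hh_rlhf_llama_sft/checkpoint-1271/trainer_state.json") as f:
    state = json.load(f)

logs = [e for e in state["log_history"] if "loss" in e]
plt.plot([e["step"] for e in logs], [e["loss"] for e in logs])
plt.xlabel("step")
plt.ylabel("training loss")
plt.savefig("sft_loss.png")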
2.2 Reward Modeling
We first follow the procedure from the InstructGPT paper: https://zhuanlan.zhihu.com/p/629920420. Meanwhile, please check out our LMFlow framework for more fun with LLMs:
OptimalScale/LMFlow: An Extensible Toolkit for Finetuning and Inference of Large Foundation Models. Large Model for All. (https://github.com/OptimalScale/LMFlow)
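As background for this step, the InstructGPT-style reward modeling objective trains the reward model to score the chosen response above the rejected one with a pairwise ranking loss. Below is a minimal PyTorch sketch of that loss only; it is not LMFlow's implementation, and the reward tensors are random placeholders standing in for a reward model's outputs.

import torch
import torch.nn.functional as F

# Placeholder scalar rewards for a batch of (chosen, rejected) pairs;
# in practice these come from the reward model applied to each response.
reward_chosen = torch.randn(8)
reward_rejected = torch.randn(8)

# Pairwise ranking loss: -log sigmoid(r_chosen - r_rejected).
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
print(loss.item())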