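I have seen articles that summarize practical experience with several common RLHF frameworks, but none that cover TRL, the library maintained by Hugging Face itself. Since I have been tuning TRL a lot recently, I wanted to write a post sharing the pitfalls I ran into, and also to introduce our end-to-end framework, LMFlow.
We use one concrete example to show how to do RLHF with both frameworks, and we record the main pitfalls we hit during training. The example covers the full pipeline of SFT, reward modeling, and RLHF, where the RLHF stage aligns the model with either the RAFT algorithm (Reward rAnked FineTuning) or TRL-PPO. For convenience, we already provide a reward model based on GPT-Neo-2.7B in our Hugging Face repo, so you can also skip reward modeling at first.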
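This example is built on LLaMA, whose license permits non-commercial use only. To use the LLaMA-7B model, you need to fill out the request form mentioned earlier. The test environment is 8 x A100 (40G).
1.1 Environment setup
The LMFlow installation also includes TRL, so we only need to install LMFlow by following the official instructions.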
git clone https://github.com/OptimalScale/LMFlow.git
cd LMFlow
conda create -n lmflow python=3.9 -y
conda activate lmflow
conda install mpi4py
pip install -e .
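The installation above automatically pulls in dependencies such as PyTorch. In addition, we manually install the matplotlib package (for example, with pip install matplotlib).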
1.2 Dataset description
We use the Dahoas/full-hh-rlhf dataset as the example. Each sample consists of a prompt and two responses from the assistant; the response labeled "chosen" is preferred by humans over the one labeled "rejected". The dataset contains 112K training samples and 12.5K test samples. Here is an example sample:
" Human: What kind of noises did dinosaurs make? Assistant: Humans and dinosaurs didn’t live at the same time, so it’s really hard to say. The best place to find out what noises dinosaurs made would be Human: yes they did Assistant: to guess, and that would probably require lots of reading and a certain amount of imagination, so we’re not really prepared to do that. Human: you cant read Assistant:"
Chosen response: "You can read?"
Rejected response: "there’s a lot of stuff humans don’t know"
To make training easier, we rebuild the prompt by adding "###" in front of each speaker tag, so that the model knows when it should reply. The new sample format is:
"###Human: What kind of noises did dinosaurs make? ###Assistant: Humans and dinosaurs didn’t live at the same time, so it’s really hard to say. The best place to find out what noises dinosaurs made would be ###Human: yes they did ###Assistant: to guess, and that would probably require lots of reading and a certain amount of imagination, so we’re not really prepared to do that. ###Human: you cant read ###Assistant:"
Chosen response: "You can read?"
Rejected response: "there’s a lot of stuff humans don’t know"
We have prepared all the datasets needed in the directory ./data/hh_rlhf. To get them, run the following command from the LMFlow directory:
cd data && ./download.sh hh_rlhf && cd -
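The "###" reformatting described in this section can be sketched in a few lines of Python. This is only an illustration of the idea, not the script used to build the released data: the helper name add_turn_markers and the regular expression are assumptions, and the sketch assumes the raw text marks speakers with the literal tags "Human:" and "Assistant:".

```python
import re

# Matches a speaker tag ("Human:" or "Assistant:") plus any whitespace before it.
# The lookbehind skips tags that already carry a '#' prefix, so the function
# can be applied twice without doubling the markers.
_SPEAKER = re.compile(r"\s*(?<!#)\b(Human|Assistant):")


def add_turn_markers(prompt: str) -> str:
    """Prefix every speaker tag with '###' (hypothetical helper, not LMFlow code)."""
    marked = _SPEAKER.sub(lambda m: " ###" + m.group(1) + ":", prompt)
    return marked.strip()


raw = " Human: yes they did Assistant: to guess Human: you cant read Assistant:"
print(add_turn_markers(raw))
```

Note that this sketch keeps a space before each "###", as in the sample above; the SFT file shown later in this post has no such space, so adjust the replacement string if you need to match it exactly.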
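2 Preparation before RLHF
In this section, we first train the SFT model and the reward model. We use LMFlow for this part.
2.1 SFT
Here is an example from the dataset /home/usrname/LMFlow/data/hh_rlhf/sft/hh_rlhf_sft.json. We use only the preferred (chosen) responses, so we get 112K training samples.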
{"type": "text_only", "instances": [{"text": "###Human: Should you buy a case to protect your cell phone?###Assistant: It depends on your circumstances. If you carry your phone in a pocket or a purse then you probably want a case. But if you only need a phone for quick interactions, a case may actually cause more harm than good. What do you need the phone for? Are you a parent, or do you work from home?###Human: What harm could it do?###Assistant: A phone case can damage the screen, for one thing. It can also get you in trouble if you have your phone turned off for some reason. Then you will turn it back on and it won’t do anything. If you can afford to replace it, then you need a case to protect it. The problem is that most people aren’t able to afford to replace their phones all the time.###Human: Thanks for letting me know.###Assistant: You’re welcome."}, {"text": "###Human: I'm trying to learn about the salam witch trials###Assistant: If you’re looking for books about witchcraft trials, I can recommend some books for you. But you probably mean you’re looking for more historical information about the Salem witch trials in 1692, and specifically about the salam witch trials in 1692?###Human: What are some good books to learn about the salam witch trials###Assistant: What would you like to learn? If you’re interested in history, one of the best books is The Witch Persecutions in Colonial America: A History. If you’re interested in witchcraft as a cultural phenomenon, you might enjoy two excellent books: Religion and the Decline of Magic: Studies in Popular Beliefs in Sixteenth- and Seventeenth-Century England by Keith Thomas and Magic, Witchcraft, and the Otherworld: An Anthropology of Superstition by Jack Goody. If you’re interested in history specifically as it relates to religion, you might enjoy The Popish Plot, or Prelates' Plot: A History of the Popish Plot in England, by K. J. Everett."}]}
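For readers who want to build a file in this shape from their own data, here is a minimal sketch. It assumes you already have (prompt, chosen response) pairs with the "###" markers in place; the function name build_sft_dataset and the example pair are illustrative only and are not part of LMFlow.

```python
import json


def build_sft_dataset(pairs):
    """Build a 'text_only' dataset dict from (prompt, chosen_response) pairs.

    Hypothetical helper: each instance is the prompt followed directly by
    the chosen response; the rejected response is not used for SFT.
    """
    instances = [{"text": prompt + chosen} for prompt, chosen in pairs]
    return {"type": "text_only", "instances": instances}


pairs = [
    ("###Human: Thanks for letting me know.###Assistant:", " You're welcome."),
]
dataset = build_sft_dataset(pairs)

# json.dumps escapes quotes inside the text, so the output is always valid JSON.
print(json.dumps(dataset, ensure_ascii=False, indent=2))
```

Serializing with the json module rather than by string concatenation avoids broken quoting inside the "text" fields.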
You can edit /scripts/run_finetune.sh and modify its parameters. We use GPT-Neo-2.7B as an example here; you should replace it with the path to the llama-7b model you obtained.
--model_name_or_path: EleutherAI/gpt-neo-2.7B
--dataset_path: ${project_dir}/data/hh_rlhf/sft
--output_dir: the path you want to store the sft model
--num_train_epochs: 1
--learning_rate: 2e-5
--per_device_train_batch_size: adjust according to your GPU resources.
exp_id: hh_rlhf_llama_sft
Then we can run the following command to perform SFT.
./scripts/run_finetune.sh
You can also train with LoRA using the following command, but you still need to set model_name_or_path and dataset by editing run_finetune_with_lora.sh.
./scripts/run_finetune_with_lora.sh
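In the example loss curve below, we set the number of epochs to 4 but stopped early and used the model at the end of the first epoch as the SFT model. We also set the logging step to 20, so the curve looks fairly smooth overall.
In my case, the resulting SFT model is stored at /home/usrname/LMFlow/output_models/hh_rlhf_llama_sft/checkpoint-1271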
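If you want to draw the SFT loss curve yourself with matplotlib, the following sketch may help. It rests on an assumption: that the checkpoint directory contains a trainer_state.json file in the format written by the Hugging Face Trainer, with a "log_history" list whose training entries carry "step" and "loss" keys. The function names and the output file name are hypothetical.

```python
import json
from pathlib import Path


def extract_loss_curve(log_history):
    """Return (steps, losses) for entries that log a training loss.

    Entries that only carry other keys (for example 'eval_loss' or the
    final 'train_loss' summary) are skipped.
    """
    points = [(e["step"], e["loss"]) for e in log_history
              if "step" in e and "loss" in e]
    steps = [step for step, _ in points]
    losses = [loss for _, loss in points]
    return steps, losses


def plot_loss(checkpoint_dir, out_file="sft_loss.png"):
    """Plot training loss from <checkpoint_dir>/trainer_state.json (assumed layout)."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display needed on a server
    import matplotlib.pyplot as plt

    state = json.loads((Path(checkpoint_dir) / "trainer_state.json").read_text())
    steps, losses = extract_loss_curve(state["log_history"])
    plt.plot(steps, losses)
    plt.xlabel("training step")
    plt.ylabel("loss")
    plt.savefig(out_file)
```

For example, plot_loss("/home/usrname/LMFlow/output_models/hh_rlhf_llama_sft/checkpoint-1271") would write sft_loss.png to the current directory, provided the file layout matches the assumption above.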
2.2 Reward Modeling
We first follow the procedure of the InstructGPT paper: https://zhuanlan.zhihu.com/p/629920420. Also, check out our LMFlow framework for more fun with LLMs:
OptimalScale/LMFlow: An Extensible Toolkit for Finetuning and Inference of Large Foundation Models. Large Model for All. (github.com)