Reward values obtained by calling the reward model through the API differ greatly from the reward values seen during PPO #1641
