Announcement_5
Stronger SFT can hurt downstream RL. Our PEAR reweighting makes SFT checkpoints better starting points for RL and improves post-RL performance. New preprint: "Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning".