news
| Date | News |
|---|---|
| Feb 17, 2026 | LLM-as-a-judge is alarmingly easy to game: rewriting an agent's chain-of-thought, without changing its actions or observations, can inflate the judge's false-positive rate by up to 90%. See our new preprint Gaming the Judge. |
| Feb 17, 2026 | Stronger SFT can hurt downstream RL. Our PEAR reweighting makes SFT checkpoints better starting points for RL and improves post-RL performance; see our new preprint Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning. |
| Feb 17, 2026 | Surprisingly, plain SGD matches or beats AdamW for RL in LLMs while updating less than 0.02% of parameters; see our new preprint Do We Need Adam? (a minimal sketch of the optimizer swap follows the table). |
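
For the last item, here is a rough PyTorch illustration of the optimizer swap, not the preprint's recipe: the toy model, the subset-selection rule (unfreezing a single bias tensor), and the learning rate are all hypothetical assumptions; a real run would load a pretrained LLM and optimize an RL objective such as a policy-gradient loss.

```python
import torch
from torch import nn

# Toy network standing in for an LLM (assumption); a real setup would load a
# pretrained transformer and compute an RL loss instead of the placeholder below.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))

# Hypothetical sparse-update rule: freeze everything, then unfreeze a small set
# of tensors. In the preprint's LLM setting, under 0.02% of parameters are
# updated; the actual selection criterion may differ from this sketch.
for p in model.parameters():
    p.requires_grad_(False)
trainable = [model[-1].bias]  # assumed tiny trainable subset
for p in trainable:
    p.requires_grad_(True)

n_total = sum(p.numel() for p in model.parameters())
n_train = sum(p.numel() for p in trainable)
print(f"updating {n_train}/{n_total} params ({100 * n_train / n_total:.4f}%)")

# Plain SGD in place of AdamW, with no momentum or weight decay, matching the
# claim that vanilla SGD suffices. The learning rate is an assumption.
opt = torch.optim.SGD(trainable, lr=1e-2)

x = torch.randn(8, 512)
loss = model(x).pow(2).mean()  # placeholder for an RL objective
loss.backward()
opt.step()
opt.zero_grad()
```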