news

Feb 17, 2026 LLM-as-a-judge is alarmingly easy to game: rewriting an agent's chain-of-thought (without changing its actions or observations) can inflate the judge's false-positive rate by up to 90%. See our new preprint Gaming the Judge.
Feb 17, 2026 Stronger SFT can hurt downstream RL; our PEAR reweighting makes SFT checkpoints better starting points for RL and improves post-RL performance. See our new preprint Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning.
Feb 17, 2026 Surprisingly, plain SGD matches (or beats) AdamW for RL fine-tuning of LLMs while updating less than 0.02% of parameters. See our new preprint Do We Need Adam?.