LLM-as-a-judge is alarmingly easy to game: rewriting an agent’s chain-of-thought, without changing its actions or observations, can inflate false positives by up to 90%. See our new preprint, “Gaming the Judge.”