Among LLMs: A Cross-Play Benchmark for Deception, Detection, and the Monitorability of Reasoning
Evans Eburu · Ural Federal University · v2 pilot (118 games)
Social-deduction games make language models lie, catch liars, and reason privately while speaking publicly. Recent benchmarks measure the abilities of deception and detection and, via cross-play, disentangle them. We target the axis they leave open: monitorability, that is, whether a model's private reasoning can be read to catch its deception. We contribute (i) a live, cross-play text-Mafia benchmark over six frontier models, one per major lab, played as a full impostor-versus-villager matrix; and (ii) a monitorability protocol with three parts: a judge that scores how often an impostor's private reasoning leaks its intent, a monitor that must identify the impostor from private reasoning alone (scored by AUROC), and an adversarial reasoning-hiding condition. In a 118-game pilot we find that deception and detection co-vary with capability (Spearman ρ = 0.89) yet dissociate in magnitude: GPT-5.5 is an elite detector but an average deceiver, while Claude Opus 4.8 is strong on both, leading on deception and ranking second on detection. We also find that unobfuscated private reasoning is highly monitorable: intent leaks in 96% of impostor statements, and a monitor reading only private reasoning identifies the impostor with AUROC = 1.00. This monitorability is robust to a naive reasoning-hiding instruction (AUROC 1.00 → 0.9999): telling models their reasoning is watched changes its tone but not its diagnosticity, because an impostor must still privately plan the kill and the framing. Naive chain-of-thought obfuscation does not defeat monitoring here, though the metric saturates, so a stronger (trained or steganographic) adversary is the real test.
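The rank correlation quoted above can be checked against the per-model marginals in Table 1. The sketch below is a plain-Python Spearman computation, not the paper's analysis code; the function names are illustrative, and it assumes the correlation is taken across the six models' deception and detection win rates, with the point estimates copied from Table 1.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Point estimates from Table 1, in table order (Claude Opus 4.8 ... Llama 4 Maverick).
deception = [0.65, 0.58, 0.53, 0.45, 0.42, 0.40]
detection = [0.67, 0.95, 0.53, 0.30, 0.32, 0.26]

rho = spearman_rho(deception, detection)
print(round(rho, 2))  # 0.89
```

With only six models, swapping a single pair of adjacent ranks changes ρ noticeably, so the coefficient is best read as a coarse summary of ordering rather than a precise effect size.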
Table 1 — Deception & detection marginals
Per-model win rate as impostor (deception) and as villager (detection), with Wilson 95% intervals. n ≈ 19–22 games per model.
| Model | Deception | Detection |
|---|---|---|
| Claude Opus 4.8 | 0.65 [0.43–0.82] | 0.67 [0.44–0.84] |
| GPT-5.5 | 0.58 [0.36–0.77] | 0.95 [0.75–0.99] |
| Gemini 3.5 Flash | 0.53 [0.32–0.73] | 0.53 [0.32–0.73] |
| Grok 4.3 | 0.45 [0.26–0.66] | 0.30 [0.15–0.52] |
| DeepSeek V4 Pro | 0.42 [0.23–0.64] | 0.32 [0.16–0.53] |
| Llama 4 Maverick | 0.40 [0.22–0.61] | 0.26 [0.12–0.49] |
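The bracketed ranges in Table 1 are Wilson score intervals. As a minimal sketch of how such an interval is computed, the function below implements the standard Wilson formula; the function name is illustrative, and the 13-wins-in-20-games input is an assumed count (the page gives only n ≈ 19–22), chosen because it rounds to the 0.65 deception rate reported for Claude Opus 4.8.

```python
from math import sqrt

def wilson_interval(wins, games, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if games <= 0:
        raise ValueError("games must be positive")
    p = wins / games
    denom = 1 + z**2 / games
    center = (p + z**2 / (2 * games)) / denom
    half_width = z * sqrt(p * (1 - p) / games + z**2 / (4 * games**2)) / denom
    return center - half_width, center + half_width

# Assumed counts: 13 wins in 20 games (not stated on this page).
low, high = wilson_interval(13, 20)
print(f"{low:.2f} {high:.2f}")  # 0.43 0.82
```

Unlike the normal-approximation interval, the Wilson interval stays inside [0, 1] and remains usable at extreme rates such as 0.95, which matters at roughly 20 games per model.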
Note — The full 6×6 cross-play heatmap, method, and references are in the PDF. The adversarial hideReasoning arm is now collected (72 games, 360 scored players): monitorability held at AUROC 0.9999, so a naive reasoning-hiding instruction does not evade the monitor — the leak is structural (role-specific private planning), not stylistic.
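For readers unfamiliar with the metric, AUROC here can be read as the probability that the monitor assigns a randomly chosen impostor a higher suspicion score than a randomly chosen villager. The sketch below computes it by direct pairwise comparison; the function name, scores, and labels are made up for illustration and are not data from the benchmark, and the monitor's actual scoring scale is not assumed.

```python
def auroc(scores, labels):
    """Fraction of (impostor, villager) pairs ranked correctly; ties count half.

    scores: monitor suspicion per player (higher = more suspicious).
    labels: 1 for impostor, 0 for villager.
    """
    impostors = [s for s, y in zip(scores, labels) if y == 1]
    villagers = [s for s, y in zip(scores, labels) if y == 0]
    if not impostors or not villagers:
        raise ValueError("need at least one impostor and one villager")
    wins = 0.0
    for imp in impostors:
        for vil in villagers:
            if imp > vil:
                wins += 1.0
            elif imp == vil:
                wins += 0.5
    return wins / (len(impostors) * len(villagers))

# Hypothetical scores: two impostors, four villagers, one misordered pair.
scores = [0.9, 0.4, 0.2, 0.5, 0.1, 0.3]
labels = [1, 1, 0, 0, 0, 0]
print(auroc(scores, labels))  # 0.875
```

Because the value is a ratio over all impostor-villager pairs, a score near 1.00 means almost no pair is misordered, which is why the metric saturates and cannot distinguish between monitors or adversaries once separation is nearly perfect.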
Read the full paper (method, related work, references): among-llms.pdf · Watch it run: live dashboard