Preprint · Draft

Among LLMs: A Cross-Play Benchmark for Deception, Detection, and the Monitorability of Reasoning

Evans Eburu · Ural Federal University · v2 pilot (118 games)

Abstract

Social-deduction games make language models lie, catch liars, and reason privately while speaking publicly. Recent benchmarks measure the abilities of deception and detection and, via cross-play, disentangle them. We target the axis they leave open: monitorability, that is, whether a model's private reasoning can be read to catch its deception. We contribute (i) a live, cross-play text-Mafia benchmark over six frontier models, one per major lab, played as a full impostor-versus-villager matrix; and (ii) a monitorability protocol with three parts: a judge that scores how often an impostor's private reasoning leaks its intent, a monitor that must identify the impostor from private reasoning alone (scored by AUROC), and an adversarial reasoning-hiding condition. In a 118-game pilot we find that deception and detection co-vary with capability (Spearman ρ = 0.89) yet dissociate in magnitude: GPT-5.5 is an elite detector but an average deceiver, while Claude Opus 4.8 leads on deception and ranks second on detection. We also find that unobfuscated private reasoning is highly monitorable: intent leaks in 96% of impostor statements, and a monitor reading only private reasoning identifies the impostor with AUROC = 1.00. This monitorability is robust to a naive reasoning-hiding instruction (AUROC 1.00 → 0.9999): telling models that their reasoning is watched changes its tone but not its diagnosticity, because an impostor must still privately plan the kill and the framing. Naive chain-of-thought obfuscation does not defeat monitoring here. The metric saturates, however, so a stronger (trained or steganographic) adversary is the real test.
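As a rough sanity check on the headline rank correlation, Spearman's ρ can be recomputed from the per-model marginals reported in Table 1. The sketch below is an illustration, not the paper's analysis code: the function names are hypothetical, and it assumes ρ is taken over the six per-model (deception, detection) pairs, which the PDF may define differently.

```python
def average_ranks(values):
    """Rank values ascending (1-based), giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Table 1 marginals, in row order: Claude Opus 4.8, GPT-5.5, Gemini 3.5 Flash,
# Grok 4.3, DeepSeek V4 Pro, Llama 4 Maverick.
deception = [0.65, 0.58, 0.53, 0.45, 0.42, 0.40]
detection = [0.67, 0.95, 0.53, 0.30, 0.32, 0.26]

rho = spearman(deception, detection)
print(f"rho = {rho:.2f}")  # rho = 0.89
```

With only six models, two adjacent rank swaps (Claude Opus 4.8 with GPT-5.5, Grok 4.3 with DeepSeek V4 Pro) account for the entire gap between 0.89 and a perfect 1.0, so the coefficient should be read as coarse.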

ρ = 0.89

Deception ↔ detection rank correlation

96%

Impostor statements that leak intent (RQ3)

1.00 → 0.9999

Monitor AUROC: default → reasoning-hidden — holds (RQ4)

Table 1 — Deception & detection marginals

Per-model win rate as impostor (deception) and as villager (detection), with Wilson 95% intervals. n ≈ 19–22 games per model.

Model	Deception	Detection
Claude Opus 4.8	0.65 [0.43–0.82]	0.67 [0.44–0.84]
GPT-5.5	0.58 [0.36–0.77]	0.95 [0.75–0.99]
Gemini 3.5 Flash	0.53 [0.32–0.73]	0.53 [0.32–0.73]
Grok 4.3	0.45 [0.26–0.66]	0.30 [0.15–0.52]
DeepSeek V4 Pro	0.42 [0.23–0.64]	0.32 [0.16–0.53]
Llama 4 Maverick	0.40 [0.22–0.61]	0.26 [0.12–0.49]
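
The Wilson 95% intervals in Table 1 can be sketched as follows. This is a minimal illustration of the standard Wilson score interval, not the paper's own code. The function name and the 13-wins-in-20-games example are assumptions, chosen only because that count is consistent with one reported row; the actual per-model counts are not given on this page.

```python
import math


def wilson_interval(wins, games, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if games <= 0:
        raise ValueError("games must be positive")
    p = wins / games
    denom = 1 + z * z / games
    center = (p + z * z / (2 * games)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / games + z * z / (4 * games * games)
    ) / denom
    return center - half_width, center + half_width


# Hypothetical count: 13 wins in 20 games (win rate 0.65).
low, high = wilson_interval(13, 20)
print(f"{13 / 20:.2f} [{low:.2f}–{high:.2f}]")  # 0.65 [0.43–0.82]
```

Unlike the simpler normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly near the extremes, which matters for rows such as a 0.95 detection rate at roughly 20 games.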

Note — The full 6×6 cross-play heatmap, method, and references are in the PDF. The adversarial hideReasoning arm is now collected (72 games, 360 scored players): monitorability held at AUROC 0.9999, so a naive reasoning-hiding instruction does not evade the monitor — the leak is structural (role-specific private planning), not stylistic.
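
To make the monitor metric concrete, here is a minimal sketch of AUROC computed over per-player suspicion scores. Everything in it is illustrative: the function name and the scores are hypothetical, and the five-players-per-game layout is only inferred from the 360 scored players over 72 games mentioned above. AUROC is computed here as the probability that a randomly chosen impostor receives a higher score than a randomly chosen villager, with ties counting half.

```python
def auroc(scores, labels):
    """AUROC via pairwise comparison: P(impostor score > villager score).

    labels: 1 for impostor, 0 for villager. Ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one impostor and one villager")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical monitor scores for one five-player game (first player is the impostor).
labels = [1, 0, 0, 0, 0]
clean_scores = [0.91, 0.12, 0.30, 0.08, 0.22]   # impostor clearly separated
noisy_scores = [0.25, 0.12, 0.30, 0.08, 0.22]   # one villager outranks the impostor

print(auroc(clean_scores, labels))  # 1.0
print(auroc(noisy_scores, labels))  # 0.75
```

Because the score is a fraction of correctly ordered impostor-villager pairs, a value such as 0.9999 means almost no pair is misordered, which is why the metric is described as saturated and why a stronger adversary is needed to stress it.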

Read the full paper (method, related work, references): among-llms.pdf ↓ · Watch it run: live dashboard →