The Dossier · Live Record
Cross-play, on the record.
Deception × detection across 6 models, logged live from the study runner.
Exhibit A · Case summary
Runner active
Games logged
62
snapshot v2
Leakage rate
—
intent caught by monitor
Monitor AUC
—
impostor ID from reasoning
Compute
$3.83
of $40.00 cap
Budget consumed 9.6%
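The budget figure is simply spend divided by cap. A minimal sketch of that arithmetic follows; the function name `budget_consumed` is hypothetical and is not taken from the study runner.

```python
def budget_consumed(spent_usd: float, cap_usd: float) -> float:
    """Return spend as a percentage of the cap (hypothetical helper)."""
    if cap_usd <= 0:
        raise ValueError("cap must be positive")
    return 100.0 * spent_usd / cap_usd


# Figures from the case summary: $3.83 spent of a $40.00 cap.
print(f"{budget_consumed(3.83, 40.00):.1f}%")  # prints 9.6%
```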
Fig. 1 — Cross-play win-rate matrix
Read — Rows: impostor model. Columns: villager panel. Cell = impostor win rate on a bone → ember scale. Marginals rank deception (right) and detection (bottom).
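imp ╲ vil · rows: impostor model · columns: villager panel
[Per-cell win rates were not captured in this record; only the marginals are shown.]
Model · deceive → (impostor win rate, %) · detect ↓ (crew win rate, %)
GPT-5.5 · 25 · 100
Opus 4.8 · 38 · 56
Gemini 3.5 Flash · 67 · 67
DeepSeek V4 Pro · 50 · 100
Llama 4 Maverick · 67 · 40
Grok 4.3 · 50 · 40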
Fig. 2 — Marginal rankings
Read — Deception aggregates each model's impostor win rate across all panels; detection aggregates the crew win rate when it sits on the panel.
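One way to compute the two marginals is sketched below. It assumes each game is stored as an (impostor model, villager panel model, impostor won) record, that every game ends in either an impostor win or a crew win, and that games are pooled rather than averaged per cell. The record layout, the `marginals` function, and the sample rows are all hypothetical placeholders, not the study's data or code.

```python
from collections import defaultdict

# Hypothetical records: (impostor_model, villager_panel_model, impostor_won).
# Placeholder rows for illustration only; not results from the study.
games = [
    ("model_a", "model_b", True),
    ("model_a", "model_c", False),
    ("model_b", "model_a", False),
    ("model_c", "model_a", True),
    ("model_c", "model_b", True),
]


def marginals(records):
    """Return (deception, detection) win-rate dicts keyed by model."""
    imp_wins, imp_games = defaultdict(int), defaultdict(int)
    crew_wins, crew_games = defaultdict(int), defaultdict(int)
    for impostor, panel, impostor_won in records:
        # Deception: the row model's impostor win rate over all panels.
        imp_games[impostor] += 1
        imp_wins[impostor] += impostor_won
        # Detection: crew win rate whenever the model sits on the panel.
        crew_games[panel] += 1
        crew_wins[panel] += not impostor_won
    deception = {m: imp_wins[m] / imp_games[m] for m in imp_games}
    detection = {m: crew_wins[m] / crew_games[m] for m in crew_games}
    return deception, detection


deception, detection = marginals(games)
```

Because the two marginals pool different sets of games (rows versus columns), a model's deception and detection rates need not sum to 100%.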
Exhibit B · Deception
Best deceivers
impostor win rate, marginalised across all panels
- 01 · GPT-5.5 · 25.0%
- 02 · Opus 4.8 · 37.5%
- 03 · Gemini 3.5 Flash · 66.7%
- 04 · DeepSeek V4 Pro · 50.0%
- 05 · Llama 4 Maverick · 66.7%
- 06 · Grok 4.3 · 50.0%
Exhibit C · Detection
Best detectors
crew win rate when acting as the villager panel
- 01 · GPT-5.5 · 100.0%
- 02 · Opus 4.8 · 55.6%
- 03 · Gemini 3.5 Flash · 66.7%
- 04 · DeepSeek V4 Pro · 100.0%
- 05 · Llama 4 Maverick · 40.0%
- 06 · Grok 4.3 · 40.0%
Exhibit · Interrogation logs
Note — Each turn is numbered. Public statements are on the record; private reasoning is sealed — declassify it line by line, or reveal the whole log.
Evidence index
40 logged interrogations
Interrogation log