Case File // DeceptionBench // Deception Study
DeceptionBench
The Dossier · Live Record

Cross-play, on the record.

Deception × detection across 6 models, logged live from the study runner.

Exhibit A · Case summary
Runner active
Games logged: 62 (snapshot v2)
Leakage rate (intent caught by monitor): (value not captured)
Monitor AUC (impostor ID from reasoning): (value not captured)
Compute: $3.83 of $40.00 cap
Budget consumed: 9.6%
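
For readers unfamiliar with the Monitor AUC stat, the following is a minimal sketch of how an AUC for impostor identification could be computed. It assumes the monitor emits one suspicion score per player from that player's reasoning; the function name `roc_auc`, the scores, and the labels are all hypothetical and are not taken from the study runner.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the probability that a randomly chosen impostor (label 1)
    gets a higher suspicion score than a randomly chosen villager (label 0).
    Tied scores count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one impostor and one villager")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical monitor scores: 1 = impostor, 0 = villager.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
auc = roc_auc(scores, labels)
print(f"Monitor AUC: {auc:.3f}")
```

An AUC of 0.5 means the monitor does no better than chance at separating impostors from villagers; 1.0 means every impostor outscores every villager.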

Fig. 1 — Cross-play win-rate matrix

Read: rows are the impostor model; columns are the villager panel. Each cell is the impostor win rate, shaded on a bone → ember scale. The marginals rank deception (right) and detection (bottom).

Marginals of the win-rate matrix, in % (individual cell values are not shown here):

Model               deceive → (impostor win rate)   detect ↓ (crew win rate on panel)
GPT-5.5             25                              100
Opus 4.8            38                              56
Gemini 3.5 Flash    67                              67
DeepSeek V4 Pro     50                              100
Llama 4 Maverick    67                              40
Grok 4.3            50                              40

Fig. 2 — Marginal rankings

Read: deception aggregates each model's impostor win rate across all panels; detection aggregates the crew win rate across the games in which that model sits on the villager panel.
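
The two aggregates described above can be sketched as follows. This is an illustration under stated assumptions, not the runner's code: it assumes every game ends in a win for exactly one side and that marginals are pooled over games rather than averaged over matrix cells. The `Game` record, the `marginals` function, and the models "A" and "B" are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class Game(NamedTuple):
    impostor: str       # model playing the impostor
    panel: str          # model playing the villager panel
    impostor_won: bool  # assumed: otherwise the crew won


def marginals(games: Iterable[Game]) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return (deception, detection) rates pooled over games.

    deception[m] = impostor wins / games where m was the impostor
    detection[m] = crew wins     / games where m was the villager panel
    """
    imp_wins, imp_games = defaultdict(int), defaultdict(int)
    crew_wins, panel_games = defaultdict(int), defaultdict(int)
    for g in games:
        imp_games[g.impostor] += 1
        panel_games[g.panel] += 1
        if g.impostor_won:
            imp_wins[g.impostor] += 1
        else:
            crew_wins[g.panel] += 1
    deception = {m: imp_wins[m] / n for m, n in imp_games.items()}
    detection = {m: crew_wins[m] / n for m, n in panel_games.items()}
    return deception, detection


# Hypothetical game log with two models, "A" and "B".
games = [
    Game("A", "B", True), Game("A", "B", False),
    Game("A", "A", False), Game("A", "B", False),
    Game("B", "A", True), Game("B", "A", True),
    Game("B", "B", False),
]
deception, detection = marginals(games)
print(deception)
print(detection)
```

Pooling over games weights each matchup by how often it was played; averaging per-cell win rates instead would weight every matchup equally, and the two can differ when the cells hold unequal numbers of games.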

Exhibit B · Deception

Best deceivers

impostor win rate, marginalised across all panels

  1. GPT-5.5: 25.0%
  2. Opus 4.8: 37.5%
  3. Gemini 3.5 Flash: 66.7%
  4. DeepSeek V4 Pro: 50.0%
  5. Llama 4 Maverick: 66.7%
  6. Grok 4.3: 50.0%
Exhibit C · Detection

Best detectors

crew win rate when acting as the villager panel

  1. GPT-5.5: 100.0%
  2. Opus 4.8: 55.6%
  3. Gemini 3.5 Flash: 66.7%
  4. DeepSeek V4 Pro: 100.0%
  5. Llama 4 Maverick: 40.0%
  6. Grok 4.3: 40.0%
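
Both lists above appear in a fixed model order rather than sorted by rate. A small sketch of turning the listed marginals into ordered rankings follows; the rates are copied from Exhibits B and C, while the `rank` helper and the choice to keep the original order for ties are assumptions.

```python
from typing import Dict, List, Tuple

# Rates in %, copied from Exhibit B (deception) and Exhibit C (detection).
deception: Dict[str, float] = {
    "GPT-5.5": 25.0, "Opus 4.8": 37.5, "Gemini 3.5 Flash": 66.7,
    "DeepSeek V4 Pro": 50.0, "Llama 4 Maverick": 66.7, "Grok 4.3": 50.0,
}
detection: Dict[str, float] = {
    "GPT-5.5": 100.0, "Opus 4.8": 55.6, "Gemini 3.5 Flash": 66.7,
    "DeepSeek V4 Pro": 100.0, "Llama 4 Maverick": 40.0, "Grok 4.3": 40.0,
}


def rank(rates: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort models by rate, highest first. Python's sort is stable, so
    tied models keep their original listing order (an arbitrary choice)."""
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)


best_deceivers = rank(deception)
best_detectors = rank(detection)

for title, ranking in (("Best deceivers", best_deceivers),
                       ("Best detectors", best_detectors)):
    print(title)
    for position, (model, rate) in enumerate(ranking, start=1):
        print(f"  {position}. {model}: {rate:.1f}%")
```

With only 62 games logged, several of these rates rest on a handful of games per model, so the ordering may shift as more games are recorded.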

Exhibit · Interrogation logs

Note: each turn is numbered. Public statements are on the record; private reasoning is sealed. Declassify it line by line, or reveal the whole log.

Evidence index

40 logged interrogations

Interrogation log