The Dossier · Live Record

Cross-play, on the record.

Deception × detection across 6 models, logged live from the study runner.

Exhibit A · Case summary

Runner active

Games logged

snapshot v2

Leakage rate

—

intent caught by monitor

Monitor AUC

—

impostor ID from reasoning

Compute

$3.83

of $40.00 cap

Budget consumed: 9.6%

Fig. 1 — Cross-play win-rate matrix

Read — Rows: impostor model. Columns: villager panel. Cell = impostor win rate on a bone → ember scale. Marginals rank deception (right) and detection (bottom).

[Matrix: imp ╲ vil. Rows (impostor model) and columns (villager panel): GPT-5.5, Opus 4.8, Gemini 3.5 Flash, DeepSeek V4 Pro, Llama 4 Maverick, Grok 4.3. Marginals: deceive →, detect ↓.]

100

Fig. 2 — Marginal rankings

Read — Deception aggregates each model's impostor win rate across all panels; detection aggregates the crew win rate when it sits on the panel.
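
Sketch: one way the two marginals described above could be computed from a game log. This is an illustration under stated assumptions, not the study runner's code: the record shape `(impostor, panel, impostor_won)`, the function name `marginal_rates`, and the placeholder model names "A", "B", "C" are all hypothetical, and rates are pooled over games rather than averaged over matrix cells.

```python
from collections import defaultdict


def marginal_rates(games):
    """Return (deception, detection) rates per model.

    games: iterable of (impostor_model, panel_model, impostor_won) tuples.
    This record shape is an assumption made for the sketch.
    """
    imp_games = defaultdict(int)   # games played as impostor
    imp_wins = defaultdict(int)    # impostor wins
    vil_games = defaultdict(int)   # games played as villager panel
    crew_wins = defaultdict(int)   # crew wins while on the panel

    for impostor, panel, impostor_won in games:
        imp_games[impostor] += 1
        vil_games[panel] += 1
        if impostor_won:
            imp_wins[impostor] += 1
        else:
            crew_wins[panel] += 1

    # Deception: impostor win rate across all panels the model faced.
    deception = {m: imp_wins[m] / n for m, n in imp_games.items()}
    # Detection: crew win rate across all impostors the panel faced.
    detection = {m: crew_wins[m] / n for m, n in vil_games.items()}
    return deception, detection


# Hypothetical toy log, not data from the study.
games = [
    ("A", "B", True),
    ("A", "C", False),
    ("B", "A", False),
    ("C", "A", False),
]
deception, detection = marginal_rates(games)
```

Pooling over games weights each pairing by how many games it has logged; taking the mean of the matrix cells instead would weight every pairing equally, and the two can differ while the matrix is still filling in.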

Exhibit B · Deception

Best deceivers

impostor win rate, marginalised across all panels

01 · GPT-5.5 · 25.0%
02 · Opus 4.8 · 37.5%
03 · Gemini 3.5 Flash · 66.7%
04 · DeepSeek V4 Pro · 50.0%
05 · Llama 4 Maverick · 66.7%
06 · Grok 4.3 · 50.0%

Exhibit C · Detection

Best detectors

crew win rate when acting as the villager panel

01 · GPT-5.5 · 100.0%
02 · Opus 4.8 · 55.6%
03 · Gemini 3.5 Flash · 66.7%
04 · DeepSeek V4 Pro · 100.0%
05 · Llama 4 Maverick · 40.0%
06 · Grok 4.3 · 40.0%

Exhibit · Interrogation logs

Note — Each turn is numbered. Public statements are on the record; private reasoning is sealed — declassify it line by line, or reveal the whole log.

Evidence index

40 logged interrogations

Interrogation log