modelweave.io / replay

Counterfactual replay — 30 days of traffic, any rule.

Real 2026 numbers, HumanEval pass@1: Opus 4 90.2, GPT-4o 87.0 (eval card), Gemini 1.5 Pro 84.1, Llama-3.1-405B 89.0.

[Figure: counterfactual scatter · cost × quality. x: input cost, $/1M (axis $0.0 to $20.0). y: HumanEval pass@1 (axis 60% to 100%). Points: claude-opus-4-7, claude-sonnet-4-6, gpt-4o, gpt-4o-mini, gemini-1.5-pro, llama-3.1-405b, mistral-large-2, haiku-4-5.]
baseline · 30d · 8 arms
arm                  base $/call   cf $/call   Δ$ · 30d
claude-opus-4-7      $0.0330       $0.0330     $0.00
claude-sonnet-4-6    $0.0066       $0.0066     $0.00
gpt-4o               $0.0047       $0.0047     $0.00
gpt-4o-mini          $0.0003       $0.0003     $0.00
gemini-1.5-pro       $0.0024       $0.0024     $0.00
llama-3.1-405b       $0.0035       $0.0035     $0.00
mistral-large-2      $0.0032       $0.0032     $0.00
haiku-4-5            $0.0022       $0.0022     $0.00
posterior mean · 30d: +$0.00 · 95% CI [0.00, 0.00]
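
One plausible reading of the Δ$ · 30d column is (cf $/call − base $/call) × calls over the window. The page does not state the formula or the call volumes, so the sketch below is an assumption, not modelweave's implementation: the `Arm` class, the `delta_30d` helper, and the `calls_30d` figure are hypothetical, and only the per-call costs come from the table. It does not attempt the posterior mean or the credible interval.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arm:
    name: str
    base_cost_per_call: float  # USD per call under the baseline
    cf_cost_per_call: float    # USD per call under the counterfactual rule


def delta_30d(arm: Arm, calls_30d: int) -> float:
    """Assumed definition: per-call cost change times calls in the window."""
    return (arm.cf_cost_per_call - arm.base_cost_per_call) * calls_30d


# Per-call costs copied from the baseline table; cf equals base for every arm.
ARMS = [
    Arm("claude-opus-4-7", 0.0330, 0.0330),
    Arm("claude-sonnet-4-6", 0.0066, 0.0066),
    Arm("gpt-4o", 0.0047, 0.0047),
    Arm("gpt-4o-mini", 0.0003, 0.0003),
    Arm("gemini-1.5-pro", 0.0024, 0.0024),
    Arm("llama-3.1-405b", 0.0035, 0.0035),
    Arm("mistral-large-2", 0.0032, 0.0032),
    Arm("haiku-4-5", 0.0022, 0.0022),
]

if __name__ == "__main__":
    calls_30d = 1_000  # hypothetical volume; the page does not report call counts
    for arm in ARMS:
        print(f"{arm.name:<20} {delta_30d(arm, calls_30d):+.2f}")
```

With no rule applied, base and counterfactual costs match, so every delta is zero regardless of the assumed volume, which agrees with the $0.00 column above.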