modelweave.io / replay · live
Counterfactual replay — 30 days of traffic, any rule.
Real 2026 numbers: Opus 4 HumanEval 90.2, GPT-4o 87.0 (eval card), Gemini 1.5 Pro 84.1, Llama-3.1-405B 89.0.
rules
counterfactual scatter · cost × quality
x · input cost $/1M, y · humaneval pass@1
baseline · 30d · 8 arms
arm                  base $/call   cf $/call   Δ$ · 30d
claude-opus-4-7      $0.0330       $0.0330     $0.00
claude-sonnet-4-6    $0.0066       $0.0066     $0.00
gpt-4o               $0.0047       $0.0047     $0.00
gpt-4o-mini          $0.0003       $0.0003     $0.00
gemini-1.5-pro       $0.0024       $0.0024     $0.00
llama-3.1-405b       $0.0035       $0.0035     $0.00
mistral-large-2      $0.0032       $0.0032     $0.00
haiku-4-5            $0.0022       $0.0022     $0.00
posterior mean · 30d: +$0.00 · 95% CI [0.00, 0.00]
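
The all-zero Δ$ column can be sanity-checked with a minimal sketch. It assumes the 30-day delta is (cf $/call − base $/call) × calls, which this page does not state. The per-call costs are copied from the table above; `ASSUMED_CALLS` and `delta_30d` are hypothetical names, and the call volume is a placeholder because the page shows no per-arm call counts.

```python
from decimal import Decimal

# Per-call costs copied from the table: arm -> (base $/call, cf $/call).
ARMS = {
    "claude-opus-4-7": ("0.0330", "0.0330"),
    "claude-sonnet-4-6": ("0.0066", "0.0066"),
    "gpt-4o": ("0.0047", "0.0047"),
    "gpt-4o-mini": ("0.0003", "0.0003"),
    "gemini-1.5-pro": ("0.0024", "0.0024"),
    "llama-3.1-405b": ("0.0035", "0.0035"),
    "mistral-large-2": ("0.0032", "0.0032"),
    "haiku-4-5": ("0.0022", "0.0022"),
}

# HYPOTHETICAL call volume per arm; not shown anywhere on the page.
ASSUMED_CALLS = 100_000


def delta_30d(base: str, cf: str, calls: int) -> Decimal:
    """ASSUMED formula: window delta = (cf - base) * calls."""
    return (Decimal(cf) - Decimal(base)) * calls


deltas = {arm: delta_30d(b, c, ASSUMED_CALLS) for arm, (b, c) in ARMS.items()}
total = sum(deltas.values(), Decimal(0))

for arm, d in deltas.items():
    print(f"{arm:<20} {d:+.2f}")
print(f"{'total':<20} {total:+.2f}")
```

Under that assumption, identical base and counterfactual costs give a zero delta for any call volume, which is consistent with a baseline view where no rule has been applied.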