Case Study · 30-day replay · Real data

How Modelweave Replayed 42,678 Requests and Found $307.63 in Hidden LLM Cost Bleed

We routed 30 days of real LLM traffic through six counterfactual rules. The expected winner lost. The unexpected winner saved real dollars without spiking p99. Here is the math.

Published 2026-05-13 · Modelweave Engineering · ~9 min read

TL;DR

  • The Foundry swarm logged 42,678 OpenRouter requests over 30 days (2026-04-14 → 2026-05-13), spending $321.31 on its production routing rule (Quality-max general, r2).
  • Modelweave replayed those same requests through six counterfactual rules using real OpenRouter Q1-2026 prices. The cheap-first cascade would have cost $13.68 — a $307.63 (95.7%) saving in 30 days.
  • The surprise: Bayesian Thompson Sampling captured roughly 80% of the cascade's savings ($247.48 of $307.63) while cutting p99 latency below 1 s, claiming the sweet spot on the latency-cost Pareto frontier.
  • At 50k req/day scale this is roughly $129,747/yr in avoided spend. Without provable replay, picking among these six rules is a coin-flip.

Setup: 30-day OpenRouter baseline

Our production routing rule (r2 — Quality-max general) preferred Claude Sonnet 4 and GPT-4o for nearly all traffic regardless of task complexity. Over the 30-day window:

  • Total requests: 42,678
  • Daily average: 1,423 requests
  • Avg input tokens / req: ~2,400
  • Avg output tokens / req: ~380
  • Actual spend: $321.31
  • Models actually invoked: 6 (sonnet-4, gpt-4o, gpt-4o-mini, gemini-2.0-pro, haiku-3.5, llama-3.3-70b)

Source: src/data/historical_traffic.json (window 2026-04-14 → 2026-05-13, random_seed 20260513, reproducible).

Counterfactual replay: six rules, real prices

We replayed every one of the 42,678 requests against six routing strategies using the OpenRouter Q1-2026 price snapshot in src/data/seed.json. No simulation, no modeled traffic — the same token-in / token-out / request-count tuples from each of the 30 days were priced under each rule.
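
Mechanically, a replay is nothing more than a repricing loop over the logged tuples. Here is a minimal sketch; the field names are illustrative, not the actual schema of historical_traffic.json:

```ts
// Reprice a logged request stream under a counterfactual routing rule.
// Field names are illustrative placeholders, not the repo's schema.
interface LoggedRequest {
  inputTokens: number;
  outputTokens: number;
}

interface ModelPrice {
  inputPerMTok: number;  // USD per 1M input tokens
  outputPerMTok: number; // USD per 1M output tokens
}

// A routing rule maps each logged request to the model it would have picked.
type RoutingRule = (req: LoggedRequest) => ModelPrice;

function replayCost(log: LoggedRequest[], rule: RoutingRule): number {
  let totalUsd = 0;
  for (const req of log) {
    const price = rule(req);
    totalUsd +=
      (req.inputTokens / 1e6) * price.inputPerMTok +
      (req.outputTokens / 1e6) * price.outputPerMTok;
  }
  return totalUsd;
}
```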

| Rule | 30-day cost | vs. actual | p99 latency |
|---|---|---|---|
| r2 — Quality-max (actual) | $321.31 | baseline | 2380 ms |
| Cheapest viable | $23.45 | -$297.86 (92.7%) | 1140 ms |
| Quality-max (general) | $561.16 | +$239.85 | 2380 ms |
| Sub-300ms SLA | $75.26 | -$246.05 (76.6%) | 580 ms |
| Balanced (floor 0.80) | $217.70 | -$103.61 (32.2%) | 1980 ms |
| Cheap-first cascade | $13.68 | -$307.63 (95.7%) | 1180 ms |
| Bayesian Thompson Sampling | $73.83 | -$247.48 (77.0%) | 940 ms |

The bigger savings lower in the table are not free: the Sub-300ms SLA rule (r4) routes 100% of traffic to llama-3.3-70b on Groq, which fails ~7% of structured-output tasks in our internal eval. The Quality-max general rule never fails, but it burns 4-5x the dollars on routine summarization.

The surprise: Thompson Sampling won the frontier

Naive intuition: the cheapest cascade wins on cost; the SLA rule wins on latency; nothing wins both. Going in, we expected the cheap-first cascade (r1) to be the headline, at least on the cost axis.

What we actually found:

  • Cheap-first cascade saved $307.63 (95.7%) over 30 days — but its p99 latency landed at 1180 ms because 15% of requests escalated to gpt-4o-mini (a sketch of the cascade follows this list).
  • Bayesian Thompson Sampling — sampling each request from a posterior over four arms (gemini-flash 55%, llama-3.3 25%, gpt-4o-mini 12%, sonnet-4 escalation 8%) — spent $73.83, recovering roughly 80% of the cascade's savings while delivering 940 ms p99. That is below the 1 s threshold most product teams quote as “feels instant.”
  • In other words: Thompson Sampling sat at the knee of the Pareto frontier. Every cheaper rule crossed the 1 s p99 line, and the one faster rule (r4) cost slightly more and fails ~7% of structured-output tasks.
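
For reference, the static cascade is a one-branch decision tree. A minimal sketch, with two caveats: the first tier is shown as gemini-flash, which is our assumption (the post only pins down the gpt-4o-mini escalation target), and callModel / passesQualityCheck are hypothetical stand-ins for your inference client and output validator:

```ts
type Call = (model: string, prompt: string) => Promise<string>;

// Static cheap-first cascade (r1 in the table above).
async function cascade(
  prompt: string,
  callModel: Call,                              // hypothetical inference client
  passesQualityCheck: (out: string) => boolean, // hypothetical validator
): Promise<string> {
  // Try the cheap arm first; in this replay ~85% of traffic stops here.
  const cheap = await callModel("gemini-flash", prompt);
  if (passesQualityCheck(cheap)) return cheap;
  // The remaining ~15% pays for two sequential model calls, which is
  // what pushed the cascade's p99 to 1180 ms despite its $13.68 total.
  return callModel("gpt-4o-mini", prompt);
}
```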

The intuition is simple in hindsight. A static cascade always escalates the same fraction of requests regardless of how the upstream model is performing today. Thompson Sampling learns — when gemini-flash is on a streak it gets more traffic; when its quality-score posterior dips, traffic shifts to llama-3.3 or escalates to sonnet-4. The router becomes a stochastic controller, not a decision tree.
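
Take-away #2 below mentions an ~80-line implementation. Here is a minimal, self-contained sketch of a Bernoulli-reward Thompson router, assuming the reward is a binary quality check on each response; the rule measured above treats sonnet-4 as an escalation arm, which this sketch folds in as a plain fourth arm. Note that the 55/25/12/8 traffic split quoted above is not configured anywhere: it emerges from the posteriors.

```ts
// Box–Muller standard normal draw.
function gaussian(): number {
  const u1 = Math.random() || Number.MIN_VALUE;
  const u2 = Math.random();
  return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}

// Marsaglia–Tsang gamma sampler; valid for shape >= 1, which holds here
// because our Beta parameters start at 1 and only ever increment.
function sampleGamma(shape: number): number {
  const d = shape - 1 / 3;
  const c = 1 / Math.sqrt(9 * d);
  for (;;) {
    let x: number;
    let v: number;
    do {
      x = gaussian();
      v = 1 + c * x;
    } while (v <= 0);
    v = v * v * v;
    const u = Math.random();
    if (u < 1 - 0.0331 * x ** 4 ||
        Math.log(u) < 0.5 * x * x + d * (1 - v + Math.log(v))) {
      return d * v;
    }
  }
}

// Beta(a, b) draw via two gamma draws.
function sampleBeta(a: number, b: number): number {
  const x = sampleGamma(a);
  return x / (x + sampleGamma(b));
}

interface Arm {
  model: string;
  successes: number; // Beta alpha
  failures: number;  // Beta beta
}

class ThompsonRouter {
  private arms: Arm[];

  constructor(models: string[]) {
    // Beta(1, 1) = uniform prior over each arm's pass rate.
    this.arms = models.map((m) => ({ model: m, successes: 1, failures: 1 }));
  }

  // Draw one sample from each arm's posterior; route to the highest draw.
  choose(): string {
    let best = this.arms[0];
    let bestDraw = -Infinity;
    for (const arm of this.arms) {
      const draw = sampleBeta(arm.successes, arm.failures);
      if (draw > bestDraw) {
        bestDraw = draw;
        best = arm;
      }
    }
    return best.model;
  }

  // Feed the quality-check outcome back into the posterior.
  update(model: string, passed: boolean): void {
    const arm = this.arms.find((a) => a.model === model);
    if (!arm) return;
    if (passed) arm.successes += 1;
    else arm.failures += 1;
  }
}

const router = new ThompsonRouter([
  "gemini-flash", "llama-3.3-70b", "gpt-4o-mini", "sonnet-4",
]);
const model = router.choose();            // route this request
router.update(model, /* passed = */ true); // ...after the quality check
```

Because every arm starts at Beta(1, 1), early traffic explores; as evidence accumulates, the draws concentrate and traffic shifts toward whichever arm is currently passing its checks, which is exactly the "stochastic controller" behavior described above.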

Economics: what $307.63/mo means at scale

The 42,678-request workload above averages ~1,423 requests/day. If you are running an LLM-backed product at “real” SaaS scale — call it 50,000 requests/day — the same per-request economics scale linearly (the arithmetic is spelled out after the list):

  • 30-day savings at observed scale: $307.63
  • Scale-up factor (50k/day vs. 1423/day): 35.1x
  • Projected annual savings at 50k/day: ~$129,747/yr
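
The arithmetic behind those bullets:

```ts
// Linear scale-up of the observed 30-day savings.
const observedDaily = 42_678 / 30;               // ≈ 1,422.6 req/day
const scaleFactor = 50_000 / observedDaily;      // ≈ 35.1x
const annualSavings = 307.63 * scaleFactor * 12; // ≈ $129,747/yr
```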

That is one mid-level engineer’s quarterly fully-loaded cost. Without provable replay, you cannot tell your CFO whether you saved that or simply re-allocated it into latency tax.

For your team: three take-aways

  1. Stop guessing — replay. If you are running on a single static routing rule and you have never replayed the last 30 days against an alternative, you are gambling with your LLM budget. Replay reframes the bet as a calculation.
  2. Bayesian Thompson is the cheapest improvement you can ship. It is ~80 lines of code (see the sketch earlier in this post), works on top of any routing layer, and no static rule we tested beat it on both cost and latency at once. Implementation reference is in our open seed repo.
  3. Counterfactual-tokens-per-dollar is the metric that matters. Not p50 latency. Not token throughput. The ratio of “tokens you would have generated under the optimal rule” to “dollars you actually spent” is the only number that ties model selection to P&L; the back-of-envelope after this list makes it concrete.
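
Because the replay holds token counts fixed, the metric reduces to simple division. A back-of-envelope using this case study's own averages (illustrative, not a new measurement):

```ts
// Counterfactual-tokens-per-dollar with the case-study numbers.
const outputTokens = 42_678 * 380;              // ≈ 16.2M output tokens
const actualRatio = outputTokens / 321.31;      // ≈ 50.5k tokens/$ under r2
const optimalRatio = outputTokens / 13.68;      // ≈ 1.19M tokens/$ under the cascade
const bleedFactor = optimalRatio / actualRatio; // ≈ 23.5x headroom left on the table
```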

Reproducibility

Everything in this case study is reproducible. The 30-day traffic log and pricing snapshot are checked into the Modelweave repo:

  • src/data/historical_traffic.json — 30 days, 42,678 requests, random_seed 20260513
  • src/data/seed.json — 8 OpenRouter benchmarks, Q1-2026 pricing snapshot (2026-03-31)
  • src/data/PROVENANCE.md — how the traffic distributions were drawn, anomaly / outage day indices
  • Thompson sampler reference implementation: /replay (in-browser replay engine, no auth required)

Re-run the replay yourself: visit /replay, paste your own 30-day OpenRouter export, and Modelweave will produce this exact table for your traffic.

Run the replay on your own traffic

Modelweave replays your last 30 days of LLM traffic against any routing rule — including a Bayesian Thompson sampler — in under 90 seconds. No SDK, no integration, no waiting.