A human-verified comparison across models on a 200+ question eval set, split between pure rule lookup (Recall) and rule-plus-computation (Calc). Production cost and latency from live chat included.
TABLE 01 — 200+ questions, human-verified · cost & timing from live production chat
| Model | Recall | Calc | Cost / q | Avg time | Date run | |
|---|---|---|---|---|---|---|
| gpt-5.4 (frontier baseline) | 99%* | 79%* | — | — | Mar 23 | VIEW → |
| gpt-5.4-mini | 93% | 61% | — | — | Mar 22 | VIEW → |
| gpt-5-mini | 99% | 86% | — | — | Mar 7 | VIEW → |
| deepseek-v3 (via OpenRouter) | 88% | 66% | — | — | May 21 | VIEW → |
| gpt-4.1-mini | 91% | 50% | — | — | Feb 18 | VIEW → |
| mercury-2 (via OpenRouter) | 82% | 59% | — | — | May 23 | VIEW → |
Note — all runs use the same retrieval configuration: ~20k token RAG input per question, 20 chunks retrieved per query from a vector store of the 700-page rulebook.
Caveat — deepseek-v3 and mercury-2 make two API calls per query: one to OpenAI's vector store for retrieval, then one to OpenRouter for inference. OpenAI-native rows do both in a single server-side call. The extra round-trip is baked into the Avg time column, so it overstates the inference latency of those two models; they are faster at inference than the column suggests.
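One way to make the two-call rows easier to compare is to time retrieval and inference separately, so the round-trip can be reported on its own. The sketch below is illustrative only: `retrieve` and `infer` are hypothetical callables standing in for the vector store lookup and the OpenRouter request, not the actual client code behind this table.

```python
import time
from typing import Callable, Dict, List, Tuple, TypeVar

T = TypeVar("T")


def timed(fn: Callable[[], T]) -> Tuple[T, float]:
    """Run fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def answer_two_call(
    retrieve: Callable[[str], List[str]],   # hypothetical: vector store lookup
    infer: Callable[[str, List[str]], str], # hypothetical: model call with retrieved chunks
    question: str,
) -> Dict[str, object]:
    """Answer a question with separate retrieval and inference calls, timing each."""
    chunks, retrieval_s = timed(lambda: retrieve(question))
    answer, inference_s = timed(lambda: infer(question, chunks))
    return {
        "answer": answer,
        "retrieval_s": retrieval_s,
        "inference_s": inference_s,
        "total_s": retrieval_s + inference_s,  # what a single end-to-end timer would record
    }
```

With this split, `inference_s` is the figure that is closer to comparable across providers, while `total_s` corresponds to an end-to-end number like the one the caveat describes. Single-call rows cannot be decomposed this way from the client side, so the comparison remains approximate.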
* Estimated: gpt-5.4 was run only on the 24 questions that gpt-5.4-mini failed; the remaining 190 were assumed correct.
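The arithmetic behind an estimate of this kind can be sketched as follows. The counts 190 and 24 come from the footnote; the number of rerun questions answered correctly is not stated in the source, so the value used here is purely hypothetical, and the function pools all questions rather than splitting Recall from Calc.

```python
def estimated_accuracy(assumed_correct: int, rerun_total: int, rerun_correct: int) -> float:
    """Accuracy when only a subset was rerun and the rest is assumed correct."""
    if assumed_correct < 0 or rerun_total < 0:
        raise ValueError("counts must be non-negative")
    if not 0 <= rerun_correct <= rerun_total:
        raise ValueError("rerun_correct must be between 0 and rerun_total")
    total = assumed_correct + rerun_total
    if total == 0:
        raise ValueError("no questions to score")
    return (assumed_correct + rerun_correct) / total


# 190 and 24 are from the footnote; 20 correct reruns is a made-up illustration.
example = estimated_accuracy(assumed_correct=190, rerun_total=24, rerun_correct=20)

# Range of the estimate: no rerun question correct vs. all of them correct.
lowest = estimated_accuracy(190, 24, 0)
highest = estimated_accuracy(190, 24, 24)
```

Because the 190 untested questions are counted as correct without being run, the result is best read as an optimistic estimate: any question the smaller model answered correctly but the larger one would miss is not captured.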
Daily averages from live chat · text-only baselines, image variants dashed