Evaluation Report — ASL Ruleschat

01

Findings

Three findings.

i

Recall is solved on this simple eval.

gpt-5.4 and gpt-5-mini both reach 99% on pure rule lookup in this simple eval (the gpt-5.4 figure is an estimate; see the note under Table 01). The gap shows up in latency: gpt-5-mini takes about twice as long to respond as gpt-5.4. Next step: a harder eval, currently being built, to find where they actually start to break.

ii

gpt-5.4 wins on accuracy and latency combined, at roughly 9× the cost.

gpt-5.4 is ahead on the combined axes of accuracy and latency, but it costs 6 cents per query versus 0.7 cents for gpt-5-mini, a ratio of roughly 8.6×.
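The cost gap can be checked with quick arithmetic. The sketch below uses only the two per-query prices quoted in this finding, which put the ratio at a little under 9×. The helper names and the 10,000-query volume are hypothetical, chosen for illustration, and are not figures from this report.

```python
# Per-query prices quoted in this finding, in US cents.
COST_CENTS = {"gpt-5.4": 6.0, "gpt-5-mini": 0.7}


def cost_ratio(expensive: str, cheap: str) -> float:
    """Return how many times more `expensive` costs per query than `cheap`."""
    return COST_CENTS[expensive] / COST_CENTS[cheap]


def projected_cost_dollars(model: str, queries: int) -> float:
    """Projected spend in dollars for a given number of queries."""
    return COST_CENTS[model] * queries / 100


ratio = cost_ratio("gpt-5.4", "gpt-5-mini")
print(f"gpt-5.4 costs {ratio:.1f}x as much per query as gpt-5-mini")

# Hypothetical volume, for illustration only.
HYPOTHETICAL_QUERIES = 10_000
for model in COST_CENTS:
    dollars = projected_cost_dollars(model, HYPOTHETICAL_QUERIES)
    print(f"{model}: ${dollars:,.2f} per {HYPOTHETICAL_QUERIES:,} queries")
```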

iii

All models struggle with calculations.

Calc accuracy ranges from 50% to 86% across the lineup. The top two on recall are also the top two here, gpt-5-mini (86%) and gpt-5.4 (79%, estimated), but neither clears 90%. Next step: failure-mode analysis on calc errors to see whether the wall is retrieval, reasoning, or arithmetic.

02

Model comparison

Accuracy, cost, latency
side by side.

TABLE 01 — 200+ questions, human-verified · cost & timing from live production chat

Model	Recall	Calc	Cost / q	Avg time	Date run
gpt-5.4 (frontier baseline)	99%^*	79%^*	—	—	Mar 23
gpt-5.4-mini	93%	61%	—	—	Mar 22
gpt-5-mini	99%	86%	—	—	Mar 7
deepseek-v3 (via OpenRouter)	88%	66%	—	—	May 21
gpt-4.1-mini	91%	50%	—	—	Feb 18
mercury-2 (via OpenRouter)	82%	59%	—	—	May 23

Note — all runs use the same retrieval configuration: ~20k token RAG input per question, 20 chunks retrieved per query from a vector store of the 700-page rulebook.
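The shared retrieval settings can be written down as a small config object, which also makes the implied chunk size explicit. This is a sketch only: the class and field names are hypothetical, and the per-chunk figure is an upper-bound average, since the ~20k input tokens may also include prompt and question text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalConfig:
    """Retrieval settings stated in the note above (names are illustrative)."""

    chunks_per_query: int = 20          # chunks retrieved per query
    approx_input_tokens: int = 20_000   # ~20k tokens of RAG input per question
    rulebook_pages: int = 700           # size of the indexed rulebook

    @property
    def approx_tokens_per_chunk(self) -> float:
        """Average tokens per chunk if all input tokens were retrieved text."""
        return self.approx_input_tokens / self.chunks_per_query


SHARED_CONFIG = RetrievalConfig()
print(f"~{SHARED_CONFIG.approx_tokens_per_chunk:.0f} tokens per chunk (upper bound)")
```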

Caveat: deepseek-v3 and mercury-2 make two API calls per query, one to OpenAI's vector store for retrieval and then one to OpenRouter for inference. OpenAI-native rows do both in a single server-side call. The extra round-trip is baked into the Avg time column, so that column overstates inference latency for those two models: they are faster at inference than the numbers suggest.
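One way to separate the two costs is to time retrieval and inference independently for the two-call models. The sketch below is an assumption about how that could be instrumented, not the report's actual harness: `fake_retrieve` and `fake_infer` are stand-ins for the vector-store and OpenRouter calls, and the sleep durations are arbitrary.

```python
import time
from typing import Callable, Dict, List, Tuple


def timed(fn: Callable[[], object]) -> Tuple[object, float]:
    """Run fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def two_call_query(
    retrieve: Callable[[], List[str]],
    infer: Callable[[List[str]], str],
) -> Dict[str, object]:
    """Time the retrieval call and the inference call separately."""
    chunks, retrieval_s = timed(retrieve)
    answer, inference_s = timed(lambda: infer(chunks))
    return {
        "answer": answer,
        "retrieval_s": retrieval_s,
        "inference_s": inference_s,
        # This sum is what a single end-to-end timer would report.
        "total_s": retrieval_s + inference_s,
    }


# Hypothetical stand-ins for the real network calls.
def fake_retrieve() -> List[str]:
    time.sleep(0.02)
    return ["chunk"] * 20


def fake_infer(chunks: List[str]) -> str:
    time.sleep(0.03)
    return f"answer from {len(chunks)} chunks"


timing = two_call_query(fake_retrieve, fake_infer)
print({k: v for k, v in timing.items() if k != "answer"})
```

Reporting `inference_s` alongside `total_s` would let the two-call rows be compared with the OpenAI-native rows on closer to equal terms.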

^* Estimated: gpt-5.4 was run only on the 24 questions that gpt-5.4-mini failed; the remaining 190 were assumed correct.
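The estimate in this footnote combines an assumed-correct pool with a small re-run. The sketch below shows the arithmetic and the range it can span. It treats all 214 questions as one pool because the report does not give the recall/calc split of the 24 re-run questions, and the function name is hypothetical.

```python
def estimated_accuracy(assumed_correct: int, rerun_total: int, rerun_correct: int) -> float:
    """Accuracy when `assumed_correct` questions are credited without being run."""
    if not 0 <= rerun_correct <= rerun_total:
        raise ValueError("rerun_correct must be between 0 and rerun_total")
    total = assumed_correct + rerun_total
    return (assumed_correct + rerun_correct) / total


ASSUMED_CORRECT = 190  # questions credited to gpt-5.4 without a run
RERUN_TOTAL = 24       # questions gpt-5.4-mini failed, re-run on gpt-5.4

# Bounds on the estimate: every re-run question wrong vs. every one right.
lower = estimated_accuracy(ASSUMED_CORRECT, RERUN_TOTAL, 0)
upper = estimated_accuracy(ASSUMED_CORRECT, RERUN_TOTAL, RERUN_TOTAL)
print(f"estimate lies between {lower:.1%} and {upper:.1%}")
```

The estimate is only as good as the assumption that gpt-5.4 answers correctly everything gpt-5.4-mini answered correctly; a full run would remove that dependency.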

03

Production usage

Cost and latency,
in flight.

Daily averages from live chat · text-only baselines, image variants dashed

[Charts not reproduced: Cost / question ($); Response time (s)]
