VOL. 01 · EVALUATION REPORT · MAY 2026

Can LLMs answer ASL rules questions
for less than 1 cent per query?

A human-verified comparison across models on a 200+ question eval set, split between pure rule lookup (Recall) and rule-plus-computation (Calc). Production cost and latency from live chat included.

01
Findings

Three findings.

  1. Recall is solved on this simple eval.
    gpt-5.4 and gpt-5-mini both hit 99% on pure rule lookup. The gap shows up in latency: gpt-5-mini takes about 2× as long as gpt-5.4. Next step: a harder eval, currently being built, to find where the models actually start to break.
  2. gpt-5.4 wins on accuracy and latency, at roughly 9× the cost.
    On the combined axes of accuracy and latency, gpt-5.4 is ahead, but it costs 6 cents per query vs. 0.7 cents for gpt-5-mini.
  3. All models struggle with calculations.
    Calc accuracy ranges from 50% to 86% across the lineup. The top two on recall, gpt-5.4 and gpt-5-mini, are also the top two here, but neither clears 90%. Next step: failure-mode analysis on calc errors to see whether the wall is retrieval, reasoning, or arithmetic.
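The cost gap in the second finding can be checked with a few lines of arithmetic. The sketch below uses only the two per-query figures quoted above (6 cents and 0.7 cents) and the 1-cent budget from the report title; the names `COST_PER_QUERY`, `BUDGET`, `cost_ratio`, and `within_budget` are illustrative, not part of the eval tooling.

```python
# Per-query costs in dollars, as quoted in the findings above.
COST_PER_QUERY = {"gpt-5.4": 0.06, "gpt-5-mini": 0.007}

# Budget implied by the report title: less than 1 cent per query.
BUDGET = 0.01


def cost_ratio(expensive: str, cheap: str) -> float:
    """Return how many times more `expensive` costs per query than `cheap`."""
    return COST_PER_QUERY[expensive] / COST_PER_QUERY[cheap]


def within_budget(model: str) -> bool:
    """True if the model's per-query cost is under the budget."""
    return COST_PER_QUERY[model] < BUDGET


if __name__ == "__main__":
    print(f"gpt-5.4 vs gpt-5-mini: {cost_ratio('gpt-5.4', 'gpt-5-mini'):.1f}x")  # 8.6x
    for model in COST_PER_QUERY:
        print(model, "within budget:", within_budget(model))
```

On the quoted figures, only gpt-5-mini lands under the 1-cent target, and the ratio works out to about 8.6×.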
02
Model comparison

Accuracy, cost, latency
side by side.

TABLE 01 — 200+ questions, human-verified · cost & timing from live production chat

| Model                         | Recall | Calc | Cost / q | Avg time | Date run |
|-------------------------------|--------|------|----------|----------|----------|
| gpt-5.4 (frontier baseline)   | 99%*   | 79%* |          |          | Mar 23   |
| gpt-5.4-mini                  | 93%    | 61%  |          |          | Mar 22   |
| gpt-5-mini                    | 99%    | 86%  |          |          | Mar 7    |
| deepseek-v3 (via OpenRouter)  | 88%    | 66%  |          |          | May 21   |
| gpt-4.1-mini                  | 91%    | 50%  |          |          | Feb 18   |
| mercury-2 (via OpenRouter)    | 82%    | 59%  |          |          | May 23   |

Note — all runs use the same retrieval configuration: ~20k token RAG input per question, 20 chunks retrieved per query from a vector store of the 700-page rulebook.
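To make the retrieval configuration concrete, here is a minimal sketch of top-k chunk selection by cosine similarity. It is an illustrative stand-in only: the runs described here use OpenAI's vector store, whose internals are not described in this report. The values `TOP_K = 20` and `TOKEN_BUDGET = 20_000` come from the note above; the function names and the toy vectors are assumptions.

```python
import math
from typing import List, Sequence, Tuple

TOP_K = 20             # chunks retrieved per query (from the note above)
TOKEN_BUDGET = 20_000  # approximate RAG input tokens per question (from the note above)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_chunks(
    query_vec: Sequence[float],
    chunks: List[Tuple[str, Sequence[float]]],
    k: int = TOP_K,
) -> List[str]:
    """Return the texts of the k chunks most similar to the query vector."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def tokens_per_chunk(budget: int = TOKEN_BUDGET, k: int = TOP_K) -> int:
    """Average token allowance per chunk implied by the budget and k."""
    return budget // k


if __name__ == "__main__":
    # Toy 2-d embeddings, purely hypothetical.
    toy = [("rule A", [1.0, 0.0]), ("rule B", [0.0, 1.0]), ("rule C", [0.9, 0.1])]
    print(top_k_chunks([1.0, 0.0], toy, k=2))
    print(tokens_per_chunk())
```

With 20 chunks filling roughly 20k tokens, each chunk averages about 1,000 tokens.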

Caveat: deepseek-v3 and mercury-2 make two API calls per query: one to OpenAI's vector store for retrieval, then one to OpenRouter for inference. OpenAI-native rows do both in a single server-side call. The Avg time column therefore includes the extra round-trip for those two models and understates how fast they are at inference alone.
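One way to separate the two contributions is to time the retrieval call and the inference call independently. The sketch below assumes the two steps are available as plain callables; `retrieve` and `infer` are hypothetical stand-ins for the vector-store and OpenRouter requests, not real client APIs, and the stubs only simulate latency with `time.sleep`.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Timing:
    retrieval_s: float  # wall-clock seconds spent in the retrieval call
    inference_s: float  # wall-clock seconds spent in the inference call

    @property
    def total_s(self) -> float:
        """End-to-end time for the two sequential calls."""
        return self.retrieval_s + self.inference_s


def timed_two_call_query(
    question: str,
    retrieve: Callable[[str], List[str]],
    infer: Callable[[str, List[str]], str],
) -> Tuple[str, Timing]:
    """Run retrieval then inference, timing each call separately."""
    t0 = time.perf_counter()
    chunks = retrieve(question)
    t1 = time.perf_counter()
    answer = infer(question, chunks)
    t2 = time.perf_counter()
    return answer, Timing(retrieval_s=t1 - t0, inference_s=t2 - t1)


if __name__ == "__main__":
    # Stubs that simulate network latency; replace with real calls.
    def fake_retrieve(q: str) -> List[str]:
        time.sleep(0.05)
        return ["chunk 1", "chunk 2"]

    def fake_infer(q: str, chunks: List[str]) -> str:
        time.sleep(0.10)
        return "stub answer"

    answer, timing = timed_two_call_query("example question", fake_retrieve, fake_infer)
    print(f"retrieval {timing.retrieval_s:.2f}s, inference {timing.inference_s:.2f}s, "
          f"total {timing.total_s:.2f}s")
```

Reporting `inference_s` alongside `total_s` would let the two-call rows be compared with the single-call rows on a like-for-like basis.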

* Estimated: gpt-5.4 was run only on the 24 questions that gpt-5.4-mini failed; the remaining 190 were assumed correct.
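The estimate in this footnote can be written out as simple arithmetic. The sketch below works on the pooled totals given above (190 assumed correct, 24 re-run); the report does not state how the 24 re-run questions split between Recall and Calc, so the per-split figures in the table cannot be reproduced from it. The function name and its parameters are illustrative.

```python
def estimated_accuracy(assumed_correct: int, rerun_total: int, rerun_correct: int) -> float:
    """Accuracy when only `rerun_total` questions were actually run
    and the other `assumed_correct` questions are counted as correct."""
    if not 0 <= rerun_correct <= rerun_total:
        raise ValueError("rerun_correct must be between 0 and rerun_total")
    total = assumed_correct + rerun_total
    return (assumed_correct + rerun_correct) / total


ASSUMED_CORRECT = 190  # from the footnote: not re-run, assumed correct
RERUN_TOTAL = 24       # from the footnote: questions gpt-5.4-mini failed

if __name__ == "__main__":
    # Bounds on pooled accuracy under the footnote's assumption.
    lower = estimated_accuracy(ASSUMED_CORRECT, RERUN_TOTAL, 0)
    upper = estimated_accuracy(ASSUMED_CORRECT, RERUN_TOTAL, RERUN_TOTAL)
    print(f"pooled accuracy between {lower:.1%} and {upper:.1%}")  # 88.8% and 100.0%
```

The bounds hold only if the assumption is right: any question that gpt-5.4-mini answered correctly but gpt-5.4 would miss is not counted.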

03
Production usage

Cost and latency,
in flight.

[Figure: daily averages from live chat · text-only baselines, image variants dashed. Panels: Cost / question ($) and Response time (s).]