Evaluation Results

Accuracy measured across 213 basic questions (185 pure rule lookups plus 28 calculation questions), human-verified against expected answers.

99% — Recall accuracy · gpt-5-mini · 183 / 185 questions
86% — Calc accuracy · gpt-5-mini (vs. 50% for gpt-4.1-mini)
Too Slow — gpt-5-mini latency · ~20k token RAG context per query

gpt-5-mini handles basic questions well — but it's too slow. 99% recall and 86% calc accuracy on basic, single-rule questions are strong results, but the ~20k token RAG context injected per query pushes response times into an unacceptable range for a chat interface.

gpt-4.1-mini is faster and solid on recall, but falls apart on calc. Recall accuracy is competitive, and latency is much more acceptable. The problem is calculation: at roughly 50% calc accuracy, it frequently gets the arithmetic wrong even when it retrieves the right rules — making it unreliable for questions that require any computation.

Both models are mini-class, chosen for cost. These results cover basic questions only — harder scenarios with edge-case interactions or chained reasoning are not yet in this eval set. All results confirmed by human review.

DATE       | MODEL        | EVAL               | JUDGE        | QUESTION TYPE | PASS      | FAIL     | NEEDS REVIEW | TOTAL
2026-03-07 | gpt-5-mini   | asl-evals-combined | Human Review | Recall        | 183 (99%) | 2 (1%)   | 0 (0%)       | 185
2026-03-07 | gpt-5-mini   | asl-evals-combined | Human Review | Calc          | 24 (86%)  | 4 (14%)  | 0 (0%)       | 28
2026-02-18 | gpt-4.1-mini | asl-evals-combined | Human Review | Recall        | 170 (91%) | 16 (9%)  | 0 (0%)       | 186
2026-02-18 | gpt-4.1-mini | asl-evals-combined | Human Review | Calc          | 14 (50%)  | 14 (50%) | 0 (0%)       | 28
Notes: Calc = calculation questions (recall + computation) · Recall = pure rule lookup · Only Human Review results are shown
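The headline percentages follow directly from the pass/total counts in the table. A minimal sketch of that arithmetic (counts taken from the table above; the `accuracy` helper is illustrative, not part of any eval harness):

```python
# Recompute the headline accuracy figures from raw pass/total counts.
results = {
    ("gpt-5-mini", "Recall"):   (183, 185),
    ("gpt-5-mini", "Calc"):     (24, 28),
    ("gpt-4.1-mini", "Recall"): (170, 186),
    ("gpt-4.1-mini", "Calc"):   (14, 28),
}

def accuracy(passed: int, total: int) -> int:
    """Pass rate as a whole-number percentage."""
    return round(100 * passed / total)

for (model, qtype), (passed, total) in results.items():
    print(f"{model:12s} {qtype:6s} {accuracy(passed, total)}%")
```

Running this reproduces the table's rounded figures: 99%, 86%, 91%, and 50%.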

Production Usage

Token consumption, cost, and response time from live chat interactions.

Charts: Tokens per Question · Cost per Question · Response Time per Question

Data from production chat interactions