Evaluation Results
Accuracy was measured across 213 basic rule-lookup questions, with each result human-verified against the expected answer.
gpt-5-mini handles basic questions well but is too slow. 99% recall accuracy and 86% calc accuracy on basic, single-rule questions is strong; however, the ~20k-token RAG context injected per query pushes response times into an unacceptable range for a chat interface.
gpt-4.1-mini is faster and solid on recall, but falls apart on calc. Recall accuracy is competitive and latency is much more acceptable. The problem is calculation: at roughly 50% calc accuracy, it frequently gets the arithmetic wrong even when it retrieves the right rules, making it unreliable for any question that requires computation.
Both models are mini-class, chosen for cost. These results cover basic questions only — harder scenarios with edge-case interactions or chained reasoning are not yet in this eval set. All results confirmed by human review.
| DATE | MODEL | EVAL | JUDGE | QUESTION TYPE | PASS | FAIL | NEEDS REVIEW | TOTAL |
|---|---|---|---|---|---|---|---|---|
| 2026-03-07 | gpt-5-mini | asl-evals-combined | Human Review | Recall | 183 (99%) | 2 (1%) | 0 (0%) | 185 |
| 2026-03-07 | gpt-5-mini | asl-evals-combined | Human Review | Calc | 24 (86%) | 4 (14%) | 0 (0%) | 28 |
| 2026-02-18 | gpt-4.1-mini | asl-evals-combined | Human Review | Recall | 170 (91%) | 16 (9%) | 0 (0%) | 186 |
| 2026-02-18 | gpt-4.1-mini | asl-evals-combined | Human Review | Calc | 14 (50%) | 14 (50%) | 0 (0%) | 28 |
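As a sanity check, the pass percentages in the table can be recomputed directly from the raw pass/total counts. This is a minimal Python sketch using only the counts shown above:

```python
# Recompute pass rates from the raw counts in the eval table above.
# (model, question type) -> (passed, total)
results = {
    ("gpt-5-mini", "Recall"): (183, 185),
    ("gpt-5-mini", "Calc"): (24, 28),
    ("gpt-4.1-mini", "Recall"): (170, 186),
    ("gpt-4.1-mini", "Calc"): (14, 28),
}

for (model, qtype), (passed, total) in results.items():
    rate = round(100 * passed / total)  # percentage, rounded as in the table
    print(f"{model:12s} {qtype:6s} {passed}/{total} = {rate}%")
```

Running this reproduces the table's rounded figures (99%, 86%, 91%, 50%).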
Production Usage
Token consumption, cost, and response time from live chat interactions.
[Charts: Tokens per Question, Cost per Question, Response Time per Question. Data from production chat interactions.]