Softgen

AI evals explained — how to stop shipping demos

9 min read · Updated 19 June 2026

Key takeaways

  • An eval is a concrete input + known-good output (or rubric) that you run automatically every time you change model, prompt or tools.
  • Without evals you have no way to know whether a tweak made things better or worse, which is a common reason AI features regress in production.
  • Good evals cover accuracy, safety, cost, latency and task completion rate. Start with 50–200 golden examples for your key workflows.
  • Production AI work at Softgen always includes an eval plan and harness; builds from £18,000 bake this in so you can sleep at night.

The short answer

Evals turn "the AI seems smarter" into a measurable number you can track over time. They are the difference between a charming prototype and a feature you can confidently put in front of customers or internal teams.

Why every AI project needs them from day one

LLM output is non-deterministic, and it is sensitive to configuration: change the model, add a tool, or tweak temperature or retrieval, and behaviour shifts. Without a test suite you are flying blind, and regressions surface as complaints like "it used to be great and now it hallucinates on edge cases".
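A minimal sketch of why per-case results matter, assuming each eval case has an ID and a pass/fail outcome. The case names and results here are hypothetical. Two runs can have the same overall pass rate while one of them has quietly broken an edge case:

```python
from typing import Dict, List


def pass_rate(results: Dict[str, bool]) -> float:
    """Share of cases that passed (0.0 for an empty run)."""
    return sum(results.values()) / len(results) if results else 0.0


def find_regressions(baseline: Dict[str, bool], candidate: Dict[str, bool]) -> List[str]:
    """IDs of cases that passed under the baseline but fail (or are missing) now."""
    return sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and not candidate.get(case_id, False)
    )


# Hypothetical results from running the same suite before and after a prompt change.
baseline = {"refund-01": True, "refund-02": True, "edge-empty-order": True, "tone-01": False}
candidate = {"refund-01": True, "refund-02": True, "edge-empty-order": False, "tone-01": True}

print(pass_rate(baseline), pass_rate(candidate))  # 0.75 0.75
print(find_regressions(baseline, candidate))      # ['edge-empty-order']
```

The headline number did not move, but the diff shows exactly which case regressed.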

What a production eval suite actually looks like

  • Golden dataset: real (anonymised) inputs from your domain + the output you want.
  • Automated scoring: exact match, semantic similarity, LLM-as-judge with a tight rubric, or task success (did the tool calls produce the right side effect?).
  • Metrics across dimensions: factual accuracy, tone/format adherence, refusal rate on bad inputs, cost per run, p95 latency, escalation rate.
  • Regression gates in CI or pre-deploy.
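The pieces above can be sketched in a few dozen lines. This is an illustration under stated assumptions, not Softgen's harness: `fake_model` stands in for a real model call, the three golden examples are invented, scoring is a normalised exact match, and the gate thresholds are arbitrary.

```python
import math
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class GoldenExample:
    case_id: str
    prompt: str
    expected: str


@dataclass
class EvalReport:
    accuracy: float
    p95_latency_s: float
    failures: List[str]


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so trivial differences don't fail a case."""
    return " ".join(text.lower().split())


def run_evals(model: Callable[[str], str], dataset: List[GoldenExample]) -> EvalReport:
    if not dataset:
        raise ValueError("dataset is empty")
    latencies, failures = [], []
    for ex in dataset:
        start = time.perf_counter()
        output = model(ex.prompt)
        latencies.append(time.perf_counter() - start)
        if normalise(output) != normalise(ex.expected):
            failures.append(ex.case_id)
    # Nearest-rank 95th percentile of the observed latencies.
    idx = max(0, math.ceil(0.95 * len(latencies)) - 1)
    p95 = sorted(latencies)[idx]
    accuracy = 1 - len(failures) / len(dataset)
    return EvalReport(accuracy, p95, failures)


def gate(report: EvalReport, min_accuracy: float = 0.9, max_p95_latency_s: float = 2.0) -> bool:
    """Regression gate: True only if every threshold is met."""
    return report.accuracy >= min_accuracy and report.p95_latency_s <= max_p95_latency_s


# Hypothetical stand-in for the model under test.
def fake_model(prompt: str) -> str:
    answers = {"What is the refund window?": "30 days", "Do you ship to Norway?": "Yes"}
    return answers.get(prompt, "I don't know")


dataset = [
    GoldenExample("refund-01", "What is the refund window?", "30 days"),
    GoldenExample("ship-01", "Do you ship to Norway?", "yes"),
    GoldenExample("hours-01", "When is support open?", "9am to 5pm on weekdays"),
]

report = run_evals(fake_model, dataset)
print(report.failures)                      # ['hours-01']
print("PASS" if gate(report) else "FAIL")   # FAIL
```

In CI, the gate result would typically decide the exit code of the job, so a failing run blocks the deploy. Exact match is the crudest scorer; semantic similarity or a rubric-based judge would slot into the same loop in place of the comparison.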

Guardrails are the runtime version

Evals prove quality before you ship. Guardrails catch problems at runtime: input filters, output validators, PII redaction, confidence thresholds, forced hand-off to humans. Both are required.
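As a rough sketch of the runtime side, the snippet below combines three of the guardrails named above: an output validator, PII redaction and a confidence threshold that forces a hand-off. The names, the 0.7 threshold and the email-only regex are assumptions for illustration. Where the `confidence` score comes from (a classifier, a judge model, retrieval scores) is left open.

```python
import re
from dataclasses import dataclass

# Deliberately simple pattern: real PII redaction covers far more than emails.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

HANDOFF_MESSAGE = "I'm passing this to a colleague who can help."


@dataclass
class GuardedReply:
    text: str
    handed_off: bool


def redact_pii(text: str) -> str:
    """Replace email addresses with a placeholder."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)


def apply_guardrails(draft: str, confidence: float, threshold: float = 0.7) -> GuardedReply:
    """Validate a drafted reply before it reaches the user."""
    # Output validator + confidence threshold: empty or low-confidence drafts go to a human.
    if not draft.strip() or confidence < threshold:
        return GuardedReply(HANDOFF_MESSAGE, handed_off=True)
    return GuardedReply(redact_pii(draft), handed_off=False)


ok = apply_guardrails("Contact jane.doe@example.com for details", confidence=0.9)
print(ok.text)        # Contact [REDACTED_EMAIL] for details

unsure = apply_guardrails("The warranty is probably 5 years", confidence=0.4)
print(unsure.handed_off)  # True
```

The same golden dataset can exercise these guardrails too: a case with PII in the draft should come back redacted, and a low-confidence case should hand off.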

How we build and use evals at Softgen

During discovery we define the success criteria with you. We build the first 50–200 examples, wire an eval runner, and use it to validate every iteration. When we hand over, you get the harness so future changes don't silently degrade the feature. This is part of every AI build from £18,000.

If your current AI feature feels unreliable or you're about to add one, the first question is always: what's the eval plan? Send a brief and we'll map it with you.

FAQ

Quick answers.

How many evals do I need to start?

50–100 high-quality examples for your core workflows are usually enough to begin. Quality and coverage beat volume. We help you prioritise the ones that catch the expensive or embarrassing failures.

Can evals be automated completely?

Most of the scoring can. Task completion and side-effect checks are fully automatic. Factual or subjective checks often use a strong LLM judge with a strict rubric, plus spot checks by a human reviewer.
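To make the two ends of that spectrum concrete, here is a small sketch. `TicketStore` and `fake_agent` are hypothetical stand-ins for a real system and a real agent. The first function checks a side effect automatically, and the second draws a reproducible sample of judged cases for a person to review. The LLM judge itself is not shown, since it depends on the provider you use.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TicketStore:
    """In-memory stand-in for the system a tool call would modify."""
    tickets: List[Dict[str, str]] = field(default_factory=list)

    def create_ticket(self, customer: str, priority: str) -> None:
        self.tickets.append({"customer": customer, "priority": priority})


def task_succeeded(store: TicketStore, customer: str, priority: str) -> bool:
    """Side-effect check: exactly one ticket with the expected fields exists."""
    expected = {"customer": customer, "priority": priority}
    return sum(1 for t in store.tickets if t == expected) == 1


def sample_for_human_review(case_ids: List[str], fraction: float, seed: int = 0) -> List[str]:
    """Pick a reproducible random subset of judged cases for a human spot check."""
    if not case_ids:
        return []
    k = min(len(case_ids), max(1, round(len(case_ids) * fraction)))
    return sorted(random.Random(seed).sample(case_ids, k))


# Hypothetical agent under test: a real one would choose the tool call itself.
def fake_agent(request: str, store: TicketStore) -> None:
    if "urgent" in request.lower():
        store.create_ticket(customer="acme", priority="high")


store = TicketStore()
fake_agent("URGENT: checkout is down for Acme", store)
print(task_succeeded(store, "acme", "high"))  # True

review = sample_for_human_review([f"case-{i}" for i in range(20)], fraction=0.1)
print(len(review))  # 2
```

Checking the resulting state, rather than the wording of the reply, is what makes task-completion scoring fully automatic.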

Do evals slow us down?

They speed you up long-term. You can change models or prompts aggressively and know within minutes whether quality moved. Without them every change is a gamble.

Who owns evals after launch?

You should. We build the harness and seed the dataset during the engagement, then hand it over so your team can extend it as the product and usage evolve.

