LLM Evaluation in Production: Testing Pipelines (2026)

Written by Virtido | Feb 26, 2026 10:00:00 AM

You built an LLM application that works brilliantly in your demos. The product team is excited. Leadership wants it in production by next quarter. But then comes the question that stops every AI project cold: how do you know it actually works?

Traditional software testing gives you binary answers — the function returns the expected output or it doesn't. LLM evaluation is fundamentally different. Your model might give five different valid answers to the same question. It might be factually correct but miss the user's intent. It might work perfectly on your test cases and fail spectacularly on real user queries.

TL;DR: LLM evaluation requires different thinking than traditional software testing. You can't just check for exact matches — you need to measure factuality, relevance, coherence, safety, and task completion across diverse test cases. Build evaluation datasets from real user queries, use automated evaluation techniques (LLM-as-Judge, embedding similarity, traditional NLP metrics), integrate testing into CI/CD, and continuously monitor production quality. This guide covers the complete pipeline from dataset creation to production monitoring.

Why LLM Evaluation is Different

Software engineers approach LLM evaluation expecting it to work like unit testing. It doesn't. Understanding why helps you build better evaluation systems.

Non-Deterministic Outputs

Even with temperature set to zero, LLM outputs can vary between API calls due to infrastructure changes, model updates, or floating-point precision differences. The same prompt might generate "The capital of France is Paris" today and "Paris is the capital of France" tomorrow. Both are correct, but a naive string comparison fails.

Multiple Valid Answers

Ask an LLM to summarize a document and you'll get dozens of valid summaries. Ask it to write code for a function and there are infinite correct implementations. Your evaluation system must recognize that different doesn't mean wrong.

Evaluation Requires Judgment

Determining if an LLM response is "good" often requires human-level understanding. Is the answer factually correct? Does it address the user's actual intent? Is the tone appropriate? These judgments don't reduce to simple assertions.

Failures are Subtle

LLMs rarely crash — they fail gracefully by generating plausible-sounding nonsense. A hallucinated fact looks identical to a correct one. A subtly wrong code suggestion compiles and runs but produces incorrect results. Your evaluation must catch these silent failures.

Evaluation Dimensions

Production LLM systems need evaluation across multiple dimensions. A response can score well on factuality but fail on relevance, or be highly relevant but poorly structured.

Dimension	What It Measures	How to Evaluate
Factuality	Are claims accurate and verifiable?	Compare against source documents, fact-checking, retrieval verification
Relevance	Does the response address the query?	Semantic similarity to ideal answer, LLM-as-Judge scoring
Coherence	Is the response well-structured and logical?	LLM-as-Judge, readability metrics, structure validation
Safety	Does it avoid harmful or inappropriate content?	Safety classifiers, toxicity detection, policy compliance checks
Task Completion	Did it accomplish what the user wanted?	Execution testing (for code), format validation, end-to-end checks
Groundedness	Are claims supported by provided context?	Citation verification, context overlap analysis
Latency	Response time acceptable for use case?	Direct measurement, percentile tracking
Cost	Token usage within budget constraints?	Token counting, cost tracking per query type

Not every dimension matters equally for every application. A customer support bot prioritizes factuality and safety. A creative writing assistant weights coherence and relevance higher. Define which dimensions matter most for your use case before building evaluation infrastructure.

Building Evaluation Datasets

Your evaluation is only as good as your test cases. Most teams underinvest in dataset creation and pay for it with unreliable quality signals.

Collecting Real Queries

Start with actual user queries, not synthetic examples you invented. Production queries reveal edge cases, phrasing variations, and use patterns you would never anticipate. Sources include:

Production logs — Sample from actual queries (anonymized if needed)
Support tickets — Questions customers actually ask
Search logs — What users search for in your product
Failure cases — Queries that triggered complaints or escalations

Creating Golden Answers

Each test query needs a reference answer — the "golden" response against which you compare model outputs. Golden answers should be:

Expert-reviewed — Written or validated by domain experts
Comprehensive — Cover all key points the response should include
Annotated — Include metadata about required facts, format expectations, citations needed

Evaluation Dataset Format

Structure your evaluation dataset for automated testing:

{
  "test_cases": [
    {
      "id": "tc-001",
      "query": "What are the refund policies for annual subscriptions?",
      "context": "Retrieved documents would go here for RAG evaluation",
      "golden_answer": "Annual subscriptions can be refunded within 30 days of purchase...",
      "required_facts": [
        "30-day refund window",
        "Pro-rated refunds after 30 days",
        "No refunds for used credits"
      ],
      "evaluation_criteria": {
        "must_mention": ["30 days", "refund"],
        "must_not_mention": ["guaranteed", "unlimited"],
        "expected_format": "paragraph",
        "max_length": 200
      },
      "metadata": {
        "category": "billing",
        "difficulty": "easy",
        "source": "production_logs"
      }
    }
  ]
}

Dataset Maintenance

Evaluation datasets aren't static. Plan for ongoing maintenance:

Regular updates — Add new test cases from production failures
Golden answer refresh — Update when underlying information changes
Balance monitoring — Ensure coverage across categories and difficulty levels
Versioning — Track dataset changes alongside model changes

A healthy evaluation dataset grows continuously. Start with 50-100 high-quality test cases, then expand based on production learnings.

Automated Evaluation Techniques

Manual review doesn't scale. You need automated evaluation that runs on every change.

LLM-as-Judge

Use a capable LLM to evaluate outputs from your production model. This approach captures nuanced quality judgments that rules can't express.

from openai import OpenAI

client = OpenAI()

def evaluate_response(query: str, response: str, golden_answer: str) -> dict:
    """Use GPT-4 to evaluate response quality."""

    evaluation_prompt = f"""You are evaluating an AI assistant's response.

Query: {query}

AI Response: {response}

Reference Answer: {golden_answer}

Evaluate the response on these criteria (score 1-5):
1. Factuality: Are all claims accurate?
2. Relevance: Does it address the query?
3. Completeness: Does it cover key points from the reference?
4. Clarity: Is it well-written and easy to understand?

Respond in JSON format:
,
    "relevance": ,
    "completeness": ,
    "clarity": ,
    "overall_pass": true/false,
    "issues": ["list of specific problems if any"]
}}"""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": evaluation_prompt}],
        response_format={"type": "json_object"}
    )

    return json.loads(response.choices[0].message.content)


# Run evaluation
result = evaluate_response(
    query="What is your return policy?",
    response=model_output,
    golden_answer=test_case["golden_answer"]
)

assert result["overall_pass"], f"Evaluation failed: {result['issues']}"

Embedding-Based Metrics

Measure semantic similarity between model output and reference answers using embeddings. This catches responses that say the same thing differently.

Cosine similarity — Compare embedding vectors of response and golden answer
BERTScore — Token-level similarity using contextual embeddings
Semantic textual similarity — Purpose-built models for comparing text meaning

Embedding metrics work well for factual Q&A where the "right answer" exists. They struggle with creative or open-ended tasks.

Traditional NLP Metrics

Classic metrics still have their place:

BLEU/ROUGE — N-gram overlap, useful for summarization and translation
Exact match — For structured outputs like entity extraction
Regex validation — Check format compliance (JSON, email, dates)
Keyword presence — Verify required terms appear in response

Specialized Evaluators

Some evaluation tasks need purpose-built tools:

Code execution — Run generated code and check outputs
SQL validation — Execute queries and compare results
Math verification — Compute expected answers and compare
Citation checking — Verify claims trace back to sources (critical for RAG systems)

Evaluation Frameworks Comparison

Several open-source frameworks help structure LLM evaluation. Each has different strengths.

Framework	Best For	Key Features	Limitations
deepeval	Unit testing LLMs	Pytest integration, built-in metrics, CI/CD friendly	Smaller community, less customizable
ragas	RAG evaluation	RAG-specific metrics (context relevance, faithfulness), easy setup	Focused on RAG, less general-purpose
promptfoo	Prompt engineering	YAML config, side-by-side comparisons, model agnostic	Less Python-native, more CLI-focused
LangSmith	LangChain users	Deep LangChain integration, tracing, dataset management	LangChain ecosystem lock-in
Weights & Biases	ML teams	Experiment tracking, visualization, team collaboration	More complex setup, enterprise pricing

deepeval Example

deepeval integrates with pytest for familiar testing patterns:

from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualRelevancyMetric
)

def test_rag_response():
    """Test that RAG responses are relevant and grounded."""

    test_case = LLMTestCase(
        input="What is the refund policy for annual plans?",
        actual_output=rag_pipeline.query("What is the refund policy for annual plans?"),
        expected_output="Annual plans can be refunded within 30 days...",
        retrieval_context=[
            "Refund Policy: Annual subscriptions are eligible for full refund within 30 days of purchase.",
            "After 30 days, refunds are prorated based on remaining subscription period."
        ]
    )

    # Define evaluation metrics
    relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
    faithfulness_metric = FaithfulnessMetric(threshold=0.8)
    context_metric = ContextualRelevancyMetric(threshold=0.7)

    # Run assertions
    assert_test(test_case, [relevancy_metric, faithfulness_metric, context_metric])


def test_no_hallucination():
    """Verify response doesn't contain claims unsupported by context."""

    test_case = LLMTestCase(
        input="What premium features are included?",
        actual_output=model_response,
        retrieval_context=retrieved_docs
    )

    # Faithfulness checks that claims are grounded in context
    faithfulness = FaithfulnessMetric(threshold=0.9)
    assert_test(test_case, [faithfulness])

CI/CD Integration

Evaluation should run automatically on every change. Catching regressions before deployment is far cheaper than fixing them in production.

GitHub Actions Workflow

name: LLM Evaluation Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

env:
  OPENAI_API_KEY: $

jobs:
  evaluate:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install deepeval pytest

      - name: Run evaluation suite
        run: |
          deepeval test run tests/evaluation/

      - name: Run regression tests
        run: |
          pytest tests/evaluation/ -v --tb=short

      - name: Upload evaluation results
        uses: actions/upload-artifact@v4
        with:
          name: evaluation-results
          path: evaluation_results/

      - name: Check quality gates
        run: |
          python scripts/check_quality_gates.py \
            --min-relevancy 0.8 \
            --min-faithfulness 0.85 \
            --max-hallucination-rate 0.05

      - name: Comment PR with results
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const results = JSON.parse(fs.readFileSync('evaluation_results/summary.json'));

            const body = `## LLM Evaluation Results

            | Metric | Score | Threshold | Status |
            |--------|-------|-----------|--------|
            | Relevancy | ${results.relevancy.toFixed(2)} | 0.80 | ${results.relevancy >= 0.8 ? 'Pass' : 'Fail'} |
            | Faithfulness | ${results.faithfulness.toFixed(2)} | 0.85 | ${results.faithfulness >= 0.85 ? 'Pass' : 'Fail'} |
            | Hallucination Rate | ${results.hallucination_rate.toFixed(2)} | 0.05 | ${results.hallucination_rate <= 0.05 ? 'Pass' : 'Fail'} |

            **Test Cases:** ${results.total_tests} | **Passed:** ${results.passed} | **Failed:** ${results.failed}
            `;

            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: body
            });

Quality Gates

Define clear thresholds that must pass before deployment:

Minimum accuracy scores — Relevancy, faithfulness, factuality above thresholds
Maximum regression — No more than X% degradation from baseline
Zero critical failures — Safety violations, format errors, or complete failures block deployment
Coverage requirements — Evaluation must cover all major query categories

Production Monitoring

Evaluation doesn't stop at deployment. Production traffic reveals failure modes that testing misses.

Online Evaluation

Run evaluation continuously on production traffic:

Sample-based evaluation — Evaluate a random sample of queries automatically
Shadow evaluation — Run candidate models alongside production and compare
A/B testing — Compare model versions with real users

User Feedback Integration

Explicit and implicit user signals are invaluable:

Thumbs up/down — Direct quality signal from users
Regeneration requests — User asking for new response indicates failure
Follow-up questions — May indicate incomplete or confusing responses
Task completion — Did the user accomplish their goal?

Alerting on Quality Degradation

Set up alerts for quality drops:

Rolling accuracy metrics — Track 24h rolling average, alert on significant drops
Error rate spikes — Sudden increase in format errors or failed generations
Latency degradation — P95 latency exceeding thresholds
User satisfaction drops — Declining thumbs-up ratio or increasing complaints

Common Evaluation Mistakes

Teams new to LLM evaluation often fall into these traps.

Testing on Training Data

If your test cases overlap with data used to fine-tune or prompt-engineer the model, you're measuring memorization, not generalization. Keep strict separation between training and evaluation data.

Overfitting to Benchmarks

Optimizing for evaluation metrics can diverge from actual user satisfaction. A model that scores perfectly on your test suite but frustrates real users is a failure. Include user feedback in your evaluation loop.

Ignoring Edge Cases

Most evaluation datasets skew toward common, easy cases. Real production traffic includes adversarial inputs, unusual phrasing, out-of-scope queries, and combined intents. Deliberately include difficult cases in your test suite.

Static Evaluation Sets

A fixed evaluation dataset becomes stale as your product evolves and user behavior changes. Continuously refresh test cases from production data.

Single-Metric Focus

Optimizing for one metric often degrades others. A model that maximizes factuality by being extremely conservative will fail on completeness. Track multiple dimensions and understand their trade-offs.

Insufficient Test Volume

With 20 test cases, a single evaluation failure swings your accuracy by 5%. You need enough test cases for statistically meaningful results — typically 100+ for core functionality, 500+ for production readiness.

How Virtido Can Help You Build LLM Evaluation Pipelines

At Virtido, we help enterprises build production-grade AI systems, including the evaluation infrastructure that ensures they work reliably. Our teams combine ML engineering expertise with practical deployment experience.

What We Offer

Evaluation pipeline design — Architect testing systems tailored to your LLM applications and use cases
Dataset creation and curation — Build high-quality evaluation datasets from your production data
CI/CD integration — Implement automated evaluation in your development workflow
Production monitoring — Deploy real-time quality tracking and alerting systems
AI/ML engineers on demand — Augment your team with vetted specialists in 2-4 weeks

We've built evaluation systems for LLM applications across financial services, healthcare, e-commerce, and enterprise software. Our staff augmentation model provides Swiss contracts and full IP protection.

Final Thoughts

LLM evaluation is the unsexy work that separates demos from production systems. Without rigorous evaluation, you're deploying AI on faith — hoping it works, unable to detect when it degrades, and surprised when users complain. Building proper evaluation infrastructure requires upfront investment, but it pays dividends in deployment confidence, faster iteration, and early detection of problems.

The key insight is that LLM evaluation requires a different mindset than traditional software testing. You're not checking for exact correctness — you're measuring quality across multiple dimensions, accepting that "correct" often means "good enough for the use case." This requires human judgment encoded into automated systems, continuous monitoring of production behavior, and honest acknowledgment of what your tests can and cannot catch.

Start simple: build a small evaluation dataset from real queries, implement LLM-as-Judge for core quality dimensions, integrate testing into your CI/CD pipeline, and monitor production feedback. Expand coverage as you learn where your system fails. The goal isn't perfect evaluation — it's sufficient confidence that your system works well enough to serve real users, and fast feedback when it stops working.

Frequently Asked Questions

How many test cases do I need for LLM evaluation?

Start with 50-100 high-quality test cases covering your core use cases. For production readiness, aim for 500+ test cases across all query categories. The key is quality over quantity — 100 well-crafted test cases with accurate golden answers beat 1,000 poorly constructed ones. Expand continuously by sampling from production failures and edge cases.

Can I use GPT-4 to evaluate GPT-4 outputs?

Yes, and this approach (LLM-as-Judge) works well for many use cases. Research shows GPT-4 evaluations correlate strongly with human judgments for factuality and relevance. However, be aware of potential biases — the same model may be lenient toward its own style of responses. For critical applications, combine LLM-as-Judge with human review on a sample of cases and use a different model family for evaluation when possible.

What accuracy targets should I set for production LLM systems?

Targets depend on your use case and the cost of errors. Customer-facing applications typically require 85-95% relevance scores. Safety-critical applications (medical, legal, financial) need 95%+ factuality with near-zero tolerance for harmful outputs. Internal tools can often tolerate lower thresholds (75-85%). Start with baselines from your current system (even if manual) and set incremental improvement targets.

How do I test for hallucinations specifically?

Test hallucination through faithfulness evaluation: check whether claims in the response are supported by provided context or verifiable sources. For RAG systems, use metrics like RAGAS faithfulness score that compare generated statements against retrieved documents. Include test cases with questions the model shouldn't be able to answer — a good system says "I don't know" rather than inventing facts. Regularly audit production responses for factual accuracy.

How do I evaluate RAG systems differently from standard LLMs?

RAG evaluation adds retrieval quality to generation quality. Evaluate the retrieval component separately: are the right documents being retrieved? Then evaluate generation: is the response faithful to retrieved context? Key RAG-specific metrics include context relevance (are retrieved docs relevant to query?), context utilization (does the response use the context?), and groundedness (are claims traceable to sources?). See our RAG guide for architecture details.

How often should I re-evaluate my LLM system?

Run automated evaluation on every code change that touches the LLM pipeline. For production monitoring, sample and evaluate continuously — at minimum daily, ideally on a rolling basis throughout the day. Re-run full evaluation suites whenever you update prompts, switch models, retrain embeddings, or change retrieval logic. Also trigger evaluation when you observe quality issues in production metrics.

What do I do when human annotators disagree on whether a response is correct?

Annotator disagreement reveals ambiguity in your evaluation criteria or genuinely difficult cases. First, improve annotation guidelines to reduce ambiguity. Use multiple annotators per case and measure inter-annotator agreement (Krippendorff's alpha or Cohen's kappa). For cases with legitimate disagreement, consider accepting multiple valid answers or using majority voting. High-disagreement cases often make poor automated test cases — flag them for human review instead.

How do I handle non-determinism in LLM outputs during testing?

Set temperature to 0 for reproducibility, though this doesn't guarantee identical outputs across API calls. Run each test case multiple times (3-5) and evaluate statistical consistency rather than exact matches. Use semantic similarity and LLM-as-Judge evaluation rather than exact string comparison. For CI/CD, set pass thresholds that account for variance — require 4/5 runs to pass rather than perfect consistency. Track variance as a metric itself.

Should I evaluate prompts separately from models?

Yes, treat prompts as code that needs testing. When changing prompts, run evaluation against a fixed model to isolate prompt impact. When changing models, evaluate with fixed prompts. This separation helps you understand whether regressions come from prompt changes or model changes. Version control prompts alongside code and include them in your evaluation dataset metadata so you can reproduce historical evaluations.

What's the best way to evaluate multi-turn conversations?

Multi-turn evaluation is significantly harder than single-turn. Test conversation-level metrics: does the assistant maintain context? Does it handle reference resolution ("What about the second option?")? Does conversation quality degrade over turns? Create test scenarios that cover common multi-turn patterns (clarification, follow-up, topic switch). Evaluate both individual turn quality and overall conversation success. Track metrics like conversation completion rate and turns-to-resolution.

View full post