Evaluation
The eval suite defines pre-deployment quality gates for your agent. Run evals standalone with zil eval run or as an automatic gate during zil pack.
Zil uses an adapter pattern for evaluation — DeepEval is the default engine, providing both deterministic assertions and LLM-as-judge metrics. The framework is selected in evals/config.yaml.
Creating test cases
Zil provides four ways to build eval test cases, from manual to fully automated:
| Method | Command | Best for |
|---|---|---|
| Write by hand | Edit evals/cases/*.yaml | Precise, regression-specific cases |
| Interactive builder | zil eval add | Chat with the agent, approve responses as golden cases |
| Session recording | zil eval record | Capture a natural conversation and convert it into cases |
| LLM generation | zil eval generate | Bootstrap a test suite from your agent’s identity files |
Interactive builder — zil eval add
Chat with your agent one message at a time. After each response, decide whether it should pass, tag keywords, and choose metrics. Cases are saved immediately.
zil eval add --group accuracy
You: What is Zil?
Running agent...
Agent: Zil is an open-source framework by FluentData for building production AI agents.
Should this response pass? [Y/n]: y
Expected keywords (comma-separated, or empty): framework, agent, FluentData
Metrics (comma-separated, or empty for deterministic-only): answer_relevancy
✓ Case saved (1 total)
Add another? [Y/n]:
Session recording — zil eval record
Have a natural conversation with the agent. When you’re done, review the transcript and pick which turns become eval cases. Keywords are auto-detected from responses.
zil eval record --group regression
Recording session — chat with your agent.
Type /done or press Ctrl+D to finish.
You: How do I create a new agent?
Agent: Run `zil init my-agent` to scaffold a new project...
You: What LLM providers are supported?
Agent: Zil supports Gemini, Anthropic, OpenAI, and Vertex AI...
You: /done
Recorded 2 turn(s). Reviewing...
1. You: How do I create a new agent?
Agent: Run `zil init my-agent` to scaffold a new project...
Include as eval case? [Y/n]: y
Auto-detected keywords: agent, scaffold, project
Edit keywords (comma-separated, or Enter to keep): zil init, scaffold, project
✓ 2 case(s) saved to cases/regression.yaml
LLM generation — zil eval generate
The judge LLM reads your agent’s identity/ files (persona, instructions, guardrails) and synthesizes diverse test cases automatically. Review each one before saving.
# Generate 20 cases focused on guardrails
zil eval generate --count 20 --category guardrails
# Generate and save all without review
zil eval generate --no-review --count 15 --group edge-cases
All three command-based methods (zil eval add, zil eval record, zil eval generate) auto-register new case files in the suite YAML, so no manual wiring is needed.
Engine configuration
# evals/config.yaml
eval_engine:
  framework: deepeval
  judge:
    # LLM used for evaluation scoring (separate from agent LLM)
    provider: gemini
    model: gemini-2.0-flash
    api_key_env: GOOGLE_API_KEY
  # Per-metric pass thresholds (override defaults)
  metric_thresholds:
    answer_relevancy: 0.7
    hallucination: 0.9
    faithfulness: 0.8
  # Execution controls
  execution:
    concurrency: 3   # parallel case evaluations
    retries: 2       # retry on transient errors
    timeout: 30      # per-case timeout (seconds)
The judge LLM is intentionally decoupled from the agent’s own LLM — it scores the agent’s output independently.
| Field | Type | Default | Description |
|---|---|---|---|
| eval_engine.framework | string | deepeval | Adapter to use |
| eval_engine.judge.provider | string | gemini | LLM provider for scoring |
| eval_engine.judge.model | string | gemini-2.0-flash | Model name for the judge |
| eval_engine.judge.api_key_env | string | GOOGLE_API_KEY | Env var holding the API key |
| eval_engine.metric_thresholds | map | {} | Per-metric pass thresholds (0.0–1.0) |
| eval_engine.execution.concurrency | int | 1 | Max parallel case evaluations |
| eval_engine.execution.retries | int | 0 | Retry count on transient errors |
| eval_engine.execution.timeout | int | 60 | Per-case timeout in seconds |
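The defaults in the table imply that a partial evals/config.yaml is completed with default values for anything it omits. The sketch below illustrates that idea in Python with plain dictionaries; `merge_defaults` is a hypothetical helper written for this example, not part of Zil's API, and Zil's actual resolution logic may differ.

```python
from copy import deepcopy

# Defaults copied from the field table above.
DEFAULTS = {
    "eval_engine": {
        "framework": "deepeval",
        "judge": {
            "provider": "gemini",
            "model": "gemini-2.0-flash",
            "api_key_env": "GOOGLE_API_KEY",
        },
        "metric_thresholds": {},
        "execution": {"concurrency": 1, "retries": 0, "timeout": 60},
    }
}


def merge_defaults(defaults: dict, overrides: dict) -> dict:
    """Hypothetical helper: recursively overlay user values on the defaults."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested section: merge key by key instead of replacing it wholesale.
            merged[key] = merge_defaults(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# A partial config that only raises concurrency and sets one threshold.
user_config = {
    "eval_engine": {
        "metric_thresholds": {"faithfulness": 0.8},
        "execution": {"concurrency": 3},
    }
}

resolved = merge_defaults(DEFAULTS, user_config)
print(resolved["eval_engine"]["execution"])
# {'concurrency': 3, 'retries': 0, 'timeout': 60}
```

Under this assumption, omitting retries and timeout leaves them at 0 and 60, and the judge settings stay on the Gemini defaults.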
Suite definition
# evals/baseline.yaml
eval_suite:
  name: baseline
  pass_threshold: 0.85
  metrics:
    - answer_relevancy
  cases:
    - file: ./cases/accuracy.yaml
      weight: 0.5
    - file: ./cases/tool_use.yaml
      weight: 0.3
    - file: ./cases/escalation.yaml
      weight: 0.2
| Field | Type | Required | Description |
|---|---|---|---|
| eval_suite.name | string | Yes | Suite identifier |
| eval_suite.pass_threshold | float | Yes | Minimum weighted score to pass (0.0–1.0) |
| eval_suite.metrics | string[] | No | LLM-as-judge metrics applied to all cases by default |
| eval_suite.cases[].file | string | Yes | Path to the eval case file |
| eval_suite.cases[].weight | float | Yes | Weight in the overall score (should sum to 1.0) |
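Because the weights should sum to 1.0 and the threshold must lie in 0.0–1.0, a quick sanity check before committing a suite file can catch typos. The following is a minimal sketch over the parsed eval_suite mapping; `validate_suite` is a hypothetical helper for illustration, and whether Zil performs an equivalent check itself is not stated here.

```python
import math

# The eval_suite mapping from evals/baseline.yaml, written out as a Python dict.
suite = {
    "name": "baseline",
    "pass_threshold": 0.85,
    "cases": [
        {"file": "./cases/accuracy.yaml", "weight": 0.5},
        {"file": "./cases/tool_use.yaml", "weight": 0.3},
        {"file": "./cases/escalation.yaml", "weight": 0.2},
    ],
}


def validate_suite(suite: dict) -> list:
    """Hypothetical check: return a list of human-readable problems."""
    problems = []
    threshold = suite.get("pass_threshold")
    if not isinstance(threshold, (int, float)) or not 0.0 <= threshold <= 1.0:
        problems.append("pass_threshold must be a number between 0.0 and 1.0")
    total = sum(entry["weight"] for entry in suite.get("cases", []))
    # Compare with a tolerance because weights are floats.
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        problems.append(f"weights sum to {total:.3f}, expected 1.0")
    return problems


print(validate_suite(suite))
# []
```

An empty list means the suite definition is consistent with the constraints in the table.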
Writing eval cases
Each case file is a YAML document with test inputs and expected outputs:
Accuracy cases
# evals/cases/accuracy.yaml
name: accuracy
cases:
  - input: "What is Zil?"
    expected_output: "Zil is a framework for production AI agents"
    expected_contains:
      - "framework"
      - "agent"
    context:
      - "Zil is an open-source framework by FluentData for building production AI agents"
  - input: "Hello"
    expected_contains:
      - "hello"
    metrics: []  # deterministic only — no LLM judge needed
Tool use cases
# evals/cases/tool_use.yaml
name: tool_use
cases:
  - input: "Look up order #12345"
    expected_tool: lookup_order
    expected_contains:
      - "order"
Escalation cases
# evals/cases/escalation.yaml
name: escalation
cases:
  - input: "I want to talk to a human"
    expected_action: escalate
  - input: "This is urgent, connect me to a manager"
    expected_action: escalate
Case fields
| Field | Type | Description |
|---|---|---|
| input | string | The user message to test |
| expected_output | string | Full expected response (for semantic comparison) |
| expected_contains | string[] | Strings that must appear in the agent’s response |
| expected_tool | string | Tool name the agent should invoke |
| expected_action | string | Action the agent should take (e.g., escalate) |
| context | string[] | Ground-truth context (for faithfulness/hallucination metrics) |
| metrics | string[] | Per-case metric override (empty [] = deterministic only) |
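To make the deterministic fields concrete, here is a rough sketch of how expected_contains, expected_tool, and expected_action could be checked against one agent turn. The function name, its arguments, and the case-insensitive substring match are all assumptions made for this illustration; the source does not describe how Zil implements these checks, and the sample response text is invented.

```python
from typing import List, Optional


def deterministic_failures(
    case: dict,
    response: str,
    tools_called: List[str],
    action: Optional[str] = None,
) -> List[str]:
    """Hypothetical checker: return the reasons a case fails (empty list = pass)."""
    failures = []

    # Assumption: substring matching ignores case.
    lowered = response.lower()
    for needle in case.get("expected_contains", []):
        if needle.lower() not in lowered:
            failures.append(f"missing text: {needle!r}")

    expected_tool = case.get("expected_tool")
    if expected_tool is not None and expected_tool not in tools_called:
        failures.append(f"tool not invoked: {expected_tool}")

    expected_action = case.get("expected_action")
    if expected_action is not None and action != expected_action:
        failures.append(f"expected action {expected_action!r}, got {action!r}")

    return failures


tool_case = {
    "input": "Look up order #12345",
    "expected_tool": "lookup_order",
    "expected_contains": ["order"],
}

# Made-up agent output for illustration only.
print(deterministic_failures(tool_case, "Order #12345 has shipped.", ["lookup_order"]))
# []

escalation_case = {"input": "I want to talk to a human", "expected_action": "escalate"}
print(deterministic_failures(escalation_case, "Let me help with that.", []))
# ["expected action 'escalate', got None"]
```

None of these checks needs a judge LLM, which is why a case with metrics: [] can run on the base package alone.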
Available metrics
When using the DeepEval adapter, these LLM-as-judge metrics are available:
| Metric | What it measures |
|---|---|
| answer_relevancy | Is the response relevant to the input? |
| hallucination | Does the response contain fabricated information? |
| faithfulness | Is the response grounded in the provided context? |
| contextual_relevancy | Is the retrieved context relevant to the query? |
| toxicity | Does the response contain harmful content? |
| bias | Does the response exhibit bias? |
How scoring works
Each case is scored pass/fail. The pass rate of each case file is multiplied by its weight, and the weighted rates are summed into the suite score:
suite_score = Σ (case_group_pass_rate × weight)
If suite_score >= pass_threshold, the suite passes.
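As a worked example of the formula, the sketch below computes a suite score for the three baseline groups. The pass counts are invented for illustration, and `suite_score` is a hypothetical function that mirrors the formula rather than Zil's internal code.

```python
def suite_score(groups: list) -> float:
    """Sum of (group pass rate x weight), following the formula above."""
    total = 0.0
    for group in groups:
        if group["total"] == 0:
            raise ValueError(f"group {group['name']!r} has no cases")
        pass_rate = group["passed"] / group["total"]
        total += pass_rate * group["weight"]
    return total


# Weights match evals/baseline.yaml; pass counts are made up.
groups = [
    {"name": "accuracy", "passed": 9, "total": 10, "weight": 0.5},
    {"name": "tool_use", "passed": 4, "total": 5, "weight": 0.3},
    {"name": "escalation", "passed": 2, "total": 2, "weight": 0.2},
]

score = suite_score(groups)
pass_threshold = 0.85

# 0.9 * 0.5 + 0.8 * 0.3 + 1.0 * 0.2 = 0.45 + 0.24 + 0.20
print(f"{score:.2f}")
# 0.89
print(score >= pass_threshold)
# True
```

Note that one failing accuracy case costs 0.05 here, while one failing tool_use case costs 0.06, because the group sizes differ as well as the weights.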
Running evals
zil eval run
# Run the baseline suite
zil eval run
# Verbose output with per-case details
zil eval run --verbose
# JSON output for CI pipelines
zil eval run --json-output
# Override threshold
zil eval run --threshold 0.9
# Run a specific suite
zil eval run --suite regression
As a pack gate
# Normal: runs evals, gates on threshold
zil pack
# Development: skip evals (warns loudly)
zil pack --skip-evals
When evals fail:
→ Running pre-flight evals... ✗ 72.1% (threshold: 85%)
Error: Eval suite failed. Fix eval failures before packaging.
Installation
The eval engine requires the [eval] optional extra:
pip install 'zil-ai[eval]'
For deterministic-only evals (expected_contains, expected_tool, expected_action), no additional dependencies are needed beyond the base zil-ai package.
Best practices
- Start with 3–5 cases per category, expand as you find edge cases
- Weight accuracy highest — it’s the most fundamental quality signal
- Add cases for every production bug — turn incidents into regression tests
- Keep the threshold at 0.85 unless you have a strong reason to change it
- Use context fields for faithfulness metrics — the judge needs ground truth
- Override metrics per-case with metrics: [] for simple deterministic checks
- Never use --skip-evals in CI — it exists for local development only
- Use --json-output in CI pipelines for machine-readable results