Cost Tracking
Zil provides built-in token usage metering so you can monitor and enforce budget limits on LLM calls. The SDK tracks raw token counts: input tokens, output tokens, per-model breakdowns, and session totals. Dollar-cost translation is intentionally left out of the framework and delegated to an external service, such as your own billing layer.
Why tokens, not dollars?
LLM pricing changes frequently and varies by provider, region, and contract. Hard-coding prices into the framework would create fragile coupling. Instead, Zil gives you:
- Accurate token counts per request and per session
- Budget enforcement in tokens (hard caps + alert thresholds)
- OpenTelemetry attributes for downstream cost analysis
An external service (like the upcoming Zil Runtime) can combine these token counts with live pricing to produce dollar estimates.
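As a rough illustration of what such an external service might do, the sketch below multiplies token counts by a price table. The `PRICES_PER_MILLION` table, the `example-model` name, and the `estimate_cost_usd` helper are all hypothetical and not part of Zil; the prices are placeholders, not real provider rates.

```python
from decimal import Decimal

# Hypothetical price table: USD per 1,000,000 tokens as (input, output).
# Placeholder figures only; a real billing layer would load live prices.
PRICES_PER_MILLION = {
    "example-model": (Decimal("0.50"), Decimal("1.50")),
}


def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Convert raw token counts for one model into a dollar estimate."""
    input_price, output_price = PRICES_PER_MILLION[model]
    total = input_tokens * input_price + output_tokens * output_price
    return total / Decimal(1_000_000)


estimate = estimate_cost_usd("example-model", 200_000, 100_000)
print(f"Estimated cost: ${estimate}")
```

Keeping the price table outside the framework means a pricing change only touches the billing layer, not the agent code.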
Manifest configuration
Add a spec.cost section to your manifest.yaml:
spec:
  cost:
    max_tokens_per_request: 8192
    max_tokens_per_session: 500000
    alert_threshold_pct: 80
    track_by_model: true
All fields are optional:
| Field | Type | Default | Description |
|---|---|---|---|
| max_tokens_per_request | integer | none | Hard cap per LLM call. Requests exceeding this are blocked. |
| max_tokens_per_session | integer | none | Hard cap on total tokens per agent session. |
| alert_threshold_pct | integer | 80 | Emit a warning when session usage reaches this percentage of max_tokens_per_session. |
| track_by_model | boolean | true | Maintain per-model token breakdowns in zil.cost.by_model. |
If spec.cost is omitted entirely, cost tracking still runs passively (counting tokens with no enforcement).
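The defaults and the passive mode described above can be modeled as follows. This is a minimal sketch, not Zil's loader: the `CostConfig` class, its `enforcing` property, and `load_cost_config` are hypothetical names, and the manifest is assumed to be already parsed into a dict.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CostConfig:
    """Hypothetical in-memory view of spec.cost (not a Zil class)."""
    max_tokens_per_request: Optional[int] = None  # no default: cap disabled
    max_tokens_per_session: Optional[int] = None  # no default: cap disabled
    alert_threshold_pct: int = 80
    track_by_model: bool = True

    @property
    def enforcing(self) -> bool:
        """True when at least one hard cap is configured."""
        return (self.max_tokens_per_request is not None
                or self.max_tokens_per_session is not None)


def load_cost_config(manifest: dict) -> CostConfig:
    """Read spec.cost from a parsed manifest, falling back to the defaults."""
    section = (manifest.get("spec") or {}).get("cost") or {}
    return CostConfig(**section)


passive = load_cost_config({"spec": {}})
capped = load_cost_config({"spec": {"cost": {"max_tokens_per_session": 500000}}})
print(passive.enforcing, capped.enforcing, capped.alert_threshold_pct)
```

With no `spec.cost` section, every cap is `None`, which corresponds to counting tokens without enforcement.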
SDK usage
Cost tracking is enabled by default when you call create_agent(). After the agent processes requests, you can inspect usage via the zil.cost singleton:
import zil

# lookup_order and cancel_order are your own tool functions, defined elsewhere.
root_agent = zil.create_agent(
    tools=[lookup_order, cancel_order],
)

# ... after the agent processes some requests ...

# Session totals
print(f"Total tokens: {zil.cost.total_tokens}")
print(f"Input tokens: {zil.cost.total_input_tokens}")
print(f"Output tokens: {zil.cost.total_output_tokens}")
print(f"Requests: {zil.cost.request_count}")

# Budget remaining (None if no session limit set)
print(f"Budget remaining: {zil.cost.budget_remaining}")

# Per-model breakdown
for model, counts in zil.cost.by_model.items():
    print(f"  {model}: {counts.total_tokens} tokens ({counts.request_count} calls)")
zil.cost API
| Property | Type | Description |
|---|---|---|
| total_tokens | int | Total tokens used this session |
| total_input_tokens | int | Total input/prompt tokens |
| total_output_tokens | int | Total output/completion tokens |
| request_count | int | Number of LLM calls recorded |
| budget_remaining | int or None | Tokens left in session budget (None if no limit) |
| by_model | dict[str, TokenCounts] | Per-model breakdown |
| requests | list[UsageRecord] | Full request history |
| config | dict | The raw spec.cost config |
| Method | Description |
|---|---|
| reset() | Clear all accumulators (keeps config and limits) |
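To make the relationship between the properties and `reset()` concrete, here is a small stand-in accumulator. It is an illustration only: `SessionUsage`, `ModelCounts`, and the `add` method are hypothetical names, and it assumes that total tokens are simply input plus output tokens.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelCounts:
    total_tokens: int = 0
    request_count: int = 0


class SessionUsage:
    """Hypothetical stand-in for the zil.cost accumulators."""

    def __init__(self, max_tokens_per_session: Optional[int] = None):
        self.max_tokens_per_session = max_tokens_per_session  # survives reset()
        self.reset()

    def reset(self) -> None:
        """Zero the counters; the configured limit is left untouched."""
        self.total_input_tokens = 0
        self.total_output_tokens = 0
        self.request_count = 0
        self.by_model = defaultdict(ModelCounts)

    @property
    def total_tokens(self) -> int:
        # Assumption: total = input + output.
        return self.total_input_tokens + self.total_output_tokens

    @property
    def budget_remaining(self) -> Optional[int]:
        if self.max_tokens_per_session is None:
            return None
        return self.max_tokens_per_session - self.total_tokens

    def add(self, model: str, input_tokens: int, output_tokens: int) -> None:
        """Accumulate one call into the session and per-model counters."""
        self.total_input_tokens += input_tokens
        self.total_output_tokens += output_tokens
        self.request_count += 1
        self.by_model[model].total_tokens += input_tokens + output_tokens
        self.by_model[model].request_count += 1


usage = SessionUsage(max_tokens_per_session=1000)
usage.add("example-model", 150, 350)
print(usage.total_tokens, usage.budget_remaining)
usage.reset()
print(usage.total_tokens, usage.budget_remaining)
```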
Disabling cost tracking
agent = zil.create_agent(enable_cost_tracking=False)
Budget enforcement
When budget limits are configured, recording an LLM call yields one of three statuses:
| Status | Meaning |
|---|---|
| allowed | Usage recorded normally |
| warned | Usage recorded, but session usage crossed the alert threshold |
| blocked | Usage rejected: the request exceeded a hard cap |
Per-request blocking
If a single LLM response uses more tokens than max_tokens_per_request, the usage is not recorded and the callback returns blocked:
spec:
  cost:
    max_tokens_per_request: 4096
An LLM call that uses 5,000 tokens would be blocked. This is useful for catching runaway tool-use loops or unexpectedly long completions.
Per-session blocking
If recording a request would push the session total past max_tokens_per_session, it is blocked:
spec:
  cost:
    max_tokens_per_session: 100000
Once the session reaches 100,000 tokens, subsequent requests are rejected until zil.cost.reset() is called or a new session starts.
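The two hard caps can be read as a single check made before usage is recorded. The function below is a sketch of that reading, not Zil's implementation; `exceeds_hard_cap` and its parameter names are hypothetical.

```python
from typing import Optional


def exceeds_hard_cap(
    request_tokens: int,
    session_total: int,
    max_per_request: Optional[int] = None,
    max_per_session: Optional[int] = None,
) -> bool:
    """Return True if this usage should be blocked rather than recorded."""
    # Per-request cap: a single call used more tokens than allowed.
    if max_per_request is not None and request_tokens > max_per_request:
        return True
    # Per-session cap: recording would push the session total past the limit.
    if max_per_session is not None and session_total + request_tokens > max_per_session:
        return True
    return False


print(exceeds_hard_cap(5000, 0, max_per_request=4096))
print(exceeds_hard_cap(2000, 99000, max_per_session=100000))
print(exceeds_hard_cap(1000, 99000, max_per_session=100000))
```

A cap left as `None` is never enforced, which matches the passive behavior when no limit is configured.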
Alert threshold
A warning fires once when session usage crosses the alert percentage:
spec:
  cost:
    max_tokens_per_session: 100000
    alert_threshold_pct: 80
At 80,000 tokens, the callback logs:
WARNING zil.sdk.cost — Token budget alert: 80000 tokens used (80% of 100000 session limit)
The alert fires exactly once per session. After a reset, it can fire again.
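The fire-once behavior amounts to a latch that is cleared on reset. A minimal sketch, assuming the threshold is reached at or above the configured percentage; the `AlertLatch` class is hypothetical and not part of the SDK.

```python
class AlertLatch:
    """Hypothetical one-shot alert that fires the first time usage reaches the threshold."""

    def __init__(self, session_limit: int, threshold_pct: int = 80):
        self.threshold_tokens = session_limit * threshold_pct / 100
        self.fired = False

    def crossed(self, session_total: int) -> bool:
        """Return True only on the first call at or above the threshold."""
        if self.fired or session_total < self.threshold_tokens:
            return False
        self.fired = True
        return True

    def reset(self) -> None:
        """Re-arm the latch so the alert can fire again."""
        self.fired = False


latch = AlertLatch(session_limit=100000, threshold_pct=80)
print([latch.crossed(total) for total in (50000, 80000, 90000)])
latch.reset()
print(latch.crossed(85000))
```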
Using the CostCallback directly
The CostCallback is attached to your agent automatically and accessible as agent._zil_cost. You can also use it to manually record usage from custom LLM calls outside the ADK pipeline:
import zil

agent = zil.create_agent(tools=[...])

# Manual recording
result = agent._zil_cost.record(
    input_tokens=150,
    output_tokens=350,
    model="gemini-2.0-flash",
)
print(result.status)            # "allowed", "warned", or "blocked"
print(result.total_tokens)      # session total after recording
print(result.budget_remaining)  # tokens left (or None)
Auto-extracting from LLM responses
If you have a raw LLM response object, the callback can extract usage metadata automatically:
# Gemini response (has .usage_metadata.prompt_token_count / .candidates_token_count)
result = agent._zil_cost.record_from_response(gemini_response)

# OpenAI response (has .usage.prompt_tokens / .completion_tokens)
result = agent._zil_cost.record_from_response(openai_response)
record_from_response() returns None if the response doesn't contain usage metadata.
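One way such extraction could work is duck typing on the attribute names listed in the comments above. The `extract_usage` helper below is a hypothetical sketch, not the callback's actual code, and it uses `SimpleNamespace` objects as stand-ins for real provider responses.

```python
from types import SimpleNamespace
from typing import Optional, Tuple


def extract_usage(response) -> Optional[Tuple[int, int]]:
    """Return (input_tokens, output_tokens), or None if no usage metadata is found."""
    # Gemini-style responses expose .usage_metadata
    gemini = getattr(response, "usage_metadata", None)
    if gemini is not None:
        return gemini.prompt_token_count, gemini.candidates_token_count
    # OpenAI-style responses expose .usage
    openai = getattr(response, "usage", None)
    if openai is not None:
        return openai.prompt_tokens, openai.completion_tokens
    return None


# Stand-in response objects for illustration only.
gemini_like = SimpleNamespace(
    usage_metadata=SimpleNamespace(prompt_token_count=150, candidates_token_count=350)
)
openai_like = SimpleNamespace(
    usage=SimpleNamespace(prompt_tokens=120, completion_tokens=80)
)
print(extract_usage(gemini_like), extract_usage(openai_like), extract_usage(object()))
```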
OpenTelemetry integration
When OTel tracing is active, every CostCallback.record() call emits a span with these attributes:
| Attribute | Type | Description |
|---|---|---|
| llm.usage.input_tokens | int | Input tokens for this call |
| llm.usage.output_tokens | int | Output tokens for this call |
| llm.usage.total_tokens | int | Total tokens for this call |
| llm.usage.model | string | Model name |
| llm.usage.status | string | allowed, warned, or blocked |
| llm.usage.budget_remaining | int | Remaining session budget |
These attributes can be consumed by any OTel-compatible backend (Jaeger, Grafana, Datadog, etc.) for dashboards and alerting.
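If you record usage from your own code paths and want matching spans, the attribute set in the table can be assembled by hand. This is a sketch under assumptions: `usage_attributes` and `emit_usage_span` are hypothetical helpers, the span name `llm.usage` is made up, total tokens are taken as input plus output, and `llm.usage.budget_remaining` is omitted when there is no session limit because OTel attribute values are expected to be primitives rather than None. `emit_usage_span` needs the `opentelemetry-api` package; the import is deferred so the rest of the sketch runs without it.

```python
from typing import Dict, Optional, Union


def usage_attributes(
    input_tokens: int,
    output_tokens: int,
    model: str,
    status: str,
    budget_remaining: Optional[int] = None,
) -> Dict[str, Union[int, str]]:
    """Build the llm.usage.* attributes for one recorded call."""
    attrs: Dict[str, Union[int, str]] = {
        "llm.usage.input_tokens": input_tokens,
        "llm.usage.output_tokens": output_tokens,
        "llm.usage.total_tokens": input_tokens + output_tokens,  # assumption
        "llm.usage.model": model,
        "llm.usage.status": status,
    }
    if budget_remaining is not None:
        attrs["llm.usage.budget_remaining"] = budget_remaining
    return attrs


def emit_usage_span(attrs: Dict[str, Union[int, str]]) -> None:
    """Attach the attributes to a new span (requires opentelemetry-api)."""
    from opentelemetry import trace  # deferred import

    tracer = trace.get_tracer("cost-sketch")
    with tracer.start_as_current_span("llm.usage") as span:
        span.set_attributes(attrs)


attrs = usage_attributes(150, 350, "gemini-2.0-flash", "allowed", budget_remaining=99500)
print(attrs)
```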
Validation
zil validate checks your cost configuration:
✓ spec.cost — configured (request=8192, session=500000, alert@80%)
It also flags issues:
⚠ spec.cost — max_tokens_per_session (4096) is less than max_tokens_per_request (8192)
⚠ spec.cost — max_tokens_per_request (16384) exceeds resource_limits.max_tokens_per_request (8192)
If spec.cost is not present:
⚠ spec.cost — not configured (no token budgets)
Inspect
zil inspect shows cost configuration from a packaged archive:
Zil Package: my-agent
Version: 1.0.0
Framework: adk (python)
Cost: request≤8192, session≤500000, alert@80%
Example: full setup
# manifest.yaml
apiVersion: zil/v1
kind: Agent
metadata:
  name: support-agent
  version: 1.0.0
spec:
  runtime:
    framework: adk
    language: python
  llm:
    adapter: ./adapters/llm.yaml
  identity: ./identity
  cost:
    max_tokens_per_request: 8192
    max_tokens_per_session: 200000
    alert_threshold_pct: 75
    track_by_model: true
# support_agent/agent.py
import zil


def handle_ticket(ticket_id: str) -> dict:
    """Look up a support ticket."""
    return {"ticket_id": ticket_id, "status": "open"}


root_agent = zil.create_agent(tools=[handle_ticket])

# After processing
print(f"Session used {zil.cost.total_tokens} tokens")
print(f"Budget remaining: {zil.cost.budget_remaining}")
for model, usage in zil.cost.by_model.items():
    print(f"  {model}: {usage.total_tokens} tokens over {usage.request_count} calls")