Model. Unlike the harness-only integrations, the PydanticAI integration is a full cascade model: a cheap drafter runs first, its response is quality-gated, and the request escalates to a more powerful verifier only when the draft fails the gate. This keeps intelligent cost routing inside the agent loop, where PydanticAI already makes model decisions.
Install
Quick Start
How the Cascade Works
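The flow described in the introduction (drafter first, quality gate, escalation to the verifier only when needed) can be sketched in plain Python. This is a minimal illustration, not the library's implementation: `run_cascade`, `CascadeOutcome`, the toy models, the scorer, and the 0.7 threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-ins: a "model" is any callable from prompt to text,
# and a "scorer" maps (prompt, draft) to a quality score in [0, 1].
ModelFn = Callable[[str], str]
ScoreFn = Callable[[str, str], float]


@dataclass
class CascadeOutcome:
    text: str
    model_used: str   # "drafter" or "verifier"
    escalated: bool
    draft_score: float


def run_cascade(prompt: str, drafter: ModelFn, verifier: ModelFn,
                score: ScoreFn, quality_threshold: float = 0.7) -> CascadeOutcome:
    """Run the drafter first; call the verifier only if the draft scores too low."""
    draft = drafter(prompt)
    draft_score = score(prompt, draft)
    if draft_score >= quality_threshold:
        # Draft passed the gate: the expensive model is never called.
        return CascadeOutcome(draft, "drafter", False, draft_score)
    # Draft failed the gate: escalate the same prompt to the verifier.
    return CascadeOutcome(verifier(prompt), "verifier", True, draft_score)


# Toy models and scorer, for illustration only.
def toy_drafter(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "not sure"


def toy_verifier(prompt: str) -> str:
    return f"verified answer to: {prompt}"


def toy_score(prompt: str, draft: str) -> float:
    return 0.2 if draft == "not sure" else 0.9


easy = run_cascade("What is 2 + 2?", toy_drafter, toy_verifier, toy_score)
hard = run_cascade("Prove the claim.", toy_drafter, toy_verifier, toy_score)
```

In this sketch the easy prompt is answered by the drafter alone, while the hard prompt costs two calls because the draft is discarded after failing the gate.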
Configuration
Domain Policies
Domain policies override cascade behavior for specific topics detected in the query:

| Policy | Effect |
|---|---|
| `direct_to_verifier: True` | Skip the drafter entirely; the verifier handles the full request |
| `force_verifier: True` | The drafter runs (for a cost baseline) but the request always escalates |
| `quality_threshold: 0.95` | Override the default threshold for this domain |
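One way to read the three policy keys is as inputs to a small routing decision. The sketch below is an assumption-laden illustration rather than the integration's code: `DomainPolicy`, `RoutingPlan`, `plan_for`, the domain names, the 0.7 default threshold, and the rule that `direct_to_verifier` takes precedence over `force_verifier` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_QUALITY_THRESHOLD = 0.7  # assumed default; the source does not state one


@dataclass(frozen=True)
class DomainPolicy:
    direct_to_verifier: bool = False
    force_verifier: bool = False
    quality_threshold: Optional[float] = None


@dataclass(frozen=True)
class RoutingPlan:
    run_drafter: bool
    always_escalate: bool
    quality_threshold: float


def plan_for(policy: Optional[DomainPolicy]) -> RoutingPlan:
    """Translate an optional domain policy into a concrete routing plan."""
    if policy is None:
        return RoutingPlan(True, False, DEFAULT_QUALITY_THRESHOLD)
    threshold = (policy.quality_threshold
                 if policy.quality_threshold is not None
                 else DEFAULT_QUALITY_THRESHOLD)
    if policy.direct_to_verifier:
        # Assumed precedence: skipping the drafter wins over force_verifier.
        return RoutingPlan(False, True, threshold)
    return RoutingPlan(True, policy.force_verifier, threshold)


# Hypothetical domain names mapped to the three policies from the table.
policies = {
    "medical": DomainPolicy(direct_to_verifier=True),
    "legal": DomainPolicy(force_verifier=True),
    "finance": DomainPolicy(quality_threshold=0.95),
}
```

Under these assumptions, a "medical" query never reaches the drafter, a "legal" query pays for both models, and a "finance" query follows the normal cascade with a stricter gate.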
Features
- Full cascade Model: drop-in replacement for any PydanticAI `Model`, not just a callback
- Speculative cascading: drafter runs first; verifier only called when quality is insufficient
- Complexity pre-routing: hard/expert queries skip the drafter entirely
- Tool risk gating: high-risk tool calls (e.g. `delete_all`) force verifier escalation
- Domain policies: per-domain quality thresholds and routing overrides
- Harness integration: cost, latency, energy, and budget enforcement via `cascadeflow.run()`
- Fail-open: internal errors never break the agent; cascade degrades gracefully
- Streaming: `request_stream()` supported with quality gating
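The fail-open behavior in the list above can be illustrated with a small wrapper. This is only a sketch of the general pattern: the source does not say what the cascade degrades to, so falling back to a direct verifier call is an assumption, and `fail_open`, `broken_cascade`, and `plain_verifier` are hypothetical names.

```python
from typing import Callable

ModelFn = Callable[[str], str]


def fail_open(cascade: ModelFn, fallback: ModelFn) -> ModelFn:
    """Wrap cascade logic so an internal error degrades to a plain model call."""
    def wrapped(prompt: str) -> str:
        try:
            return cascade(prompt)
        except Exception:
            # Assumed degradation path: answer with the fallback model
            # instead of propagating the cascade's internal error.
            return fallback(prompt)
    return wrapped


def broken_cascade(prompt: str) -> str:
    # Simulates a failure inside the cascade, e.g. in the quality scorer.
    raise RuntimeError("quality scorer crashed")


def plain_verifier(prompt: str) -> str:
    return f"verifier: {prompt}"


safe_model = fail_open(broken_cascade, plain_verifier)
result = safe_model("hello")
```

The agent calling `safe_model` still receives an answer even though the cascade raised, which is the property the feature list describes.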
Cascade Result
After every call, inspect what happened:
Session Metrics
When running inside `cascadeflow.run()`, the harness tracks:
- `cost_total`: cumulative USD spent (drafter + verifier)
- `budget_remaining`: USD left in the budget
- `step_count`: number of LLM calls (1 if drafter accepted, 2 if escalated)
- `energy_used`: total energy units
- `latency_used_ms`: total latency
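To make the bookkeeping concrete, here is a minimal sketch of how such counters could accumulate over one escalated request. The field names mirror the list above, but the `SessionMetrics` class, its `record_call` method, and the dollar and latency figures are hypothetical and are not the harness's actual API. `energy_used` is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class SessionMetrics:
    budget: float                 # total USD budget for the session
    cost_total: float = 0.0       # cumulative USD spent (drafter + verifier)
    step_count: int = 0           # number of LLM calls so far
    latency_used_ms: float = 0.0  # cumulative latency

    @property
    def budget_remaining(self) -> float:
        return self.budget - self.cost_total

    def record_call(self, cost_usd: float, latency_ms: float) -> None:
        """Account for one LLM call, whether drafter or verifier."""
        self.cost_total += cost_usd
        self.step_count += 1
        self.latency_used_ms += latency_ms


metrics = SessionMetrics(budget=1.0)
metrics.record_call(cost_usd=0.25, latency_ms=120.0)  # drafter call
metrics.record_call(cost_usd=0.5, latency_ms=900.0)   # verifier call after escalation
```

An accepted draft would stop after the first `record_call`, leaving `step_count` at 1. The escalated case shown here ends at 2 and carries the cost of both calls.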
Why This Integration Matters
- The cascade sits at the model boundary — the exact place where cost decisions happen
- PydanticAI agents get automatic cost optimization without changing agent logic
- Quality gating ensures cheaper models are only used when they produce good-enough responses
- Budget enforcement, traces, and domain policies all apply inside the agent loop
Limitations
- Streaming uses a non-streaming drafter call for quality gating, then streams the accepted response
- Tool risk classification uses name-based heuristics, not schema analysis
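The second limitation is easier to see with an example of what a name-based heuristic can and cannot catch. The sketch below is an assumption, not the integration's classifier: the marker list and the functions `is_high_risk_tool` and `must_escalate` are hypothetical.

```python
# Assumed marker list; the real heuristics are not documented in this section.
HIGH_RISK_MARKERS = ("delete", "drop", "remove", "destroy", "wipe", "purge")


def is_high_risk_tool(tool_name: str) -> bool:
    """Classify a tool as high risk purely from substrings of its name."""
    name = tool_name.lower()
    return any(marker in name for marker in HIGH_RISK_MARKERS)


def must_escalate(tool_names: list[str]) -> bool:
    """Escalate to the verifier if any requested tool looks high risk."""
    return any(is_high_risk_tool(name) for name in tool_names)


flagged = is_high_risk_tool("delete_all")             # name contains "delete"
missed = is_high_risk_tool("cleanup_everything")      # destructive, but no marker
false_alarm = is_high_risk_tool("remove_background")  # harmless, but contains "remove"
```

Because only the name is inspected, a destructive tool with an innocuous name goes unflagged and a harmless tool with a matching name is escalated. Analyzing the tool's schema could in principle avoid both errors.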