The guardrails your AI agents are missing
Fairvisor protects your LLM spend with token-based rate limiting, agentic loop detection, and budget circuit breakers. Built for teams shipping AI agents to production.
What Fairvisor Does for AI Teams
Token-Based Rate Limiting (TPM/TPD)
Limit tokens per minute or per day — the same units your LLM provider bills you. Not requests. Not bytes. Tokens. Prompt tokens counted before the request is sent; completion tokens counted during streaming. → LLM Token Limiter docs
Agentic Loop Detection
When an agent sends the same request 10 times in 60 seconds, it’s not being thorough — it’s stuck. Fairvisor fingerprints requests and detects repetition patterns in real time. Returns 429 with X-Fairvisor-Reason: loop_detected. → Loop Detector docs
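On the client side, a loop-detection 429 should be treated differently from an ordinary rate limit: retrying the same call is exactly what got the agent stuck. A minimal sketch of that distinction, using the X-Fairvisor-Reason header from the docs (the handler name and actions are illustrative, not part of Fairvisor):

```python
# Sketch: how an agent's HTTP layer might react to Fairvisor's
# loop-detection signal. The header name comes from the docs;
# the handler and action names are illustrative.

def handle_llm_response(status_code, headers):
    """Return an action for the agent runtime to take."""
    if status_code == 429 and headers.get("X-Fairvisor-Reason") == "loop_detected":
        # The agent is repeating itself: do not retry the same call.
        # Surface the loop to a human or a supervisor step instead.
        return "abort_and_escalate"
    if status_code == 429:
        # Ordinary rate limit: back off and retry.
        return "retry_with_backoff"
    return "proceed"

print(handle_llm_response(429, {"X-Fairvisor-Reason": "loop_detected"}))
# → abort_and_escalate
```

The key design point: loop_detected means "stop and rethink", not "wait and retry".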
Cost-Based Budgets
Set a daily or hourly budget per org, team, or user. Fairvisor tracks spend in real time and enforces staged actions: warn at 80%, throttle at 95%, reject at 100%. → Cost-Based Budget docs
Budget Circuit Breaker
If spend rate exceeds 5–10x the baseline, the circuit breaker trips automatically. Protects against agent bugs that generate thousands of expensive requests, prompt injection triggering recursive tool calls, and cascading failures.
Mid-Stream Enforcement
Fairvisor counts tokens during SSE streaming. If a completion exceeds your max_completion_tokens mid-stream, Fairvisor gracefully closes the stream with finish_reason: length. No corrupt responses. No wasted tokens after the cutoff.
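From the client's point of view, a mid-stream cutoff looks like a normal end of stream with finish_reason set to length. A minimal consumer sketch, with the chunk shape following the OpenAI streaming format and the stream itself simulated:

```python
# Sketch: a client consuming a stream that Fairvisor may close early.
# Chunk shape follows the OpenAI streaming format; the stream here is
# simulated rather than fetched over SSE.

def collect_stream(chunks):
    """Accumulate streamed text; report how the stream ended."""
    text, finish = [], None
    for chunk in chunks:
        text.append(chunk.get("delta", ""))
        finish = chunk.get("finish_reason")
        if finish is not None:
            break  # Fairvisor (or the provider) ended the stream cleanly
    return "".join(text), finish

simulated = [
    {"delta": "Hello"},
    {"delta": ", world"},
    {"delta": "", "finish_reason": "length"},  # mid-stream cutoff
]
text, finish = collect_stream(simulated)
print(text, finish)  # Hello, world length
```

Because the cutoff arrives as a well-formed final event, existing clients that already check finish_reason need no changes.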
Shadow Mode
Run any policy against live traffic before enforcing it. See exactly which requests would be blocked, throttled, or budgeted — without affecting your users. Tune against real production patterns, then promote to enforcement. → Shadow mode docs
What a Runaway Loop Looks Like
The pattern is always the same: an agent gets stuck, retries the same call, and retry logic makes it worse.
3:00am — Agent starts a scheduled data enrichment job. First call succeeds.
3:02am — A downstream API returns a transient error. Agent retries.
3:05am — Retry logic is misconfigured. Agent retries 200 times per minute.
3:15am — Monitoring threshold triggers. Alert queued.
3:20am — PagerDuty fires.
3:35am — On-call engineer wakes up and kills the process. $8,000 spent.
With Fairvisor: loop detected at request 10. Budget circuit breaker trips. Agent gets a 429. Total damage: under $1.
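Taken together, the guardrails in this scenario could be expressed as policy config. The schema below is purely illustrative (not Fairvisor's actual syntax), with thresholds mirroring the numbers quoted on this page:

```yaml
# Hypothetical policy config. Illustrative schema only, not
# Fairvisor's actual syntax; thresholds mirror the examples above.
policies:
  loop_detection:
    window_seconds: 60
    max_repeats: 10          # the "10 requests in 60 seconds" rule
    action: reject_429       # returns X-Fairvisor-Reason: loop_detected
  budget:
    scope: team
    daily_usd: 100
    stages:
      - { at: 80%,  action: warn_header }
      - { at: 95%,  action: throttle }
      - { at: 100%, action: reject_429 }
  circuit_breaker:
    baseline_multiplier: 5   # trip at 5-10x the recent spend rate
```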
How It Integrates
Your App → [Fairvisor Edge] → OpenAI / Anthropic / Azure OpenAI / vLLM / Ollama
Works with OpenAI-compatible APIs using standard gateway/proxy integration patterns. Most teams deploy without app-level rewrites, though the exact integration depends on your existing gateway and client setup. → Deployment guide
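In the gateway pattern, the application keeps sending the same OpenAI-style requests; only the host changes from the provider's endpoint to the Fairvisor edge. A sketch, where the edge hostname is a placeholder (see the deployment guide for your real endpoint):

```python
# Sketch of the gateway pattern: same OpenAI-style request path,
# different host. The edge hostname below is a placeholder.

FAIRVISOR_EDGE = "https://fairvisor.internal.example"  # hypothetical

def chat_url(base=FAIRVISOR_EDGE):
    # Same path the OpenAI API uses; only the host changes.
    return f"{base}/v1/chat/completions"

print(chat_url())
# → https://fairvisor.internal.example/v1/chat/completions
```

With the official OpenAI SDK, the equivalent move is passing `base_url=FAIRVISOR_EDGE + "/v1"` when constructing the client, so no per-call code changes are needed.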
Compatible with:
- OpenAI API
- Azure OpenAI
- Anthropic (Claude)
- Google Gemini
- vLLM, Ollama, LiteLLM
- Any OpenAI-compatible endpoint
Who This Is For
- Engineering teams shipping AI agents to production
- ML platform teams managing LLM spend across multiple models and orgs
- AI product teams building multi-agent or agentic workflows
- Startups where a single runaway agent can hit the monthly budget in hours
FAQ
How does token-based rate limiting work?
Prompt tokens are counted before the request is forwarded to the LLM; completion tokens are counted during SSE streaming. Limits apply by tokens/minute (TPM), tokens/day (TPD), or cost/day — the same units your provider bills you. Not request counts that ignore the cost difference between a 100-token and a 100,000-token prompt.
What is agentic loop detection?
Fairvisor fingerprints each request and tracks repetition patterns across a sliding window. When an agent sends near-identical requests 10 times within 60 seconds, it returns a 429 with X-Fairvisor-Reason: loop_detected. Thresholds are configurable. The fingerprint covers prompt structure and content hash — not just exact string match.
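The idea can be sketched in a few lines: hash a normalized view of each request and count repeats inside a sliding 60-second window. Fairvisor's real fingerprint covers prompt structure as well; this sketch shows only the content-hash half, with thresholds mirroring the docs (10 in 60 seconds):

```python
# Sketch of fingerprint-based loop detection: hash a normalized view
# of each request, count repeats in a sliding window. Illustrative
# only; the real fingerprint also covers prompt structure.
import hashlib
import time
from collections import defaultdict, deque

WINDOW_S, MAX_REPEATS = 60, 10
_seen = defaultdict(deque)  # fingerprint -> recent timestamps

def fingerprint(model, messages):
    # Normalize whitespace so trivially different retries still match.
    body = model + "|" + "|".join(" ".join(m["content"].split()) for m in messages)
    return hashlib.sha256(body.encode()).hexdigest()

def is_loop(model, messages, now=None):
    now = time.monotonic() if now is None else now
    q = _seen[fingerprint(model, messages)]
    while q and now - q[0] > WINDOW_S:  # drop entries outside the window
        q.popleft()
    q.append(now)
    return len(q) >= MAX_REPEATS

msgs = [{"role": "user", "content": "enrich record 42"}]
hits = [is_loop("gpt-4o", msgs, now=float(t)) for t in range(12)]
print(hits.count(True))  # → 3: calls 10, 11, and 12 flag as a loop
```

Requests 1 through 9 pass; from the 10th near-identical request onward, the detector fires.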
What happens when a budget is exhausted?
Staged enforcement: warn header at 80%, throttle (200–500ms delay) at 95%, reject with 429 at 100%. The budget circuit breaker also trips automatically if spend rate spikes 5–10x the recent baseline — catching runaway loops before the daily budget is drained.
Does it work with all LLM providers?
Works with OpenAI-compatible endpoints such as OpenAI, Azure OpenAI, Anthropic-compatible gateways, Gemini-compatible gateways, vLLM, Ollama, and LiteLLM. Integration details depend on each provider/proxy path.
Can I set different limits per team or user?
Yes. Limits are scoped to any combination of org, team, user, or endpoint. Each gets isolated counters. One team hitting their budget doesn’t affect others. → Cost-Based Budget docs
Does Fairvisor add latency to my LLM calls?
Fairvisor is designed for sub-millisecond decisioning in typical deployments. Actual overhead depends on deployment topology and traffic profile.
How does mid-stream token enforcement work?
Fairvisor counts completion tokens during SSE streaming. If a completion exceeds your max_completion_tokens limit mid-stream, Fairvisor closes the stream gracefully with finish_reason: length. No truncated JSON, no corrupted responses — a clean, handled cutoff.
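The proxy-side logic can be sketched as a generator that counts tokens as chunks pass through and substitutes a well-formed final event at the limit. Real counting uses the model's tokenizer; whitespace splitting below is a stand-in for illustration:

```python
# Sketch of mid-stream enforcement: count completion tokens as chunks
# pass through, then stop cleanly at the limit. Whitespace splitting
# stands in for real tokenization; illustrative only.

def enforce_stream(chunks, max_completion_tokens):
    """Yield chunks until the token budget is spent, then a clean cutoff."""
    used = 0
    for chunk in chunks:
        used += len(chunk.split())  # stand-in for real tokenization
        if used > max_completion_tokens:
            # Emit a well-formed final event instead of truncating mid-chunk.
            yield {"delta": "", "finish_reason": "length"}
            return
        yield {"delta": chunk}
    yield {"delta": "", "finish_reason": "stop"}

out = list(enforce_stream(["one two", "three four", "five six"], 4))
print(out[-1]["finish_reason"])  # → length
```

The chunk that would overshoot the budget is dropped and replaced by the cutoff event, which is why the client never sees truncated JSON.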
What is the budget circuit breaker?
If spend rate exceeds 5–10x the recent baseline, the circuit breaker trips automatically — even if the daily budget hasn’t been reached. Catches runaway agent loops before they drain the budget. → Cost-Based Budget docs
Why teams choose Fairvisor
Token limits that match how you're billed
RPS limits don’t help when one request costs $47. Fairvisor limits by tokens and cost — the same units your provider charges.
Catches loops before they hit your card
Fingerprint-based detection stops recursive agent calls in real time — not 20 minutes later on a PagerDuty alert.
Works with your existing LLM client
Fairvisor returns errors in the OpenAI-compatible format, so your SDK’s existing retry paths keep working. Most integrations avoid application rewrites.
Comparison: Fairvisor vs. Alternatives
| Feature | Fairvisor | LiteLLM | Helicone | Cloudflare AI Gateway | Apigee |
|---|---|---|---|---|---|
| Token-based controls | Yes | Yes (quotas/budgets) | Partial | Yes | Yes |
| Agentic loop detection (native) | Yes | No | No | No | No |
| Cost budgets (enforcement) | Yes | Partial | Partial | Partial | Partial |
| Budget circuit breaker (native) | Yes | No | No | No | No |
| Mid-stream token enforcement | Yes | No | No | No | No |
| Self-hosted option | Yes | Yes | Partial | No | No |
| Shadow / dry-run capability | Yes | Partial | No | No | Partial |
Stop your agents from burning money while you sleep
Deploy in 5 minutes
Also relevant
For FinOps
Real-time cost attribution and budget enforcement by tenant, team, and endpoint.
For LLM Providers
Anti-extraction controls, identity-aware enforcement, and forensics at the inference layer.
For Platform Engineering
Policy-as-config, GitOps-native, Kubernetes-ready rate limiting infrastructure.