Incident Runbooks for Rate-Limit Reject Spikes

A good starting point is the decision contract. In decision service mode, the gateway calls POST /v1/decision with original request context, then uses the response status and headers to continue or block traffic. It sounds straightforward, but most real failures happen at integration boundaries: an upstream normalizes headers unexpectedly, a route omits original URI forwarding, or timeout budgets are tuned for happy-path latency and collapse during dependency jitter. The fix is usually the same: make these assumptions explicit in configuration and test them as first-class behavior, not as an afterthought.
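
The gateway side of this contract can be sketched in a few lines. This is a minimal Python sketch, not Fairvisor's implementation. Only the `POST /v1/decision` endpoint comes from the text above. The header names `X-Original-Method`, `X-Original-URI`, and `X-Reject-Reason`, the reading of 2xx as "continue" and 429 as "block", and the `fail_open` switch are assumptions made for illustration and should be checked against the real contract.

```python
from dataclasses import dataclass
from typing import Dict, Mapping, Optional

# Hypothetical header names, chosen for illustration only.
ORIGINAL_METHOD_HEADER = "X-Original-Method"
ORIGINAL_URI_HEADER = "X-Original-URI"
REJECT_REASON_HEADER = "x-reject-reason"  # compared in lowercase


@dataclass(frozen=True)
class Verdict:
    allow: bool
    reason: str
    retry_after: Optional[str] = None


def build_decision_headers(method: str, uri: str) -> Dict[str, str]:
    """Build the original-request context sent with POST /v1/decision."""
    if not uri:
        # Fail loudly when a route forgets to forward the original URI.
        raise ValueError("original URI missing; route is not forwarding it")
    return {ORIGINAL_METHOD_HEADER: method.upper(), ORIGINAL_URI_HEADER: uri}


def interpret_decision(
    status: Optional[int],
    headers: Mapping[str, str],
    fail_open: bool = False,
) -> Verdict:
    """Turn a decision response into allow/block.

    status=None models a timeout or an unreachable decision service.
    """
    # Upstreams may change header casing, so compare case-insensitively.
    normalized = {name.lower(): value for name, value in headers.items()}
    if status is None:
        return Verdict(allow=fail_open, reason="decision_unavailable")
    if 200 <= status < 300:
        return Verdict(allow=True, reason="allowed")
    if status == 429:
        return Verdict(
            allow=False,
            reason=normalized.get(REJECT_REASON_HEADER, "unspecified"),
            retry_after=normalized.get("retry-after"),
        )
    # Any other status is treated like a dependency failure.
    return Verdict(allow=fail_open, reason=f"unexpected_status_{status}")
```

Making `fail_open` an explicit parameter forces the timeout behavior to be a configured decision rather than something that emerges from whichever default the HTTP client happens to have.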

Policy quality matters just as much as integration quality. A valid policy bundle can still behave poorly if descriptor keys are noisy, spoofable, or too broad for tenant isolation goals. The most reliable pattern is to keep selectors narrow at first, define limit_keys that map to business identity rather than transport accidents (for example, a tenant or account identifier rather than a client IP address), and choose algorithms based on failure domain rather than habit. Use request-rate controls to protect immediate capacity, spend-window controls to cap financial exposure, and escalation paths that avoid sudden user-impact cliffs. Avoid policy that on-call engineers cannot explain clearly within a single incident timeline.
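
To make the distinction between request-rate and spend-window controls concrete, here is a small Python sketch under stated assumptions. The claim name `tenant_id`, the key format, and both classes are hypothetical and do not reflect Fairvisor's policy schema. A token bucket stands in for the request-rate control and a fixed window for the spend cap. Time is passed in explicitly so the behavior is deterministic.

```python
from dataclasses import dataclass, field
from typing import Mapping


def limit_key(claims: Mapping[str, str], route_class: str) -> str:
    """Derive the key from verified identity, not from IPs or raw headers."""
    tenant = claims.get("tenant_id")  # hypothetical claim name
    if not tenant:
        return f"anonymous:{route_class}"
    return f"tenant:{tenant}:{route_class}"


@dataclass
class TokenBucket:
    """Request-rate control: protects immediate capacity."""
    rate: float   # tokens refilled per second
    burst: float  # maximum tokens held
    tokens: float = field(default=0.0, init=False)
    updated: float = field(default=0.0, init=False)

    def __post_init__(self) -> None:
        self.tokens = self.burst  # start with a full bucket

    def allow(self, now: float, cost: float = 1.0) -> bool:
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


@dataclass
class SpendWindow:
    """Spend-window control: caps financial exposure per fixed window."""
    cap: float
    window_seconds: float
    window_start: float = field(default=0.0, init=False)
    spent: float = field(default=0.0, init=False)

    def charge(self, now: float, amount: float) -> bool:
        if now - self.window_start >= self.window_seconds:
            self.window_start = now  # start a fresh window
            self.spent = 0.0
        if self.spent + amount > self.cap:
            return False
        self.spent += amount
        return True
```

The two controls fail differently: the bucket recovers within seconds once traffic drops, while an exhausted spend window stays closed until the window rolls over. That difference is what "choose algorithms based on failure domain" means in practice.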

Rollout discipline is where teams either build confidence or accumulate technical debt. Start with constrained scope, validate reason distributions, and measure key cardinality before expanding coverage. If risk is uncertain, use shadow-style observation first so you can evaluate would-reject behavior without blocking traffic. Promotion should require explicit gates: stable gateway health, predictable decision outcomes, and no unresolved unknowns in reject reason clusters. This makes enforcement a controlled release process instead of a one-shot config push.
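
The promotion gates described above can be expressed as a check over shadow-mode observations. The following Python sketch is illustrative only: the record shape (a limit key plus an optional would-reject reason), the threshold parameters, and the function names are assumptions, and gateway health is reduced to a boolean supplied by the caller.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set, Tuple

# (limit_key, would-reject reason or None when the request would pass)
ShadowRecord = Tuple[str, Optional[str]]


@dataclass(frozen=True)
class ShadowSummary:
    total: int
    would_reject_rate: float
    key_cardinality: int
    reasons: Counter


def summarize(records: Iterable[ShadowRecord]) -> ShadowSummary:
    keys: Set[str] = set()
    reasons: Counter = Counter()
    total = 0
    for key, reason in records:
        total += 1
        keys.add(key)
        if reason is not None:
            reasons[reason] += 1
    rejected = sum(reasons.values())
    rate = rejected / total if total else 0.0
    return ShadowSummary(total, rate, len(keys), reasons)


def promotion_blockers(
    summary: ShadowSummary,
    gateway_healthy: bool,
    max_reject_rate: float,
    max_keys: int,
    known_reasons: Set[str],
) -> List[str]:
    """Return the reasons promotion must wait; an empty list means go."""
    blockers: List[str] = []
    if not gateway_healthy:
        blockers.append("gateway health not stable")
    if summary.total == 0:
        blockers.append("no shadow traffic observed")
    if summary.would_reject_rate > max_reject_rate:
        blockers.append("would-reject rate above gate")
    if summary.key_cardinality > max_keys:
        blockers.append("key cardinality above gate")
    unknown = sorted(set(summary.reasons) - set(known_reasons))
    if unknown:
        blockers.append("unknown reject reasons: " + ", ".join(unknown))
    return blockers
```

Returning a list of blockers rather than a single boolean keeps the promotion record reviewable: the output states exactly which gate held the rollout back.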

In day-to-day operations, treat reject traffic as product behavior, not infrastructure noise. Each dominant reason code should map to a known operator action, customer-facing expectation, and communication path. Gateway-to-edge dependency health should be monitored separately from limiter saturation so incident triage can distinguish service reachability from intentional policy control. Runbooks should include fast rollback paths, scoped overrides, and ownership boundaries across platform, security, and product teams. The goal is not only to prevent outages but to preserve trust in the control plane during high-pressure events.
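
One way to encode this in a runbook is a lookup from reason code to response, plus a triage step that checks dependency health before limiter saturation. The Python sketch below is hypothetical throughout: the reason codes, the runbook entries, and the 5% and 10% gates are placeholders rather than values from any real deployment.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class RunbookEntry:
    operator_action: str
    customer_expectation: str
    communication_path: str


# Hypothetical reason codes and responses, for illustration only.
RUNBOOK: Dict[str, RunbookEntry] = {
    "tenant_rate_exceeded": RunbookEntry(
        operator_action="confirm the tenant limit, apply a scoped override if justified",
        customer_expectation="requests succeed again after backoff",
        communication_path="account owner",
    ),
    "spend_cap_reached": RunbookEntry(
        operator_action="verify spend with the product owner before raising the cap",
        customer_expectation="blocked until the window resets or the cap is raised",
        communication_path="billing contact",
    ),
}

UNMAPPED = RunbookEntry(
    operator_action="escalate to the policy owner; reason code has no runbook entry",
    customer_expectation="unknown",
    communication_path="incident channel",
)


def lookup(reason: str) -> RunbookEntry:
    return RUNBOOK.get(reason, UNMAPPED)


def triage(
    total: int,
    decision_errors: int,
    rejects: int,
    error_gate: float = 0.05,
    reject_gate: float = 0.10,
) -> str:
    """Separate reachability problems from intentional policy control."""
    if total <= 0:
        return "no_traffic"
    # Check the dependency first: rejects are hard to trust while it is unhealthy.
    if decision_errors / total > error_gate:
        return "dependency_health"
    if rejects / total > reject_gate:
        return "policy_enforcement"
    return "healthy"
```

Checking errors before rejects reflects the triage order in the paragraph above: while the decision path is unreachable, reject counts say little about whether policy is working as intended.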

Over time, reliability comes from governance, not heroics. Keep bundle versioning monotonic, require reviewable diffs for selector and key changes, and verify rollback artifacts before every major promotion. Revisit these runbooks quarterly with real incident data, not only benchmark snapshots. When teams institutionalize that loop, handling rate-limit reject spikes stops being a one-time implementation task and becomes a durable operating capability.
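
Two of these governance rules, monotonic versions and verified rollback artifacts, are simple enough to automate as a pre-promotion check. The Python sketch below assumes integer bundle versions and a SHA-256 checksum recorded when the rollback artifact was produced. Both are assumptions for illustration, not a description of Fairvisor's bundle format.

```python
import hashlib
from typing import List


def sha256_hex(payload: bytes) -> str:
    return hashlib.sha256(payload).hexdigest()


def promotion_errors(
    current_version: int,
    candidate_version: int,
    rollback_payload: bytes,
    expected_rollback_sha256: str,
) -> List[str]:
    """Return the problems that should stop a promotion; empty means proceed."""
    errors: List[str] = []
    # Monotonic versioning: a candidate must be strictly newer.
    if candidate_version <= current_version:
        errors.append(
            f"version {candidate_version} is not greater than {current_version}"
        )
    # The rollback artifact must match the checksum recorded at build time.
    if sha256_hex(rollback_payload) != expected_rollback_sha256.lower():
        errors.append("rollback artifact checksum mismatch")
    return errors
```

A usage example: `promotion_errors(3, 4, rollback_bytes, recorded_checksum)` returns an empty list only when version 4 is newer than version 3 and the stored rollback bundle is byte-for-byte what was recorded.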

Read the Fairvisor docs before production rollout
