Alert Routing — Tier Mode, Policy Mode, and On-Call
Three separate systems decide who gets paged: routing mode (tier or policy), the on-call schedule, and severity filters. This is what each one does and how they compose.
The three systems
When an incident opens, Reliable answers three questions in order:
- Audience — who is a candidate to be paged? (Decided by your project's routing mode: tier or policy.)
- On-call filter — of those candidates, who is actually on shift right now?
- Severity filter — of those on-shift engineers, who has opted in to alerts at this severity level?
Each project picks one routing mode. The on-call schedule and severity filter then sit on top of whichever mode you chose — they apply to both.
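The three questions can be read as three functions applied in sequence. The sketch below is illustrative only: the `Engineer` record, the function names, and the `low` severity level are assumptions rather than Reliable's actual data model, and it deliberately leaves out the roster-gap fallback and the severity bypass described later.

```python
from dataclasses import dataclass

# Assumed ordering, lowest to highest. "low" is an illustrative extra level.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


@dataclass(frozen=True)
class Engineer:
    name: str
    tier: str
    on_shift: bool
    min_severity: str


def audience(engineers, tier):
    """Stage 1: routing mode picks the candidates (tier mode shown here)."""
    return [e for e in engineers if e.tier == tier]


def on_call_filter(candidates):
    """Stage 2: keep only candidates who are on shift right now."""
    return [e for e in candidates if e.on_shift]


def severity_filter(candidates, severity):
    """Stage 3: keep only candidates who opted in at this severity."""
    rank = SEVERITY_RANK[severity]
    return [e for e in candidates if SEVERITY_RANK[e.min_severity] <= rank]


def recipients(engineers, tier, severity):
    """Compose the three stages in the order the incident passes through them."""
    return severity_filter(on_call_filter(audience(engineers, tier)), severity)


team = [
    Engineer("ana", "tier_1", on_shift=True, min_severity="medium"),
    Engineer("bo", "tier_1", on_shift=False, min_severity="medium"),
    Engineer("cy", "tier_1", on_shift=True, min_severity="high"),
    Engineer("di", "tier_2", on_shift=True, min_severity="medium"),
]
paged_medium = [e.name for e in recipients(team, "tier_1", "medium")]
paged_high = [e.name for e in recipients(team, "tier_1", "high")]
```

Each stage only ever narrows the list it receives, which is why the order matters: an engineer dropped by the on-call filter is never considered by the severity filter.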
Default for new projects
Why two systems? Why not just policy?
The routing mode and the on-call schedule answer different questions, which is why they're separate:
- Routing mode answers "who is the right group of people to handle this kind of incident?" That's a property of the alert (payment errors → payments team, DB errors → infra team).
- On-call schedule answers "of that group, who is awake / on shift right now?" That's a property of time (the payments team has 4 engineers; only one is the 3am on-call).
A hospital is the cleanest parallel. The department (cardiology, ER) is the policy — it decides which specialists could handle a case. The shift roster decides which one of them is in the building right now. You need both: routing without scheduling pages your CTO at 3am for every transient error; scheduling without routing pages whoever happens to be on shift, even if they have nothing to do with the failing service.
Collapsing them into one config would force every policy step to carry its own rotation, multiplying complexity, and would break the severity bypass — "for SEV1, page everyone in the audience regardless of who's on shift" only makes sense when the two layers are independent. Keeping them separate is the same model PagerDuty and Opsgenie use, for the same reason.
Rule of thumb
Tier mode
Tier mode is a straight escalation ladder. Each engineer is tagged with a tier — tier_1, tier_2, tier_3, or lead. When an incident fires:
- Page all tier_1 engineers (step 1).
- If nobody acks within the timeout (default 20 min), escalate to tier_2.
- Then tier_3, then lead.
- Past lead, the engine stops escalating: the incident stays open and unacked until someone manually intervenes.
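The ladder above can be sketched as a fixed list plus a lookup for the next rung. This is a minimal illustration, not the engine's code: `next_tier` and `escalation_plan` are hypothetical names, and the sketch assumes the same ack timeout applies at every rung.

```python
TIER_LADDER = ["tier_1", "tier_2", "tier_3", "lead"]
DEFAULT_ACK_TIMEOUT_MIN = 20  # default timeout stated in this guide


def next_tier(current):
    """Return the tier to escalate to, or None once the ladder is exhausted."""
    idx = TIER_LADDER.index(current)
    return TIER_LADDER[idx + 1] if idx + 1 < len(TIER_LADDER) else None


def escalation_plan(start="tier_1", timeout_min=DEFAULT_ACK_TIMEOUT_MIN):
    """List (minutes_after_open, tier) for an incident that nobody acks."""
    plan, tier, elapsed = [], start, 0
    while tier is not None:
        plan.append((elapsed, tier))
        tier = next_tier(tier)
        elapsed += timeout_min
    return plan


plan = escalation_plan()
```

With the default timeout, an unacked incident reaches `lead` 60 minutes after it opens, and `next_tier("lead")` returns `None`, which corresponds to the point where escalation stops.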
There's no config beyond setting tiers on engineers. Best fit: small teams, one rotation, you want it to just work.
Policy mode
Policy mode replaces the fixed tier ladder with named policies. A policy has ordered steps; each step targets either a tag (e.g. all engineers tagged backend) or a specific engineer. Each step has its own timeout and minimum severity.
One policy per project is marked as the default policy — all incidents use it unless overridden. Steps can also be flagged notify_only (observers — paged in parallel, never block escalation, no ack required) for stakeholders who want awareness without responsibility.
Example: "Payment Service" policy
Step 1 (5 min) → tag: payments-oncall
Step 2 (10 min) → tag: backend-leads
Step 3 (15 min) → engineer: cto@company.com
Step 4 (observer) → tag: managers ← always paged, never blocks
Best fit: multiple rotations, route by service / team, observer notifications, custom escalation timelines per service.
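To make the shape of a policy concrete, here is the "Payment Service" example expressed as data. The `Step` class and its field names are hypothetical, not Reliable's schema. The sketch assumes each step's timeout starts counting when that step is paged, and it omits per-step minimum severity because the example above does not list those values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    target_kind: str                   # "tag" or "engineer"
    target: str
    timeout_min: Optional[int] = None  # ack window before escalating; None for observers
    notify_only: bool = False          # observer: paged in parallel, never blocks


PAYMENT_SERVICE = [
    Step("tag", "payments-oncall", timeout_min=5),
    Step("tag", "backend-leads", timeout_min=10),
    Step("engineer", "cto@company.com", timeout_min=15),
    Step("tag", "managers", notify_only=True),
]


def page_times(steps):
    """Return (minutes_after_open, target) per escalation step if nobody acks."""
    elapsed, timeline = 0, []
    for step in steps:
        if step.notify_only:
            continue  # observers are not part of the escalation chain
        timeline.append((elapsed, step.target))
        elapsed += step.timeout_min
    return timeline


def observers(steps):
    """Targets that are notified for awareness only."""
    return [s.target for s in steps if s.notify_only]


timeline = page_times(PAYMENT_SERVICE)
watchers = observers(PAYMENT_SERVICE)
```

Under these assumptions the CTO is paged 15 minutes after the incident opens (5 + 10), while `managers` are notified without ever holding up the chain.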
The on-call schedule
The on-call schedule is separate from routing mode. It works the same way whether you're in tier or policy mode: after the audience is resolved, the schedule narrows it to whoever is actually on shift right now.
- Schedule disabled (project default): no filtering. The whole audience gets paged regardless of time of day.
- Schedule enabled, someone on shift: only engineers currently on the rotation receive the page.
- Schedule enabled, nobody on shift: everyone in the original audience gets paged — the system never silently drops an alert because of a roster gap.
A bypass threshold lets you say "page everyone regardless of schedule when severity is at least X". Default bypass is critical — SEV1s wake the whole team.
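The three schedule cases and the bypass fit in one small function. This is a sketch under stated assumptions: `apply_schedule` and its parameters are hypothetical names, engineers are represented as plain strings, and the `low` severity level is illustrative.

```python
# Assumed ordering, lowest to highest. "low" is an illustrative extra level.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


def apply_schedule(audience, on_shift, severity, *, enabled=False, bypass_at="critical"):
    """Narrow a resolved audience to whoever is on shift, with both safety valves."""
    if not enabled:
        return list(audience)  # schedule disabled: no filtering
    if SEVERITY_RANK[severity] >= SEVERITY_RANK[bypass_at]:
        return list(audience)  # severity bypass: page everyone
    rostered = [name for name in audience if name in on_shift]
    # Roster gap: fall back to the whole audience instead of dropping the alert.
    return rostered or list(audience)


team = ["ana", "bo", "cy"]
disabled = apply_schedule(team, {"bo"}, "medium")
normal = apply_schedule(team, {"bo"}, "medium", enabled=True)
roster_gap = apply_schedule(team, set(), "medium", enabled=True)
bypassed = apply_schedule(team, {"bo"}, "critical", enabled=True)
```

Only the `normal` case returns a narrowed list (`["bo"]`); the other three return the full team, which is the behaviour the list above describes.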
The severity filter
The last filter is per-engineer (tier mode) or per-step (policy mode):
- Tier mode: each engineer's min_severity preference (defaulting to the project-wide setting, which in turn defaults to medium). Incidents below that severity skip the engineer.
- Policy mode: each step has its own min_severity. Steps whose min_severity is above the incident's severity are skipped (the engine tries the next step in the same call, so no timeout cycle is wasted).
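Both variants of the severity filter reduce to a rank comparison. The sketch below is illustrative: the function names are hypothetical, `None` stands in for "not set", and the `low` level is an assumption.

```python
# Assumed ordering, lowest to highest. "low" is an illustrative extra level.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


def effective_min_severity(engineer_pref=None, project_default=None):
    """Tier mode: engineer preference, else project setting, else medium."""
    return engineer_pref or project_default or "medium"


def engineer_is_paged(incident_severity, engineer_pref=None, project_default=None):
    """Tier mode: incidents below the engineer's threshold skip them."""
    threshold = effective_min_severity(engineer_pref, project_default)
    return SEVERITY_RANK[incident_severity] >= SEVERITY_RANK[threshold]


def first_eligible_step(steps, incident_severity):
    """Policy mode: index of the first step whose min_severity admits the incident.

    `steps` is a list of (name, min_severity) pairs. Ineligible steps are skipped
    immediately in the same call, without waiting out their timeout.
    """
    for index, (_name, min_severity) in enumerate(steps):
        if SEVERITY_RANK[min_severity] <= SEVERITY_RANK[incident_severity]:
            return index
    return None


steps = [("payments-oncall", "high"), ("backend-leads", "medium")]
start_for_medium = first_eligible_step(steps, "medium")
start_for_critical = first_eligible_step(steps, "critical")
```

Here a medium incident starts at the second step because the first requires `high`, whereas a critical incident starts at the first.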
Pre-2026-05: this default was 'high', not 'medium'
Engineers who still have the min_severity = high default may not receive pages for SDK-reported medium errors. The AI severity classifier upgrades severity post-ingest and re-dispatches paging when an incident crosses the high threshold, but the cleanest fix is to drop each engineer's preference to medium in their notification settings.
A walkthrough: error → page
Suppose your frontend fires an error. Here's the trace:
- The SDK sends the error to /api/v1/ingest/errors with severity medium (the SDK's default).
- The backend opens an incident. If your project is in policy mode AND has a default policy set, the incident is attached to that policy. Otherwise it goes through tier mode.
- The audience resolver picks candidates: all tier_1 engineers (tier mode) or the first matching policy step's targets (policy mode).
- If on-call schedules are enabled, only currently-rostered candidates remain (or everyone, if nobody is on shift; see above).
- The severity filter drops anyone whose min_severity is above medium.
- What's left receives the page: full-screen takeover on the mobile app, plus every other enabled channel (Slack, email, webhook).
- Meanwhile, the AI severity classifier runs asynchronously. If it upgrades the incident past high AND nobody was paged yet (because the severity filter ate them), it triggers a second dispatch. That's the safety net for the under-reported case.
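Two decision points in the trace above are easy to get wrong, so here they are as predicates: which routing path an incident takes, and when the classifier triggers a second dispatch. This is a sketch with hypothetical names. It reads "upgrades the incident past high" as "the incident started below high and is now at high or above", which is an interpretation rather than confirmed behaviour.

```python
# Assumed ordering, lowest to highest. "low" is an illustrative extra level.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


def routing_path(mode, default_policy):
    """Policy routing needs both policy mode AND a default policy; else tier mode."""
    if mode == "policy" and default_policy is not None:
        return "policy"
    return "tier"


def needs_second_dispatch(reported, reclassified, anyone_paged, threshold="high"):
    """True when the classifier lifts an unpaged incident across the threshold."""
    crossed = (
        SEVERITY_RANK[reported]
        < SEVERITY_RANK[threshold]
        <= SEVERITY_RANK[reclassified]
    )
    return crossed and not anyone_paged


path_with_default = routing_path("policy", "Payment Service")
path_without_default = routing_path("policy", None)
redispatch = needs_second_dispatch("medium", "high", anyone_paged=False)
```

The second predicate returns `False` whenever someone was already paged, so the safety net never double-pages an incident that got through the first time.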
Common confusions
"Why did nobody get paged?" Most often: severity filter dropped everyone. Check each engineer's min_severity against the incident's reported severity. Run a manual page from On-Call → Page an Engineer to confirm channels work — manual pages bypass all three filters above, so success there isolates the issue to the routing system.
"I'm in policy mode but errors page tier_1 anyway." Likely no default policy is set. If a project is in policy mode but has no default policy, incidents fall through to tier mode (the safe fallback). Mark one policy as Default from Alert Policies.
"Schedule says nobody is on call but I got paged." That's the roster-gap fallback — the system pages everyone in the audience rather than silently dropping the alert. Or your project is using the severity bypass, which pages everyone regardless of schedule above a threshold (default: critical).
"Manual page works but errors don't." Classic symptom of the severity filter. Manual pages skip the audience resolver, on-call filter, and severity filter entirely — they just dispatch the channels for one chosen engineer. If real errors don't page, the gap is upstream in one of those three filters.
When to use what
- Tier mode + schedule off: tiny team, everyone paged on every alert. Zero config.
- Tier mode + schedule on: small team with a shift rotation. Tag your folks tier_1, set up the schedule, you're done.
- Policy mode + schedule on: multiple services / teams, custom escalation per service, observers. Worth the configuration overhead once you have more than one rotation.
- Policy mode + schedule off: rare — you have multiple services but no rotation. Usually a transitional state.
You can switch later