
Alert Routing — Tier Mode, Policy Mode, and On-Call

Three separate systems decide who gets paged: routing mode (tier or policy), the on-call schedule, and severity filters. This is what each one does and how they compose.

The three systems

When an incident opens, Reliable answers three questions in order:

  1. Audience — who is a candidate to be paged? (Decided by your project's routing mode: tier or policy.)
  2. On-call filter — of those candidates, who is actually on shift right now?
  3. Severity filter — of those on-shift engineers, who has opted in to alerts at this severity level?

Each project picks one routing mode. The on-call schedule and severity filter then sit on top of whichever mode you chose — they apply to both.
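As a rough sketch of how the three questions compose, the snippet below runs an audience through a list of filter stages in order. The function name `resolve_recipients`, the engineer names, and the two sets are hypothetical and exist only for illustration; this is not Reliable's actual code, and the real on-call and severity stages have extra rules described further down.

```python
from typing import Callable, Iterable

# A stage takes the current candidate list and returns a narrowed list.
Stage = Callable[[list[str]], list[str]]


def resolve_recipients(audience: list[str], stages: Iterable[Stage]) -> list[str]:
    """Apply each filter stage, in order, to the audience from the routing mode."""
    recipients = list(audience)
    for stage in stages:
        recipients = stage(recipients)
    return recipients


# Hypothetical data, for illustration only.
ON_SHIFT = {"ana", "ben"}
ACCEPTS_THIS_SEVERITY = {"ben", "cy"}


def on_call_filter(candidates: list[str]) -> list[str]:
    # Question 2: who is on shift right now?
    return [e for e in candidates if e in ON_SHIFT]


def severity_filter(candidates: list[str]) -> list[str]:
    # Question 3: who has opted in at this severity?
    return [e for e in candidates if e in ACCEPTS_THIS_SEVERITY]


# Question 1 (routing mode) produced this audience.
audience = ["ana", "ben", "cy"]
paged = resolve_recipients(audience, [on_call_filter, severity_filter])
```

Because the stages are applied in a fixed order after the audience is known, the same two filters work unchanged whether the audience came from tier mode or policy mode.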

Default for new projects

New projects start in tier mode with the on-call schedule disabled. That means zero configuration: add engineers, set their tier, and pages flow. Switch to policy mode (and turn on schedules) when you outgrow the simple ladder.

Why two systems? Why not just policy?

The routing mode and the on-call schedule answer different questions, which is why they're separate:

  • Routing mode answers "who is the right group of people to handle this kind of incident?" That's a property of the alert (payment errors → payments team, DB errors → infra team).
  • On-call schedule answers "of that group, who is awake / on shift right now?" That's a property of time (the payments team has 4 engineers; only one is the 3am on-call).

A hospital is the cleanest parallel. The department (cardiology, ER) is the policy — it decides which specialists could handle a case. The shift roster decides which one of them is in the building right now. You need both: routing without scheduling pages your CTO at 3am for every transient error; scheduling without routing pages whoever happens to be on shift, even if they have nothing to do with the failing service.

Collapsing them into one config would force every policy step to carry its own rotation, multiplying complexity, and would break the severity bypass — "for SEV1, page everyone in the audience regardless of who's on shift" only makes sense when the two layers are independent. Keeping them separate is the same model PagerDuty and Opsgenie use, for the same reason.

Rule of thumb

If you're asking "which TEAM should respond", you're configuring the routing mode (tier or policy). If you're asking "who from that team is reachable RIGHT NOW", you're configuring the on-call schedule. The two never overlap.

Tier mode

Tier mode is a straight escalation ladder. Each engineer is tagged with a tier — tier_1, tier_2, tier_3, or lead. When an incident fires:

  1. Page all tier_1 engineers.
  2. If nobody acks within the timeout (default 20 min), escalate to tier_2.
  3. Then tier_3, then lead.
  4. Past lead, the engine stops escalating — the incident stays open and unacked until someone manually intervenes.

There's no config beyond setting tiers on engineers. Best fit: small teams, one rotation, you want it to just work.
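The ladder above can be sketched in a few lines. The tier names and the 20-minute default come from the text; the function names are hypothetical, and the timeline assumes the same timeout applies at every rung, which the text does not state explicitly.

```python
from typing import Optional

# Tier names as listed above, in escalation order.
TIER_LADDER = ["tier_1", "tier_2", "tier_3", "lead"]
DEFAULT_TIMEOUT_MIN = 20  # default ack timeout mentioned above


def next_tier(current: str) -> Optional[str]:
    """Return the tier to escalate to, or None once 'lead' has been paged."""
    idx = TIER_LADDER.index(current)
    if idx + 1 < len(TIER_LADDER):
        return TIER_LADDER[idx + 1]
    return None  # past lead the engine stops escalating


def escalation_timeline(timeout_min: int = DEFAULT_TIMEOUT_MIN) -> list[tuple[int, str]]:
    """Minutes after the incident opens at which each tier is paged, if nobody acks.

    Assumption: every rung uses the same timeout.
    """
    return [(i * timeout_min, tier) for i, tier in enumerate(TIER_LADDER)]
```

Under those assumptions, an incident that nobody acknowledges reaches lead 60 minutes after it opens.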

Policy mode

Policy mode replaces the fixed tier ladder with named policies. A policy has ordered steps; each step targets either a tag (e.g. all engineers tagged backend) or a specific engineer. Each step has its own timeout and minimum severity.

One policy per project is marked as the default policy — all incidents use it unless overridden. Steps can also be flagged notify_only (observers — paged in parallel, never block escalation, no ack required) for stakeholders who want awareness without responsibility.

Example: "Payment Service" policy

  Step 1 (5 min)   → tag: payments-oncall
  Step 2 (10 min)  → tag: backend-leads
  Step 3 (15 min)  → engineer: cto@company.com
  Step 4 (observer) → tag: managers      ← always paged, never blocks

Best fit: multiple rotations, route by service / team, observer notifications, custom escalation timelines per service.
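One way to picture a policy is as a small data structure. The sketch below models the "Payment Service" example; the class and field names (`Step`, `Policy`, `target_type`, `timeout_min`, `is_default`) are assumptions made for illustration, only `notify_only` and `min_severity` are names taken from the text, and the `medium` default on steps is a guess, not a documented value.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Step:
    target_type: str                    # "tag" or "engineer"
    target: str                         # tag name or engineer identifier
    timeout_min: Optional[int] = None   # None for observer steps (no ack wait)
    min_severity: str = "medium"        # assumed default, for illustration
    notify_only: bool = False           # observer: paged in parallel, never blocks


@dataclass
class Policy:
    name: str
    steps: list[Step] = field(default_factory=list)
    is_default: bool = False

    def escalation_steps(self) -> list[Step]:
        """Ordered steps that take part in escalation."""
        return [s for s in self.steps if not s.notify_only]

    def observer_steps(self) -> list[Step]:
        """Steps that are notified for awareness only."""
        return [s for s in self.steps if s.notify_only]


payment_policy = Policy(
    name="Payment Service",
    is_default=True,
    steps=[
        Step("tag", "payments-oncall", timeout_min=5),
        Step("tag", "backend-leads", timeout_min=10),
        Step("engineer", "cto@company.com", timeout_min=15),
        Step("tag", "managers", notify_only=True),
    ],
)
```

Keeping observers in the same ordered list but filtering them out of escalation mirrors the "always paged, never blocks" behavior in the example.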

The on-call schedule

The on-call schedule is separate from routing mode. It works the same way whether you're in tier or policy mode: after the audience is resolved, the schedule narrows it to whoever is actually on shift right now.

  • Schedule disabled (project default): no filtering. The whole audience gets paged regardless of time of day.
  • Schedule enabled, someone on shift: only engineers currently on the rotation receive the page.
  • Schedule enabled, nobody on shift: everyone in the original audience gets paged — the system never silently drops an alert because of a roster gap.

A bypass threshold lets you say "page everyone regardless of schedule when severity is at least X". Default bypass is critical — SEV1s wake the whole team.
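The three schedule cases and the bypass can be sketched as a single function. The function name, its parameters, and the severity ordering (low < medium < high < critical) are assumptions for illustration; the text only names medium, high, and critical.

```python
# Assumed severity ordering; "low" is a guess added for completeness.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


def apply_on_call_filter(
    audience: list[str],
    on_shift: set[str],
    severity: str,
    schedule_enabled: bool = False,   # project default: schedule disabled
    bypass_at: str = "critical",      # default bypass threshold
) -> list[str]:
    """Narrow the audience to whoever is on shift, with the fallbacks described above."""
    if not schedule_enabled:
        return list(audience)  # no filtering at all
    if SEVERITY_RANK[severity] >= SEVERITY_RANK[bypass_at]:
        return list(audience)  # severity bypass: page everyone
    rostered = [e for e in audience if e in on_shift]
    # Roster gap: never drop the alert, fall back to the whole audience.
    return rostered or list(audience)
```

For example, with the schedule enabled and only one engineer on shift, a medium incident pages that engineer alone, while a critical incident pages the whole audience.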

The severity filter

The last filter is per-engineer (tier mode) or per-step (policy mode):

  • Tier mode: each engineer has a min_severity preference. If it is unset, it falls back to the project-wide setting, which itself defaults to medium. Incidents below that severity skip the engineer.
  • Policy mode: each step has its own min_severity. A step is skipped when the incident's severity is below that step's threshold (the engine tries the next step in the same call, so no timeout cycle is wasted).
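Both variants of the severity filter reduce to a rank comparison. The sketch below assumes an ordering of low < medium < high < critical and uses hypothetical function names; the fallback chain (engineer preference, then project setting, then medium) follows the tier-mode description above.

```python
from typing import Optional

# Assumed severity ordering; "low" is a guess added for completeness.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


def effective_min_severity(
    engineer_pref: Optional[str], project_default: Optional[str]
) -> str:
    """Tier mode: engineer preference, else project setting, else medium."""
    return engineer_pref or project_default or "medium"


def engineer_accepts(
    incident_severity: str,
    engineer_pref: Optional[str] = None,
    project_default: Optional[str] = None,
) -> bool:
    """True if the incident is at or above the engineer's effective threshold."""
    threshold = effective_min_severity(engineer_pref, project_default)
    return SEVERITY_RANK[incident_severity] >= SEVERITY_RANK[threshold]


def first_eligible_step(
    step_min_severities: list[str], incident_severity: str, start: int = 0
) -> Optional[int]:
    """Policy mode: index of the first step, from `start`, that the incident qualifies for.

    Steps whose threshold is above the incident's severity are skipped
    immediately, without waiting out a timeout.
    """
    for i in range(start, len(step_min_severities)):
        if SEVERITY_RANK[incident_severity] >= SEVERITY_RANK[step_min_severities[i]]:
            return i
    return None  # no step qualifies
```

A medium incident against steps with thresholds high, medium, critical would go straight to the second step under this sketch.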

Pre-2026-05: this default was 'high', not 'medium'

In older projects, engineers still on the original min_severity = high default may not receive pages for SDK-reported medium errors. The AI severity classifier upgrades severity post-ingest and re-dispatches paging when an incident crosses the high threshold, but the cleanest fix is to drop each engineer's preference to medium in their notification settings.

A walkthrough: error → page

Suppose your frontend fires an error. Here's the trace:

  1. The SDK sends the error to /api/v1/ingest/errors with severity medium (the SDK's default).
  2. The backend opens an incident. If your project is in policy mode AND has a default policy set, the incident is attached to that policy. Otherwise it goes through tier mode.
  3. The audience resolver picks candidates: all tier_1 engineers (tier mode) or the first matching policy step's targets (policy mode).
  4. If on-call schedules are enabled, only currently-rostered candidates remain (or everyone, if nobody is on shift — see above).
  5. The severity filter drops anyone whose min_severity is above medium.
  6. What's left receives the page — full-screen takeover on the mobile app, plus every other enabled channel (Slack, email, webhook).
  7. Meanwhile, the AI severity classifier runs asynchronously. If it upgrades the incident past high AND nobody was paged yet (because the severity filter ate them), it triggers a second dispatch — that's the safety net for the under-reported case.
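Two decisions in this trace are easy to get wrong, so here is a minimal sketch of each: the mode fallback in step 2 and the safety-net re-dispatch in step 7. The function names and the severity ordering are assumptions, and "past high" is interpreted here as "upgraded to high or above", which is a reading of the text rather than a confirmed rule.

```python
from typing import Optional

# Assumed severity ordering; "low" is a guess added for completeness.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


def effective_routing_mode(project_mode: str, default_policy: Optional[str]) -> str:
    """Step 2: policy mode only applies when a default policy is set."""
    if project_mode == "policy" and default_policy is not None:
        return "policy"
    return "tier"  # fallback, including policy mode with no default policy


def should_redispatch(reported: str, classified: str, anyone_paged: bool) -> bool:
    """Step 7: dispatch again only if severity was raised to high or above
    and the first dispatch paged nobody."""
    upgraded = SEVERITY_RANK[classified] > SEVERITY_RANK[reported]
    reaches_high = SEVERITY_RANK[classified] >= SEVERITY_RANK["high"]
    return upgraded and reaches_high and not anyone_paged
```

Under this reading, a medium error that the classifier raises to high triggers a second dispatch only when the severity filter left nobody to page the first time.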

Common confusions

"Why did nobody get paged?" Most often: severity filter dropped everyone. Check each engineer's min_severity against the incident's reported severity. Run a manual page from On-Call → Page an Engineer to confirm channels work — manual pages bypass all three filters above, so success there isolates the issue to the routing system.

"I'm in policy mode but errors page tier_1 anyway." Likely no default policy is set. If a project is in policy mode but has no default policy, incidents fall through to tier mode (the safe fallback). Mark one policy as Default from Alert Policies.

"Schedule says nobody is on call but I got paged." That's the roster-gap fallback: the system pages everyone in the audience rather than silently dropping the alert. Alternatively, the severity bypass fired, which pages everyone regardless of schedule once severity reaches a threshold (default: critical).

"Manual page works but errors don't." Classic symptom of the severity filter. Manual pages skip the audience resolver, on-call filter, and severity filter entirely — they just dispatch the channels for one chosen engineer. If real errors don't page, the gap is upstream in one of those three filters.

When to use what

  • Tier mode + schedule off: tiny team, everyone paged on every alert. Zero config.
  • Tier mode + schedule on: small team with a shift rotation. Tag your folks tier_1, set up the schedule, you're done.
  • Policy mode + schedule on: multiple services / teams, custom escalation per service, observers. Worth the configuration overhead once you have more than one rotation.
  • Policy mode + schedule off: rare — you have multiple services but no rotation. Usually a transitional state.

You can switch later

Routing mode is one project setting (Settings → Routing Mode). Switching from tier to policy doesn't lose data — your engineers, tiers, and tags stay; you just gain access to the policy editor. Switching back works the same way.