Chapter 10
Operability
Building systems that run well. SLOs, alerts, deploys, observability.
Build it so you can run it.
Why Operability Matters
You don't just ship code. You ship a system that will run in production, probably at 2 AM, possibly while you're on vacation.
Operability is the quality of a system that makes it easy to run, monitor, and fix when things go wrong. It's not an afterthought. It's part of the design.
Operability means:
- You know when something is wrong before users tell you
- You can understand what's happening inside the system
- You can deploy changes safely
- You can recover quickly when things break
- You're not dependent on heroics to keep things running
If you can't operate it, you didn't really ship it.
Service Level Objectives (SLOs)
An SLO is a target for how well your service should perform.[1]
The Basics
SLI (Service Level Indicator): What you measure. Request latency, error rate, availability.
SLO (Service Level Objective): The target for that measurement. "99.9% of requests complete in under 200ms."
Error Budget: How much failure is acceptable. If your SLO is 99.9%, your error budget is 0.1%.
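To make the arithmetic concrete, here is a minimal sketch (with illustrative numbers) of what a 99.9% availability SLO allows over a 30-day window, both as failed requests and as downtime.

```python
# Illustrative arithmetic: what a 99.9% availability SLO allows over a
# 30-day window. The numbers are examples, not prescriptions.

slo = 0.999                      # 99.9% of requests succeed
error_budget = 1 - slo           # 0.1% of requests may fail

window_days = 30
window_minutes = window_days * 24 * 60
allowed_downtime_minutes = window_minutes * error_budget

monthly_requests = 100_000_000   # hypothetical traffic volume
allowed_failed_requests = monthly_requests * error_budget

print(f"Error budget: {error_budget:.3%} of requests")
print(f"Allowed downtime per {window_days} days: {allowed_downtime_minutes:.0f} minutes")
print(f"Allowed failures at {monthly_requests:,} requests: {allowed_failed_requests:,.0f}")
```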
Why SLOs Matter
Without SLOs, you're flying blind. Every incident feels equally urgent. Every performance regression is a crisis.
With SLOs:
- You know when you're at risk versus when you're fine
- You can make rational tradeoffs about reliability vs. features
- You have a shared language with stakeholders about "good enough"
Setting SLOs
Start with user expectations. What does the user actually need? Not "as fast as possible" but "fast enough that they don't notice."
Be realistic. 100% is not achievable, and even 99.99% is extremely hard. Start with a target you can actually meet.
Measure what matters. Latency at the 50th percentile is less useful than the 99th. The median user is fine; it's the tail that hurts.
Common SLOs:
| Type | What It Measures | Example |
|---|---|---|
| Availability | Is it up? | 99.9% of requests succeed |
| Latency | Is it fast? | 95% of requests < 200ms, 99% < 500ms |
| Throughput | Can it handle load? | Sustain 10k requests/second |
| Error rate | Is it correct? | < 0.1% error rate |
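As a rough illustration, the sketch below checks measured latencies against the example latency SLO from the table (95% < 200ms, 99% < 500ms) using a nearest-rank percentile. The sample data is made up; in practice the values come from your metrics system.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile, pct in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]

latencies_ms = [120, 95, 180, 210, 160, 450, 140, 175, 130, 620]  # placeholder samples

slo_targets = {95: 200, 99: 500}  # percentile -> max latency in ms
for pct, target in slo_targets.items():
    observed = percentile(latencies_ms, pct)
    status = "OK" if observed <= target else "VIOLATION"
    print(f"p{pct}: {observed}ms (target {target}ms) -> {status}")
```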
Using Error Budgets
If you're within budget, you can take risks. Ship features. Experiment.
If you're burning budget, slow down. Focus on reliability. Pay down operational debt.
The error budget makes tradeoffs explicit. "We can't ship this feature because we've burned 80% of our error budget this month" is a clear, defensible statement.
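One simple way to make that statement measurable is a burn rate: the observed failure rate divided by the rate the SLO allows. The sketch below uses hypothetical numbers; real ones come from your metrics store.

```python
# A sketch of error-budget burn-rate accounting for a request-based SLO.

slo = 0.999
allowed_failure_rate = 1 - slo            # 0.1% of requests may fail

total_requests = 52_000_000               # requests so far this month (hypothetical)
failed_requests = 78_000                  # failures so far this month (hypothetical)
observed_failure_rate = failed_requests / total_requests

# Burn rate: how fast the budget is being spent relative to the SLO.
# 1.0 means the budget lasts exactly the whole window; 2.0 means it is
# gone halfway through; below 1.0 means there is headroom.
burn_rate = observed_failure_rate / allowed_failure_rate

if burn_rate >= 1:
    print(f"Burn rate {burn_rate:.2f}: slow down, prioritize reliability work")
else:
    print(f"Burn rate {burn_rate:.2f}: within budget, safe to take risks")
```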
Alerting
Alerts tell you when something is wrong. Bad alerts tell you constantly, about nothing.
Good Alert Characteristics
Actionable. If the alert fires, there's something a human needs to do. If there's nothing to do, it shouldn't be an alert.
Specific. The alert tells you what's wrong and where to start looking. "Error rate elevated" is better than "something is wrong."
Urgent. It requires attention now. If it can wait until morning, it's not an alert; it's a ticket.
Rare. Alert fatigue is real. If you're woken up every night, you'll start ignoring pages.
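A sketch of the kind of condition that satisfies these properties: a specific, user-facing symptom that must persist across several samples before paging, so a brief blip doesn't wake anyone. The threshold, window, and service name are illustrative.

```python
from collections import deque

WINDOW_SAMPLES = 5           # e.g. five one-minute samples
ERROR_RATE_THRESHOLD = 0.02  # page only if >2% of requests fail...
# ...for the entire window, not just a single noisy sample.

recent_error_rates = deque(maxlen=WINDOW_SAMPLES)

def record_sample(error_rate: float) -> bool:
    """Record one sample; return True if the alert should fire."""
    recent_error_rates.append(error_rate)
    window_full = len(recent_error_rates) == WINDOW_SAMPLES
    return window_full and all(r > ERROR_RATE_THRESHOLD for r in recent_error_rates)

# One spike does not page; a sustained problem does.
for rate in [0.001, 0.09, 0.03, 0.04, 0.05, 0.06]:
    if record_sample(rate):
        print(f"PAGE: checkout 5xx rate above 2% for {WINDOW_SAMPLES} minutes")
```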
The Alert Checklist
Before adding an alert, answer:
- [ ] What specific condition triggers this?
- [ ] What's the user impact?
- [ ] What should the on-call engineer do when this fires?
- [ ] Is there a runbook? (There should be. See Runbook template.)
- [ ] Is this urgent enough to wake someone up?
- [ ] Will this alert fire frequently enough to cause fatigue?
Paging vs. Ticketing
Page (wake someone up) when:
- Users are currently affected
- The problem is getting worse
- Intervention is required to prevent data loss or extended outage
Create a ticket (handle during business hours) when:
- The system is degraded but functional
- There's no immediate user impact
- It can wait until someone is available
Don't alert at all when:
- It's informational
- It resolves on its own
- There's nothing a human can do
Alert Hygiene
Review alerts regularly. If an alert never fires, is it needed? If it fires constantly, is it tuned correctly?
Tune thresholds. Too sensitive = noise. Too lax = missed incidents.
Delete noisy alerts. An ignored alert is worse than no alert. It trains people to ignore pages.
Update runbooks. When you learn something new from an alert, update the runbook.
Observability
Observability is how well you can understand what's happening inside your system from its outputs.[2]
The Three Pillars
Logs: What happened. Discrete events with context.
Metrics: How much is happening. Aggregated measurements over time.
Traces: How requests flow through the system. The path from start to finish.
Each serves a different purpose:
| Question | Use |
|---|---|
| "Did this happen?" | Logs |
| "How often does this happen?" | Metrics |
| "Where is time spent in this request?" | Traces |
Good Logging
Log at the right level:
- ERROR: Something broke
- WARN: Something concerning that should be investigated
- INFO: Significant business events
- DEBUG: Details for troubleshooting (off in production by default)
Include context: Request ID, user ID, relevant parameters. You'll need this when debugging.
Structure your logs: JSON or structured format. Makes searching and aggregating possible.
Don't log secrets: No passwords, tokens, PII in logs. Ever.
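A minimal sketch of structured logging with the Python standard library: JSON lines, a level, and request context passed explicitly, with secrets kept out. The field names (request_id, user_id, order_id) are conventions assumed for the example, not requirements.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            "level": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
        }
        # Context passed via `extra=` becomes attributes on the log record.
        for key in ("request_id", "user_id", "order_id"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order placed", extra={"request_id": "req-123", "user_id": "u-42", "order_id": "o-7"})
# Never: logger.info("login", extra={"password": ...})  # secrets stay out of logs
```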
Good Metrics
Focus on the four golden signals:[3]
- Latency: How long things take
- Traffic: How much demand exists
- Errors: How often things fail
- Saturation: How full your resources are
Label appropriately: Endpoint, status code, service, region. Enough to slice the data meaningfully.
Watch cardinality: Too many unique label values = expensive and slow.
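A sketch of the four signals instrumented with the prometheus_client package (assumed here); metric names, labels, and buckets are illustrative, and the labels stay low-cardinality: endpoint and status code, never user or request IDs.

```python
import time
from prometheus_client import Counter, Histogram, Gauge, start_http_server

REQUESTS = Counter(
    "http_requests_total", "Total requests (traffic)",
    ["endpoint", "status"],
)
LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency",
    ["endpoint"],
    buckets=[0.05, 0.1, 0.2, 0.5, 1, 2, 5],
)
IN_FLIGHT = Gauge("http_requests_in_flight", "Concurrent requests (saturation)")

def handle_request(endpoint: str):
    IN_FLIGHT.inc()
    start = time.monotonic()
    status = "200"
    try:
        ...  # real handler work goes here
    except Exception:
        status = "500"  # errors signal
        raise
    finally:
        LATENCY.labels(endpoint=endpoint).observe(time.monotonic() - start)
        REQUESTS.labels(endpoint=endpoint, status=status).inc()
        IN_FLIGHT.dec()

start_http_server(9100)  # exposes /metrics for scraping
```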
Good Tracing
Trace at boundaries: Service-to-service calls, database queries, external API calls.
Propagate context: Pass trace IDs across service boundaries so you can follow a request end-to-end.
Sample intelligently: You don't need every trace. Sample a percentage, but keep all traces that represent errors or high latency.
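A library-free sketch of both ideas: reuse or mint a trace ID, pass it downstream in a header, and keep every error or slow trace while sampling the healthy majority. The header name and rates are assumptions; in practice a tracing library such as OpenTelemetry handles propagation for you.

```python
import random
import uuid

TRACE_HEADER = "x-trace-id"  # illustrative header name

def incoming_trace_id(headers: dict) -> str:
    # Reuse the caller's trace ID if present, otherwise start a new trace.
    return headers.get(TRACE_HEADER) or uuid.uuid4().hex

def outgoing_headers(trace_id: str) -> dict:
    # Attach the same trace ID to downstream calls so spans join up end-to-end.
    return {TRACE_HEADER: trace_id}

def should_keep(trace_duration_ms: float, had_error: bool, sample_rate: float = 0.05) -> bool:
    # Always keep errors and slow requests; sample a fraction of the rest.
    if had_error or trace_duration_ms > 1000:
        return True
    return random.random() < sample_rate
```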
Safe Deployments
Deploying is the most common way to break production. Do it safely.
Before You Deploy
The checklist:
- [ ] Tests pass (unit, integration, E2E as appropriate)
- [ ] Code reviewed and approved
- [ ] Migrations are safe (reversible or tested at scale)
- [ ] Feature flags in place for risky changes
- [ ] Rollback plan documented
- [ ] On-call aware
- [ ] Monitoring ready to detect problems
Deployment Strategies
Rolling deploy: Replace instances gradually. If something's wrong, stop the rollout.
Blue-green: Run two environments. Switch traffic from old to new. Easy rollback.
Canary: Send a small percentage of traffic to the new version. Watch for problems. Expand gradually.
Feature flags: Deploy the code, but keep the feature off. Turn it on gradually. Turn it off instantly if problems occur.
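The routing decision behind a canary (or a percentage-based flag rollout) can be as simple as hashing a stable ID into 100 buckets so each user consistently sees the same version. A sketch, with hypothetical IDs and percentages:

```python
import hashlib

def bucket(user_id: str) -> int:
    """Map a stable ID to a bucket in [0, 100) deterministically."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % 100

def routes_to_canary(user_id: str, canary_percent: int) -> bool:
    """True if this user should be served by the new version."""
    return bucket(user_id) < canary_percent

# Start small, watch error rate and latency, then widen: 1% -> 5% -> 25% -> 100%.
print(routes_to_canary("user-8412", canary_percent=5))
```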
Migration Safety
Migrations are often irreversible. Treat them with care.
Before migrating:
- [ ] Migration is backwards-compatible with current code
- [ ] Migration has been tested on production-like data
- [ ] Rollback strategy is documented (even if it's "restore from backup")
- [ ] Estimated runtime is acceptable for production
A pattern for safe migrations, step by step:
1. Add the new column or table (backwards-compatible)
2. Deploy code that writes to both old and new
3. Backfill the new column or table
4. Deploy code that reads from the new location
5. Remove the old column or table (now safe)
This multi-step approach keeps each deploy reversible.
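To make steps 2 and 4 concrete, here is a sketch for a hypothetical move from a single full_name column to first_name/last_name columns, assuming a DB-API-style connection with %s placeholders; the table and column names are made up for illustration.

```python
def save_user(db, user_id, first_name, last_name):
    # Step 2: write to both the old and new locations, so older code paths
    # (and a rollback) still see consistent data.
    cur = db.cursor()
    cur.execute(
        "UPDATE users SET full_name = %s, first_name = %s, last_name = %s WHERE id = %s",
        (f"{first_name} {last_name}", first_name, last_name, user_id),
    )
    db.commit()

def load_user_name(db, user_id):
    # Step 4: read from the new columns, falling back to the old one for
    # rows the backfill has not reached yet.
    cur = db.cursor()
    cur.execute(
        "SELECT first_name, last_name, full_name FROM users WHERE id = %s",
        (user_id,),
    )
    first, last, full = cur.fetchone()
    return f"{first} {last}" if first is not None else full
```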
Backfills
Backfilling data at scale can be dangerous.
Make backfills:
- Idempotent: Running twice produces the same result
- Throttled: Don't overwhelm the database
- Resumable: Can pick up where you left off if interrupted
- Monitored: You know how far along you are
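A sketch of a backfill loop with all four properties, continuing the hypothetical full_name split from above; the table, batch size, and pacing are illustrative, and `db` is again a DB-API-style connection.

```python
import time

BATCH_SIZE = 1000
PAUSE_SECONDS = 0.5   # throttle between batches

def backfill(db, start_after_id=0):
    last_id = start_after_id          # resumable: pass in the last checkpoint
    while True:
        cur = db.cursor()
        # Idempotent: only touch rows that still need the new columns filled.
        cur.execute(
            "SELECT id, full_name FROM users "
            "WHERE id > %s AND first_name IS NULL ORDER BY id LIMIT %s",
            (last_id, BATCH_SIZE),
        )
        rows = cur.fetchall()
        if not rows:
            break
        for row_id, full_name in rows:
            first, _, last = full_name.partition(" ")
            cur.execute(
                "UPDATE users SET first_name = %s, last_name = %s WHERE id = %s",
                (first, last, row_id),
            )
        db.commit()
        last_id = rows[-1][0]
        print(f"backfilled through id {last_id}")   # monitored: progress is visible
        time.sleep(PAUSE_SECONDS)                   # throttled: don't overwhelm the DB
    return last_id
```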
Feature Flags
Feature flags are your safety net.
Use flags for:
- New features being tested
- Risky changes that might need quick rollback
- Gradual rollouts to a subset of users
- A/B testing
Flag hygiene:
- Remove flags once the feature is stable
- Document what each flag does
- Have a kill switch (flag that turns off everything new)
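A minimal in-process sketch of a flag check with a kill switch. Real systems usually read flags from a flag service or config store; the dictionary and flag names here are stand-ins.

```python
FLAGS = {
    "kill_switch_all_new_features": False,   # flip to disable everything gated below
    "new_checkout_flow": {"enabled": True, "rollout_percent": 25},
}

def flag_enabled(name: str, user_bucket: int) -> bool:
    """user_bucket is a stable 0-99 bucket, e.g. from the hashing sketch above."""
    if FLAGS.get("kill_switch_all_new_features"):
        return False
    flag = FLAGS.get(name)
    if not flag or not flag.get("enabled"):
        return False
    return user_bucket < flag.get("rollout_percent", 100)

if flag_enabled("new_checkout_flow", user_bucket=17):
    pass  # new code path
else:
    pass  # old, known-good code path
```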
On-Call
Someone has to be responsible when production breaks. Do it sustainably.
On-Call Hygiene
Runbooks exist and are current. You shouldn't need tribal knowledge to respond.
Escalation paths are clear. Who do you call when you're stuck?
Handoffs are real. Brief the next person. Don't just throw them the pager.
Load is sustainable. If you're woken up every night, something is broken.
Follow-up happens. If an alert fires, there should be action to prevent it from recurring.
The On-Call Rotation
Rotate fairly. Everyone takes a turn. Seniority doesn't exempt you.
Compensate appropriately. On-call is work. Treat it that way.
Protected time after incidents. If you were up until 3 AM, you're not at your desk at 9 AM.
Post-incident review. Every significant incident gets a post-mortem. Learn and improve.
Operational Excellence Checklist
Before shipping a new service (or auditing an existing one):
Monitoring
- [ ] Key metrics are tracked (latency, errors, saturation)
- [ ] Dashboards exist and are useful
- [ ] SLOs are defined
- [ ] Alerts are configured and tested
- [ ] Runbooks exist for each alert
Deployment
- [ ] Deployment is automated
- [ ] Rollback is tested and documented
- [ ] Canary or staged rollout is possible
- [ ] Feature flags are in place for risky changes
Reliability
- [ ] Single points of failure are identified and mitigated
- [ ] Graceful degradation is implemented where possible
- [ ] Rate limiting protects against overload
- [ ] Circuit breakers prevent cascade failures
Documentation
- [ ] Architecture is documented
- [ ] Dependencies are documented
- [ ] On-call runbook exists
- [ ] Incident response process is documented
Anti-Patterns
"We'll Add Monitoring Later"
You won't. And when the system breaks, you'll wish you had.
Fix: Monitoring is part of "done." No deploy without observability.
Alert on Everything
If everything alerts, nothing alerts. Alert fatigue leads to ignored pages, and ignored pages lead to missed incidents.
Fix: Every alert must be actionable and urgent. Audit and prune regularly.
Manual Deploys
If deploys require special knowledge or manual steps, they'll go wrong. And the one person who knows how is on vacation.
Fix: Automate deployments completely. Anyone on the team can deploy.
Hero Operations
One person who knows all the systems and saves every incident. Single point of failure. Burnout waiting to happen.
Fix: Document everything. Rotate on-call. Cross-train deliberately.
Build it so you can run it. Monitor so you know when it's broken. Deploy so you can fix it fast. Operability is part of the craft.
[1] SLOs were formalized by Google's Site Reliability Engineering practice. See Site Reliability Engineering (2016) and The Site Reliability Workbook (2018) for comprehensive treatment.
[2] The term "observability" was introduced to software from control theory by Charity Majors and others at Honeycomb. See Observability Engineering (2022) by Majors, Fong-Jones, and Miranda.
[3] The four golden signals are from Google's Site Reliability Engineering book (2016), Chapter 6: "Monitoring Distributed Systems."