Appendix

Templates

Copy. Paste. Adapt. Ship.

These templates encode best practices into reusable formats. They're intentionally minimal. Add what you need; remove what you don't.


Pull Request Description

## Summary

[1-2 sentences: What does this PR do and why?]

## Changes

- [Bullet list of key changes]

## Testing

- [ ] Unit tests pass
- [ ] Integration tests pass
- [ ] Manual testing completed
- [ ] [Other relevant checks]

## Notes for Reviewers

[Anything reviewers should know: areas of uncertainty,
specific feedback requested, context that's not obvious]

## Related

- Closes #[issue]
- Depends on #[PR] (if applicable)
- ADR: [link] (if applicable)

Architecture Decision Record (ADR)

# ADR-[number]: [Title]

**Status:** [Proposed | Accepted | Deprecated | Superseded by ADR-X]
**Date:** [YYYY-MM-DD]
**Deciders:** [List of people involved]

## Context

[What is the situation? What problem are we solving?
What constraints exist? 2-4 sentences.]

## Decision

[What did we decide? State it clearly in 1-2 sentences.]

## Options Considered

### Option A: [Name]
- **Optimizes for:** [What this does well]
- **Makes harder:** [What this costs]
- **Change cost:** [Low/Medium/High]

### Option B: [Name]
- **Optimizes for:** [What this does well]
- **Makes harder:** [What this costs]
- **Change cost:** [Low/Medium/High]

[Add more options as needed]

## Consequences

**What we gain:**
- [Benefit 1]
- [Benefit 2]

**What we give up:**
- [Tradeoff 1]
- [Tradeoff 2]

**What this means for future work:**
- [Constraint or implication]

## References

- [Links to related docs, research, prior art]

Incident Update

Use this template for communicating during an active incident. Post every 30 minutes (or at significant changes).

## Incident Update — [Timestamp]

**Status:** [Investigating | Identified | Monitoring | Resolved]
**Severity:** [P1/P2/P3]
**Duration:** [X minutes/hours since first alert]

### Current State
[1-2 sentences: What is happening right now? What's the user impact?]

### What We Know
- [Fact 1]
- [Fact 2]

### What We're Doing
- [Current action 1]
- [Current action 2]

### Next Update
[Time of next update, or "when status changes"]

---
Incident Commander: [Name]
Technical Lead: [Name]

Postmortem

# Postmortem: [Incident Title]

**Date of Incident:** [YYYY-MM-DD]
**Duration:** [Start time - End time, total duration]
**Severity:** [P1/P2/P3]
**Author:** [Name]
**Status:** [Draft | In Review | Complete]

## Summary

[2-3 sentences: What happened? What was the impact?]

## Timeline

All times in [timezone].

| Time | Event |
|------|-------|
| HH:MM | [First alert / symptom] |
| HH:MM | [Key event] |
| HH:MM | [Mitigation applied] |
| HH:MM | [Resolved] |

## Impact

- **Users affected:** [Number or percentage]
- **Duration of impact:** [Time]
- **Business impact:** [Revenue, reputation, SLA, etc.]

## Root Cause

[What actually caused the incident? Be specific.
This is not "human error" — dig deeper.]

## Contributing Factors

- [Factor 1: What made this possible or worse?]
- [Factor 2]

## What Went Well

- [Thing that worked 1]
- [Thing that worked 2]

## What Went Poorly

- [Thing that didn't work 1]
- [Thing that didn't work 2]

## Action Items

| Action | Owner | Priority | Due Date | Status |
|--------|-------|----------|----------|--------|
| [Specific action] | [Name] | [P1/P2/P3] | [Date] | [Open/Done] |

## Lessons Learned

[What did we learn that applies beyond this incident?]

## References

- [Link to incident channel]
- [Link to relevant dashboards]
- [Link to related ADRs or docs]

Runbook

# Runbook: [Service/Process Name]

**Last Updated:** [Date]
**Owner:** [Team/Person]

## Overview

[1-2 sentences: What is this service/process? Why does it exist?]

## Common Symptoms

| Symptom | Likely Cause | Quick Action |
|---------|--------------|--------------|
| [What you see] | [Why it happens] | [What to try first] |

## Diagnostic Steps

### 1. Check Service Health
```bash
[command to check health]
```
Expected output: [what healthy looks like]

### 2. Check Logs
```bash
[command to check logs]
```
Look for: [what to look for]

### 3. Check Dependencies
- [Dependency 1]: [how to check]
- [Dependency 2]: [how to check]

## Mitigation Steps

### Restart Service
```bash
[restart command]
```
⚠️ **Safe?** [Yes / Yes with caveats / No — explain]

### Scale Up
```bash
[scale command]
```

### Rollback

⚠️ **Before rolling back, verify:**
- [ ] No irreversible migrations ran
- [ ] No side effects that can't be undone (payments, emails, etc.)
- [ ] Rollback target is known-good

```bash
[rollback command]
```

### Failover
[Steps for failover if applicable]

## Escalation

| Level | Contact | When |
|-------|---------|------|
| L1 | [Team/person] | [First response] |
| L2 | [Team/person] | [If L1 can't resolve in X minutes] |
| L3 | [Team/person] | [If major impact or data loss risk] |

## Related

- Dashboard: [link]
- Logs: [link]
- Alerts: [link]
- Architecture docs: [link]

Technical Debt Register Entry

## Debt: [Short Title]

**ID:** DEBT-[number]
**Created:** [Date]
**Owner:** [Team/Person]
**Status:** [Active | Scheduled | Paid | Accepted]

### Description

[What is the debt? Be specific. 2-4 sentences.]

### Type

- [ ] Deliberate (we chose this tradeoff knowingly)
- [ ] Accidental (we discovered it later)
- [ ] Bit rot (accumulated over time)
- [ ] Environmental (tooling, dependencies, etc.)

### Impact

**Current cost:**
- [How it hurts us now: slower development, operational burden, etc.]

**Risk if unpaid:**
- [What gets worse over time or what could go wrong]

### Payback Trigger

When should we pay this debt? (Check all that apply)

- [ ] Before we can [specific feature/capability]
- [ ] When [metric] reaches [threshold]
- [ ] If [risk event] occurs
- [ ] During [scheduled maintenance window]
- [ ] When touching [related area]

### Estimated Effort

[T-shirt size: S/M/L/XL, or story points, or time range]

### Payback Plan

[Brief description of how to fix it, or link to design doc]

### References

- [Link to code/area affected]
- [Link to related ADR if deliberate]
- [Link to incidents caused by this debt]

Checklist: Safe Deploy

Use before deploying significant changes.

## Pre-Deploy Checklist

### Code Ready
- [ ] PR approved
- [ ] Tests pass (unit, integration, E2E)
- [ ] No known regressions

### Migration Safety
- [ ] No irreversible migrations (or rollback plan documented)
- [ ] Backfills tested at scale (if applicable)
- [ ] Data migration is idempotent

### Observability
- [ ] Key metrics identified
- [ ] Alerts in place
- [ ] Dashboard ready for monitoring

### Rollback Ready
- [ ] Previous version is known-good
- [ ] Rollback command documented
- [ ] Rollback is safe (no migration conflicts)

### Feature Flags (if applicable)
- [ ] New code behind flag
- [ ] Flag defaults to off
- [ ] Kill switch tested

### Communication
- [ ] Team notified
- [ ] On-call aware
- [ ] Stakeholders informed (if user-facing)

## During Deploy
- [ ] Watch metrics during rollout
- [ ] Watch error rates
- [ ] Watch latency

## Post-Deploy
- [ ] Verify expected behavior
- [ ] Monitor for 15-30 minutes
- [ ] Close the loop with stakeholders

Templates are starting points. Adapt them to your context. What matters is that you have a starting point at all.