# Observability & Reliability Engineering
Complete system for building observable, reliable services — from structured logging to incident response to SLO-driven development.
---
Quick Health Check (score out of 16)
Score your current observability posture:
| Signal | Healthy (2) | Weak (1) | Missing (0) |
|--------|-------------|----------|-------------|
| Structured logging | JSON logs with trace_id correlation | Logs exist but unstructured | Console.log / print statements |
| Metrics collection | RED/USE metrics with dashboards | Some metrics, no dashboards | No metrics |
| Distributed tracing | Full request path with sampling | Partial traces, key services only | No tracing |
| Alerting | SLO-based alerts with runbooks | Threshold alerts, some runbooks | No alerts or all-noise |
| Incident response | Defined process with roles + post-mortems | Ad-hoc response, some docs | "Whoever notices fixes it" |
| SLOs defined | SLOs with error budgets tracked weekly | Informal availability targets | No reliability targets |
| On-call rotation | Structured rotation with escalation | Informal "call someone" | No on-call |
| Cost management | Observability budget tracked monthly | Some awareness of costs | No idea what you spend |
- 12-16: Production-grade. Focus on optimization.
- 8-11: Foundation exists. Fill the gaps systematically.
- 4-7: Significant risk. Prioritize alerting + incident response.
- 0-3: Flying blind. Start with Phase 1 immediately.
---
Phase 1: Structured Logging
Log Architecture
```
Application → Structured JSON → Log Router → Storage → Query Engine
                                     ↓
                               Alert Pipeline
```
Required Fields (Every Log Line)
| Field | Type | Purpose | Example |
|-------|------|---------|---------|
| `timestamp` | ISO-8601 UTC | When | `2026-02-22T18:30:00.123Z` |
| `level` | enum | Severity | `info`, `warn`, `error`, `fatal` |
| `service` | string | Which service | `payment-api` |
| `version` | string | Which deploy | `v2.3.1` |
| `environment` | string | Which env | `production` |
| `message` | string | What happened | `Payment processed successfully` |
| `trace_id` | string | Request correlation | `abc123def456` |
| `span_id` | string | Operation within trace | `span_789` |
| `duration_ms` | number | How long | `142` |
Contextual Fields (Add Per Domain)
```yaml
# HTTP request context
http:
  method: POST
  path: /api/v1/orders
  status: 201
  client_ip: 203.0.113.42        # Anonymize in logs if needed
  user_agent: "Mozilla/5.0..."
  request_id: "req_abc123"

# Business context
business:
  user_id: "usr_456"
  tenant_id: "tenant_789"
  order_id: "ord_012"
  action: "checkout"
  amount_cents: 4999
  currency: "USD"

# Error context
error:
  type: "PaymentDeclinedError"
  message: "Card declined: insufficient funds"
  code: "CARD_DECLINED"
  stack: "..."                   # Only in non-production or DEBUG level
  retry_count: 2
  retryable: true
```
Log Level Decision Tree
```
Is the process about to crash?
  → FATAL (exit after logging)

Did an operation fail that needs human attention?
  → ERROR (page someone or create ticket)

Did something unexpected happen but we recovered?
  → WARN (review in daily triage)

Is this a normal business event worth recording?
  → INFO (audit trail, business metrics)

Is this useful for debugging but noisy in production?
  → DEBUG (off in prod, on in staging)

Is this only useful when stepping through code?
  → TRACE (never in production)
```
Log Level Rules
- ERROR means action required — if no one needs to act on it, it's WARN
- INFO is for business events — not internal implementation details
- No logging inside tight loops — aggregate and log a summary (see the sketch after this list)
- Log at boundaries — API entry/exit, queue consume/publish, DB calls
- Never log secrets — API keys, tokens, passwords, PII (see scrubbing below)
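A minimal sketch of the "aggregate, don't log per-item" rule, using Pino (which this skill configures below); the `Item` type and `handleItem` handler are illustrative stand-ins for real batch work:

```typescript
import pino from 'pino';

const logger = pino();

// Illustrative stand-ins for real batch work
type Item = { id: string };
async function handleItem(_item: Item): Promise<void> { /* ... */ }

async function processBatch(items: Item[]) {
  const started = Date.now();
  let succeeded = 0;
  const failures: Record<string, number> = {};

  for (const item of items) {
    try {
      await handleItem(item);
      succeeded++;
    } catch (err) {
      // Count failures inside the loop; never log inside the loop
      const kind = err instanceof Error ? err.constructor.name : 'UnknownError';
      failures[kind] = (failures[kind] ?? 0) + 1;
    }
  }

  // One summary line at the batch boundary instead of one line per item
  logger.info(
    { batch_size: items.length, succeeded, failures, duration_ms: Date.now() - started },
    'batch processed',
  );
}
```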
PII & Secret Scrubbing
```yaml
scrub_patterns:
  # Always redact
  - field_patterns: ["password", "secret", "token", "api_key", "authorization"]
    action: replace_with_redacted

  # Hash for correlation without exposure
  - field_patterns: ["email", "phone", "ssn", "national_id"]
    action: sha256_hash

  # Mask partially
  - field_patterns: ["credit_card", "card_number"]
    action: mask_last_4          # e.g. "****-****-****-1234"

  # IP anonymization
  - field_patterns: ["client_ip", "ip_address"]
    action: zero_last_octet      # 203.0.113.0
```
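A TypeScript sketch of those four scrub actions using Node's built-in `crypto`; the `scrub()` helper and field-name regexes mirror the config above and are illustrative rather than a specific library API — wire the equivalents into your logger's redact hooks:

```typescript
import { createHash } from 'node:crypto';

// The four scrub actions from the config above (illustrative helper names)
const redact = () => '[REDACTED]';
const sha256Hash = (value: string) => createHash('sha256').update(value).digest('hex');
const maskLast4 = (value: string) => `****-****-****-${value.slice(-4)}`;
const zeroLastOctet = (ip: string) => ip.replace(/\.\d{1,3}$/, '.0'); // IPv4 only

function scrub(fields: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [key, value] of Object.entries(fields)) {
    if (/password|secret|token|api_key|authorization/i.test(key)) out[key] = redact();
    else if (/email|phone|ssn|national_id/i.test(key)) out[key] = sha256Hash(value);
    else if (/credit_card|card_number/i.test(key)) out[key] = maskLast4(value);
    else if (/client_ip|ip_address/i.test(key)) out[key] = zeroLastOctet(value);
    else out[key] = value;
  }
  return out;
}

scrub({ email: 'a@example.com', client_ip: '203.0.113.42', message: 'checkout ok' });
```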
Logger Setup (By Language)
Node.js (Pino):
```typescript
import pino from 'pino';
import { AsyncLocalStorage } from 'node:async_hooks';

const als = new AsyncLocalStorage<Record<string, string>>();

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
  mixin: () => als.getStore() ?? {},
  redact: ['req.headers.authorization', '*.password', '*.token'],
  timestamp: pino.stdTimeFunctions.isoTime,
});

// Middleware: inject context
app.use((req, res, next) => {
  const ctx = {
    trace_id: req.headers['x-trace-id'] || crypto.randomUUID(),
    request_id: crypto.randomUUID(),
    service: 'payment-api',
    version: process.env.APP_VERSION,
  };
  als.run(ctx, () => next());
});
```
Python (structlog):
```python
import structlog

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso", utc=True),
        structlog.processors.JSONRenderer(),
    ],
)

log = structlog.get_logger()

# Bind context per-request:
structlog.contextvars.bind_contextvars(trace_id=trace_id, user_id=user_id)
```
Go (zerolog):
```go
log := zerolog.New(os.Stdout).With().
    Timestamp().
    Str("service", "payment-api").
    Str("version", version).
    Logger()

// Per-request:
reqLog := log.With().Str("trace_id", traceID).Logger()
```
Log Storage Decision
| Volume | Solution | Retention | Cost |
|--------|----------|-----------|------|
| <10 GB/day | Loki + Grafana | 30 days hot, 90 days cold | Low |
| 10-100 GB/day | Elasticsearch / OpenSearch | 14 days hot, 90 days S3 | Medium |
| 100+ GB/day | ClickHouse or Datadog | 7 days hot, 30 days archive | High |
| Budget-constrained | Loki + S3 backend | 90 days all cold | Very low |
10 Logging Anti-Patterns
| # | Anti-Pattern | Fix |
|---|-------------|-----|
| 1 | `log.error(err)` with no context | Always include: what operation, what input, what state |
| 2 | Logging request/response bodies | Log only in DEBUG; redact sensitive fields |
| 3 | String concatenation in log messages | Use structured fields: `log.info("processed", { order_id, amount })` |
| 4 | Catch-and-log-and-rethrow | Log at the boundary where you handle it, not every layer |
| 5 | Different log formats per service | Standardize schema across all services |
| 6 | No log rotation / retention policy | Set max size + TTL; archive to cold storage |
| 7 | Logging inside hot paths | Aggregate: log summary every N items or every interval |
| 8 | Missing correlation IDs | Propagate trace_id from first entry point through all services |
| 9 | Boolean log levels (`verbose: true`) | Use standard levels with configurable minimum |
| 10 | Logging PII in plain text | Implement scrubbing at the logger level |
---
Phase 2: Metrics Collection
The RED Method (Request-Driven Services)
For every service endpoint, track:
| Metric | What | Prometheus Example |
|--------|------|--------------------|
| Rate | Requests per second | `http_requests_total{method, path, status}` |
| Errors | Failed requests per second | `http_requests_total{status=~"5.."}` / total |
| Duration | Latency distribution | `http_request_duration_seconds{method, path}` (histogram) |
The USE Method (Infrastructure Resources)
For every resource (CPU, memory, disk, network):
| Metric | What | Example |
|--------|------|---------|
| Utilization | % resource busy | CPU usage 78% |
| Saturation | Queue depth / backpressure | 12 requests queued |
| Errors | Resource errors | 3 disk I/O errors |
Golden Signals (Google SRE)
| Signal | Meaning | Source |
|--------|---------|--------|
| Latency | Time to serve requests | RED Duration |
| Traffic | Demand on the system | RED Rate |
| Errors | Rate of failed requests | RED Errors |
| Saturation | How "full" the service is | USE Saturation |
Metric Types & When to Use Each
| Type | Use Case | Example |
|------|----------|---------|
| Counter | Things that only go up | Total requests, errors, bytes sent |
| Gauge | Current value that goes up/down | Active connections, queue depth, temperature |
| Histogram | Distribution of values | Request latency, response size |
| Summary | Pre-calculated percentiles | Client-side latency (when you need exact percentiles) |
Rule: Use histograms over summaries in most cases — they're aggregatable across instances.
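As a sketch of what that looks like in code, here is a histogram instrument with the `prom-client` library (assumed; bucket boundaries are illustrative). Each instance exports only per-bucket counts, which is why Prometheus can sum buckets across pods and then compute percentiles, whereas a summary's pre-computed percentiles cannot be merged:

```typescript
import client from 'prom-client';

const httpDuration = new client.Histogram({
  name: 'http_server_request_duration_seconds',
  help: 'HTTP server request latency in seconds',
  labelNames: ['method', 'path', 'status'],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5], // base unit: seconds
});

// One observation per request; percentiles are computed later with histogram_quantile()
httpDuration.observe({ method: 'POST', path: '/api/v1/orders', status: '201' }, 0.142);
```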
Naming Conventions
```
# Pattern: <namespace>_<subsystem>_<name>_<unit>
http_server_request_duration_seconds
http_server_requests_total
db_pool_connections_active
queue_messages_pending
cache_hit_ratio

# Rules:
# 1. Use snake_case
# 2. Include unit suffix (_seconds, _bytes, _total)
# 3. _total suffix for counters
# 4. Don't include label names in metric name
# 5. Use base units (seconds not milliseconds, bytes not kilobytes)
```
Label Design Rules
| Rule | Why | Example |
|------|-----|---------|
| Keep cardinality <100 per label | High cardinality kills performance | `status="200"` not `status="200 OK"` |
| No user IDs as labels | Unbounded cardinality | Use log correlation instead |
| No request paths with IDs | `/api/users/123` creates millions of series | Normalize: `/api/users/:id` |
| Max 5-7 labels per metric | Each combo = a time series | `{method, path, status, service}` |
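A sketch of the path-normalization rule in an Express + `prom-client` setup (both assumed): record the route template rather than the raw URL so `/api/users/123` and `/api/users/456` land in one series. `req.route` is populated once a handler has matched; behavior may vary across Express versions, so treat this as illustrative:

```typescript
import express from 'express';
import client from 'prom-client';

const app = express();
const requests = new client.Counter({
  name: 'http_server_requests_total',
  help: 'HTTP requests by route template',
  labelNames: ['method', 'path', 'status'],
});

app.use((req, res, next) => {
  res.on('finish', () => {
    // Route template keeps cardinality bounded; unmatched requests collapse into one series
    const path = req.route?.path ?? 'unmatched';
    requests.inc({ method: req.method, path, status: String(res.statusCode) });
  });
  next();
});

app.get('/api/users/:id', (_req, res) => res.json({ ok: true })); // recorded as path="/api/users/:id"
```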
Instrumentation Checklist
```yaml
application_metrics:
  # HTTP layer
  - http_request_duration_seconds: histogram {method, path, status}
  - http_request_size_bytes: histogram {method, path}
  - http_response_size_bytes: histogram {method, path}
  - http_requests_in_flight: gauge

  # Business logic
  - orders_processed_total: counter {status, payment_method}
  - order_value_dollars: histogram {payment_method}
  - user_signups_total: counter {source}

  # Dependencies
  - db_query_duration_seconds: histogram {query_type, table}
  - db_connections_active: gauge {pool}
  - db_connections_idle: gauge {pool}
  - cache_requests_total: counter {result: hit|miss}
  - external_api_duration_seconds: histogram {service, endpoint}
  - external_api_errors_total: counter {service, error_type}

  # Queue / async
  - queue_messages_published_total: counter {queue}
  - queue_messages_consumed_total: counter {queue, status}
  - queue_processing_duration_seconds: histogram {queue}
  - queue_depth: gauge {queue}
  - queue_consumer_lag: gauge {queue, consumer_group}

infrastructure_metrics:
  # Node exporter / cAdvisor provides these automatically
  - cpu_usage_percent: gauge {instance}
  - memory_usage_bytes: gauge {instance}
  - disk_usage_bytes: gauge {instance, mount}
  - disk_io_seconds: counter {instance, device}
  - network_bytes: counter {instance, direction}
  - container_cpu_usage: gauge {pod, container}
  - container_memory_usage: gauge {pod, container}
```
Stack Recommendations
| Component | Options | Recommendation |
|-----------|---------|----------------|
| Collection | Prometheus, OTEL Collector, Datadog Agent | Prometheus (free) or OTEL Collector (vendor-neutral) |
| Storage | Prometheus, Thanos, Mimir, VictoriaMetrics | VictoriaMetrics (best cost/perf) or Mimir (Grafana ecosystem) |
| Visualization | Grafana, Datadog, New Relic | Grafana (free, extensible) |
| Alerting | Alertmanager, Grafana Alerting, PagerDuty | Alertmanager + PagerDuty routing |
---
Phase 3: Distributed Tracing
Trace Architecture
```
Client Request
  → API Gateway (root span)
      → Auth Service (child span)
      → Order Service (child span)
          → Database Query (child span)
      → Payment Service (child span)
          → Stripe API (child span)
      → Notification Service (child span)
          → Email Provider (child span)
```
OpenTelemetry Setup
Auto-instrumentation (Node.js):
```typescript
// tracing.ts — import BEFORE anything else
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4318/v1/traces',
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': { ignoreIncomingPaths: ['/health', '/ready'] },
      '@opentelemetry/instrumentation-express': { enabled: true },
    }),
  ],
  serviceName: process.env.OTEL_SERVICE_NAME || 'payment-api',
});

sdk.start();
```
Custom spans for business logic:
```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('payment-service');

async function processPayment(order: Order) {
  return tracer.startActiveSpan('process-payment', async (span) => {
    span.setAttributes({
      'order.id': order.id,
      'order.amount_cents': order.amountCents,
      'payment.method': order.paymentMethod,
    });
    try {
      const result = await chargeCard(order);
      span.setAttributes({ 'payment.status': result.status });
      return result;
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      span.recordException(err);
      throw err;
    } finally {
      span.end();
    }
  });
}
```
Sampling Strategies
| Strategy | When | Config |
|----------|------|--------|
| Always On | Dev/staging, low traffic (<100 rps) | `ratio: 1.0` |
| Probabilistic | Moderate traffic (100-1000 rps) | `ratio: 0.1` (10%) |
| Rate-limited | High traffic (>1000 rps) | `max_traces_per_second: 100` |
| Tail-based | Want all errors + slow requests | Collector-side: keep if error OR duration > p99 |
| Parent-based | Respect upstream decisions | If parent sampled, child sampled |
Recommendation: Start with parent-based + probabilistic (10%). Add tail-based at the collector to capture all errors.
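A minimal sketch of that default using the OpenTelemetry JS SDK samplers; class and package names follow the upstream SDK, so verify them against your installed version:

```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';

// Respect the upstream decision; otherwise sample ~10% of new root traces.
const sdk = new NodeSDK({
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1), // head sampling; tail-based error capture lives in the collector
  }),
});

sdk.start();
```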
Context Propagation
| Header | Standard | Format |
|--------|----------|--------|
| `traceparent` | W3C Trace Context | `00-{trace_id}-{span_id}-{flags}` |
| `tracestate` | W3C Trace Context | Vendor-specific key-value pairs |
| `b3` | Zipkin B3 | `{trace_id}-{span_id}-{sampled}` |
Rule: Use W3C Trace Context (`traceparent`) as primary. Support B3 for legacy Zipkin systems.
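For reference, a small sketch of what the `traceparent` header carries on the wire; in practice the W3C propagator in your tracing SDK builds and parses this, so the parser below is purely illustrative:

```typescript
interface TraceParent { version: string; traceId: string; spanId: string; flags: string }

function parseTraceparent(header: string): TraceParent | null {
  // Format: {2 hex version}-{32 hex trace_id}-{16 hex span_id}-{2 hex flags}
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!match) return null;
  const [, version, traceId, spanId, flags] = match;
  return { version, traceId, spanId, flags };
}

parseTraceparent('00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01');
// flags 01 = sampled; a downstream service keeps the traceId and generates a new spanId for its span
```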
Trace Storage
| Volume | Solution | Retention |
|--------|----------|-----------|
| <50 GB/day | Jaeger + Elasticsearch | 7 days |
| 50-500 GB/day | Tempo + S3 | 14 days |
| 500+ GB/day | Tempo + S3 with aggressive sampling | 7 days |
| Budget-constrained | Jaeger + Badger (local disk) | 3 days |
---
Phase 4: SLOs, SLIs & Error Budgets
SLI Selection by Service Type
| Service Type | Primary SLI | Secondary SLI | Measurement |
|--------------|-------------|---------------|-------------|
| API / Web | Availability + Latency | Error rate | Server-side + synthetic |
| Data pipeline | Freshness + Correctness | Throughput | Pipeline timestamps + checksums |
| Storage | Durability + Availability | Latency | Checksums + uptime monitoring |
| Streaming | Throughput + Latency | Message loss rate | Consumer lag + e2e latency |
| Batch jobs | Success rate + Freshness | Duration | Job scheduler metrics |
SLO Definition Template
```yaml
slo:
  name: "Payment API Availability"
  service: payment-api
  owner: payments-team

  sli:
    type: availability
    definition: "Proportion of non-5xx responses"
    measurement: |
      sum(rate(http_requests_total{service="payment-api",status!~"5.."}[5m]))
      /
      sum(rate(http_requests_total{service="payment-api"}[5m]))

  target: 99.95%          # 21.9 min downtime/month
  window: rolling_30d

  error_budget:
    total_minutes: 21.9   # per month
    burn_rate_alerts:
      - severity: critical
        burn_rate: 14.4x  # Budget consumed in ~2 days
        short_window: 5m
        long_window: 1h
      - severity: warning
        burn_rate: 6x     # Budget consumed in ~5 days
        short_window: 30m
        long_window: 6h
      - severity: ticket
        burn_rate: 1x     # Budget consumed in 30 days
        short_window: 6h
        long_window: 3d

  consequences:
    budget_remaining_above_50pct: "Normal development velocity"
    budget_remaining_20_to_50pct: "Prioritize reliability work"
    budget_remaining_below_20pct: "Feature freeze; reliability only"
    budget_exhausted: "All hands on reliability until budget recovers"
```
Common SLO Targets
| Service Tier | Availability | p50 Latency | p99 Latency | Monthly Downtime |
|--------------|-------------|-------------|-------------|------------------|
| Tier 0 (payments, auth) | 99.99% | <100ms | <500ms | 4.3 min |
| Tier 1 (core API) | 99.95% | <200ms | <1s | 21.9 min |
| Tier 2 (non-critical) | 99.9% | <500ms | <2s | 43.8 min |
| Tier 3 (internal tools) | 99.5% | <1s | <5s | 3.6 hours |
| Batch / pipeline | 99% (success rate) | N/A | N/A | N/A |
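The downtime column is just the allowed error fraction multiplied by the minutes in a month. A small sketch of the arithmetic, assuming an average month of 43,830 minutes (365.25 days / 12); a fixed 30-day window gives slightly smaller numbers, which is why some entries round differently:

```typescript
const MINUTES_PER_MONTH = 43_830;

function errorBudgetMinutes(availabilityTarget: number): number {
  return (1 - availabilityTarget) * MINUTES_PER_MONTH;
}

errorBudgetMinutes(0.9999); // ≈ 4.4 min   (Tier 0)
errorBudgetMinutes(0.9995); // ≈ 21.9 min  (Tier 1)
errorBudgetMinutes(0.999);  // ≈ 43.8 min  (Tier 2)
errorBudgetMinutes(0.995);  // ≈ 219 min ≈ 3.6 hours (Tier 3)
```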
Error Budget Tracking
```yaml
# Weekly error budget review template
error_budget_review:
  week: "2026-W08"
  service: payment-api
  slo_target: 99.95%

  budget:
    total_minutes_this_period: 21.9
    consumed_minutes: 8.2
    remaining_minutes: 13.7
    remaining_percent: 62.6%

  incidents_consuming_budget:
    - date: "2026-02-18"
      duration_minutes: 5.1
      cause: "Database connection pool exhaustion"
      preventable: true
      action: "Increase pool size + add saturation alert"
    - date: "2026-02-20"
      duration_minutes: 3.1
      cause: "Upstream payment provider timeout"
      preventable: false
      action: "Add circuit breaker with fallback"

  velocity_decision: "Normal — 62.6% budget remaining"

  reliability_work_this_week:
    - "Add connection pool saturation alert"
    - "Implement circuit breaker for payment provider"
```
---
Phase 5: Alert Design
Alert Quality Principles
- Every alert must be actionable — if no one needs to act, it's not an alert
- Every alert needs a runbook — linked directly in the alert annotation
- Symptom-based over cause-based — alert on "users can't checkout" not "CPU high"
- Multi-window burn rate — not static thresholds (see the SLO alerts above and the sketch after this list)
- Alert on absence, not just presence — "no orders in 15 min" catches silent failures
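A sketch of the multi-window check referenced above: page only when both a short and a long window are burning the error budget faster than the threshold. The two ratios stand in for the PromQL error-ratio queries from Phase 4 evaluated over each window; names here are illustrative:

```typescript
type WindowErrorRatios = { short: number; long: number }; // e.g. 5m and 1h error ratios

function shouldPage(sloTarget: number, ratios: WindowErrorRatios, burnRateThreshold: number): boolean {
  const budgetFraction = 1 - sloTarget;               // e.g. 0.0005 for a 99.95% SLO
  const shortBurn = ratios.short / budgetFraction;
  const longBurn = ratios.long / budgetFraction;
  // The long window proves the burn is sustained; the short window proves it is still
  // happening, so the alert clears quickly once the problem is mitigated.
  return shortBurn >= burnRateThreshold && longBurn >= burnRateThreshold;
}

shouldPage(0.9995, { short: 0.012, long: 0.009 }, 14.4);  // true: page (burn ~24x and ~18x)
shouldPage(0.9995, { short: 0.0004, long: 0.009 }, 14.4); // false: the spike is already over
```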
Alert Severity Levels
| Severity | Response Time | Channel | Who | Example |
|----------|--------------|---------|-----|---------|
| P0 — Critical | <5 min | Page (PagerDuty/Opsgenie) | On-call engineer | Payment system down |
| P1 — High | <30 min | Page during business hours, Slack 24/7 | On-call | Error rate >5% for 10 min |
| P2 — Medium | <4 hours | Slack channel | Team | p99 latency degraded 2x |
| P3 — Low | Next business day | Ticket auto-created | Team backlog | Disk usage >80% |
| Info | N/A | Dashboard only | No one | Deploy completed |
Alerting Anti-Patterns
| Anti-Pattern | Problem | Fix |
|-------------|---------|-----|
| Static CPU/memory thresholds | Noisy, not user-impacting | Use SLO-based burn rate alerts |
| Alert per instance | 50 instances = 50 alerts for same issue | Aggregate: alert on service-level error rate |
| No deduplication | Same alert fires 100 times | Group by service + alert name; set repeat interval |
| Missing runbook | Engineer gets paged, doesn't know what to do | Every alert links to a runbook |
| Threshold too sensitive | Fires on brief spikes | Use `for: 5m` to require sustained condition |
| Too many P0s | Alert fatigue → ignoring real incidents | Audit monthly; demote or remove noisy alerts |
Alert Template (Prometheus Alertmanager)
```yaml
groups:
  - name: payment-api-slo
    rules:
      - alert: PaymentAPIHighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{service="payment-api",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="payment-api"}[5m]))
          ) > 0.01
        for: 5m
        labels:
          severity: critical
          service: payment-api
          team: payments
        annotations:
          summary: "Payment API error rate {{ $value | humanizePercentage }} (>1%)"
          description: "5xx error rate has exceeded 1% for 5 minutes"
          runbook: "https://wiki.internal/runbooks/payment-api-errors"
          dashboard: "https://grafana.internal/d/payment-api"

      - alert: PaymentAPINoTraffic
        expr: |
          sum(rate(http_requests_total{service="payment-api"}[15m])) == 0
        for: 5m
        labels:
          severity: critical
          service: payment-api
        annotations:
          summary: "Payment API receiving zero traffic for 5 minutes"
          runbook: "https://wiki.internal/runbooks/payment-api-no-traffic"

      - alert: PaymentAPILatencyHigh
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{service="payment-api"}[5m])) by (le)
          ) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Payment API p99 latency {{ $value }}s (>2s for 10min)"
          runbook: "https://wiki.internal/runbooks/payment-api-latency"
```
Runbook Template
```markdown
# Runbook: PaymentAPIHighErrorRate

## What This Alert Means
The payment API is returning >1% 5xx errors over a 5-minute window.
Users are likely failing to complete checkouts.

## Impact
- Users cannot process payments
- Revenue loss: ~$X per minute (based on average traffic)
- SLO: Payment API availability (target: 99.95%)

## Immediate Actions
1. Check the error dashboard: [link]
2. Check recent deploys: `kubectl rollout history deployment/payment-api`
3. Check upstream dependencies:
   - Database: [dashboard link]
   - Stripe API: [status page]
   - Redis cache: [dashboard link]
4. Check application logs:
   `kubectl logs -l app=payment-api --since=10m | jq 'select(.level=="error")'`

## Common Causes & Fixes
| Cause | Diagnosis | Fix |
|-------|-----------|-----|
| Bad deploy | Errors started at deploy time | `kubectl rollout undo deployment/payment-api` |
| DB connection exhaustion | `db_connections_active` at max | Restart pods (rolling) + increase pool size |
| Stripe outage | Stripe status page red | Enable fallback payment processor |
| Memory leak | Memory climbing, OOMKilled events | Rolling restart + investigate |

## Escalation
- If unresolved after 15 min: page payment team lead
- If revenue impact >$10K: page VP Engineering
- If Stripe outage: communicate to support team for customer messaging

## Resolution
- Confirm error rate <0.1% for 10 min
- Post in #incidents: root cause + duration + impact
- Schedule post-mortem if downtime >5 min
```
---
Phase 6: Dashboard Architecture
Dashboard Hierarchy
```
L1: Executive / Business Dashboard   (non-technical stakeholders)
        ↓
L2: Service Overview Dashboard       (on-call, quick triage)
        ↓
L3: Service Deep-Dive Dashboard      (debugging specific service)
        ↓
L4: Infrastructure Dashboard         (resource-level details)
```
L1: Business Dashboard
```yaml
panels:
  - title: "Revenue per Minute"
    type: stat
    query: "sum(rate(orders_total{status='completed'}[5m])) * avg(order_value_dollars)"
  - title: "Active Users (5min)"
    type: stat
    query: "count(count by (user_id) (http_requests_total{...}[5m]))"
  - title: "Checkout Success Rate"
    type: gauge
    query: "sum(rate(checkout_total{status='success'}[1h])) / sum(rate(checkout_total[1h]))"
    thresholds: [95, 98, 99.5]
  - title: "Error Budget Remaining"
    type: gauge
    query: "1 - (error_budget_consumed / error_budget_total)"
```
L2: Service Overview Dashboard
Every service gets one of these with identical layout:
```yaml
row_1_traffic:
  - "Request Rate (rps)" — timeseries, by status code
  - "Error Rate (%)" — timeseries, threshold line at SLO
  - "Active Requests" — gauge

row_2_latency:
  - "Latency Distribution" — heatmap
  - "p50 / p95 / p99" — timeseries, threshold lines
  - "Latency by Endpoint" — table, sorted by p99

row_3_dependencies:
  - "Downstream Latency" — timeseries per dependency
  - "Downstream Error Rate" — timeseries per dependency
  - "Database Query Duration" — timeseries by query type

row_4_resources:
  - "CPU Usage" — timeseries per pod
  - "Memory Usage" — timeseries per pod
  - "Pod Restarts" — stat

row_5_business:
  - "Business Metric 1" — service-specific
  - "Business Metric 2" — service-specific
```
Dashboard Rules
- Time range default: last 1 hour — most debugging happens in recent time
- Variable selectors at top: environment, service, instance
- Consistent color coding: green=good, yellow=degraded, red=bad across all dashboards
- Link alerts to dashboards — every alert annotation includes dashboard URL
- No more than 15 panels per dashboard — split into L3 if needed
- Include "as of" timestamp — so screenshots in incidents are unambiguous
- Dashboard as code — store Grafana JSON in git, provision via API
---
Phase 7: Incident Response
Incident Severity Classification
| Severity | Criteria | Response | Communication |
|----------|----------|----------|---------------|
| SEV-1 | Service down, data loss risk, security breach | All hands, war room | Status page update every 15 min |
| SEV-2 | Degraded service, SLO at risk, partial outage | On-call + backup | Status page update every 30 min |
| SEV-3 | Minor degradation, workaround exists | On-call during hours | Internal Slack update |
| SEV-4 | Cosmetic, low impact | Next sprint | None |
Incident Roles
| Role | Responsibility | Who |
|------|---------------|-----|
| Incident Commander (IC) | Owns the incident. Coordinates. Makes decisions. | On-call lead |
| Technical Lead | Diagnoses and fixes. Communicates technical status to IC. | Senior engineer |
| Communications Lead | Updates status page, Slack, stakeholders. | Product/support |
| Scribe | Documents timeline, actions, decisions in real-time. | Anyone available |
Incident Response Workflow
```
DETECT
  - Alert fires → on-call paged
  - Customer report → support escalates
  - Internal discovery → engineer reports

TRIAGE (first 5 minutes)
  - Confirm the issue is real (not false alert)
  - Classify severity (SEV-1 through SEV-4)
  - Open incident channel: #inc-YYYY-MM-DD-short-description
  - Assign roles (IC, Tech Lead, Comms)

MITIGATE (next 5-30 minutes)
  - Goal: STOP THE BLEEDING, not find root cause
  - Options (try in order):
      a. Rollback last deploy
      b. Scale up / restart pods
      c. Toggle feature flag off
      d. Redirect traffic / enable fallback
      e. Manual data fix
  - Document every action with timestamp

STABILIZE
  - Confirm mitigation is working (metrics back to normal)
  - Monitor for 15-30 min for recurrence
  - Update status page: "Monitoring fix"

RESOLVE
  - Confirm all metrics healthy for 30+ min
  - Update status page: "Resolved"
  - Schedule post-mortem (within 48 hours for SEV-1/2)
  - Send internal summary to stakeholders
```
Incident Channel Template
```
📋 Incident: Payment API 5xx Errors
🔴 Severity: SEV-2
🕐 Started: 2026-02-22 14:23 UTC
👤 IC: @alice
🔧 Tech Lead: @bob
📢 Comms: @charlie

Status: MITIGATING
Impact: ~5% of checkout requests failing
Customer-facing: Yes

Timeline:
14:23 — Alert fired: PaymentAPIHighErrorRate
14:25 — IC assigned: @alice, confirmed real via dashboard
14:28 — Tech Lead: error logs show connection pool exhaustion post-deploy
14:31 — Rolled back deployment v2.3.1 → v2.3.0
14:35 — Error rate dropping, monitoring
14:50 — Error rate <0.1%, marking resolved
```
---
Phase 8: Post-Mortem Framework
Blameless Post-Mortem Template
```yaml
post_mortem:
  title: "Payment API Connection Pool Exhaustion"
  date: "2026-02-22"
  severity: SEV-2
  duration: 27 minutes (14:23 — 14:50 UTC)
  authors: ["@alice", "@bob"]
  reviewers: ["@engineering-leads"]
  status: action_items_in_progress

  summary: |
    A deployment at 14:15 introduced a connection leak in the payment API.
    Connection pool was exhausted by 14:23, causing 5xx errors for ~5% of
    checkout requests. Rolled back at 14:31; recovered by 14:50.

  impact:
    user_impact: "~340 users saw checkout failures over 27 minutes"
    revenue_impact: "$2,100 estimated (based on average order value × failed checkouts)"
    slo_impact: "Consumed 5.1 min of 21.9 min monthly error budget (23%)"
    data_impact: "No data loss. 12 orders failed; users could retry successfully."

  timeline:
    - time: "14:15"
      event: "Deploy v2.3.1 rolled out (3/3 pods updated)"
    - time: "14:23"
      event: "PaymentAPIHighErrorRate alert fired"
    - time: "14:25"
      event: "IC assigned, confirmed via dashboard"
    - time: "14:28"
      event: "Root cause identified: new ORM query not releasing connections"
    - time: "14:31"
      event: "Rollback initiated: v2.3.1 → v2.3.0"
    - time: "14:35"
      event: "Error rate declining"
    - time: "14:50"
      event: "Resolved: error rate <0.1% sustained"

  root_cause: |
    The v2.3.1 deploy introduced a new database query in the order validation path.
    The query used a raw connection instead of the pool's managed client, so
    connections were acquired but never released. Under load, the pool exhausted
    within 8 minutes.

  contributing_factors:
    - "No integration test for connection pool behavior under load"
    - "Connection pool saturation metric existed but had no alert"
    - "Code review didn't catch raw connection usage"

  what_went_well:
    - "Alert fired within 8 minutes of deploy"
    - "IC assigned in 2 minutes"
    - "Root cause identified in 3 minutes (clear in logs)"
    - "Rollback executed cleanly"

  what_went_wrong:
    - "8-minute detection gap after deploy"
    - "No canary deployment to catch before full rollout"
    - "Connection pool saturation had no alert"

  action_items:
    - action: "Add connection pool saturation alert (>80% for 2 min)"
      owner: "@bob"
      priority: P1
      due: "2026-02-25"
      status: in_progress
      ticket: "ENG-1234"
    - action: "Enable canary deployments for payment-api"
      owner: "@alice"
      priority: P1
      due: "2026-03-01"
      ticket: "ENG-1235"
    - action: "Add linting rule: no raw DB connections in application code"
      owner: "@charlie"
      priority: P2
      due: "2026-03-07"
      ticket: "ENG-1236"
    - action: "Load test payment-api connection pool in staging"
      owner: "@bob"
      priority: P2
      due: "2026-03-07"
      ticket: "ENG-1237"

  lessons_learned:
    - "Resource saturation metrics need alerts, not just dashboards"
    - "Canary deployments are mandatory for Tier 0 services"
    - "ORM abstractions don't guarantee connection safety — review raw queries"
```
Post-Mortem Meeting Agenda (60 minutes)
```
(5 min)  Context setting — IC reads the summary
(15 min) Timeline walkthrough — what happened, when, by whom
(15 min) Root cause deep-dive — 5 Whys exercise
(5 min)  What went well — celebrate good response
(15 min) Action items — assign owners, priorities, due dates
(5 min)  Wrap-up — review date for action item check-in
```
5 Whys Exercise
```
Problem: 5xx errors in payment API

Why 1: Database connections were exhausted
Why 2: A new query acquired connections without releasing them
Why 3: The query used a raw connection instead of the pool manager
Why 4: The ORM's raw query API doesn't auto-release (by design)
Why 5: We don't have a linting rule or code review checklist item for this

Root cause: Missing guard against raw connection usage in application code
Systemic fix: Linting rule + connection pool saturation alerting
```
---
Phase 9: On-Call Operations
On-Call Structure
```yaml
on_call:
  rotation: weekly
  handoff_day: Monday 10:00 UTC

  primary:
    response_time: 5 minutes (SEV-1/2), 30 minutes (SEV-3)
    escalation_after: 15 minutes no-ack
  secondary:
    response_time: 15 minutes (SEV-1), 1 hour (SEV-2/3)
    escalation_after: 30 minutes no-ack
  manager_escalation:
    trigger: SEV-1 unresolved after 30 minutes

  handoff_checklist:
    - Review open incidents and active alerts
    - Check error budget status for all services
    - Read post-mortems from previous week
    - Verify PagerDuty schedule and contact info
    - Test alert routing (send test page)
```
On-Call Health Metrics
| Metric | Healthy | Needs Attention | Unhealthy |
|--------|---------|-----------------|-----------|
| Pages per week | <5 | 5-15 | >15 |
| After-hours pages per week | <2 | 2-5 | >5 |
| False positive rate | <10% | 10-30% | >30% |
| Mean time to acknowledge | <5 min | 5-15 min | >15 min |
| Mean time to resolve | <30 min | 30-120 min | >120 min |
| Toil ratio (manual vs automated) | <30% | 30-60% | >60% |
Weekly On-Call Review Template
```yaml
on_call_review:
  week: "2026-W08"
  engineer: "@bob"

  incidents:
    total: 7
    sev_1: 0
    sev_2: 1
    sev_3: 4
    false_positives: 2
    after_hours: 3

  time_spent:
    incident_response: "4.5 hours"
    toil_automation: "2 hours"
    runbook_updates: "1 hour"

  improvements_made:
    - "Silenced noisy disk alert on dev servers"
    - "Added auto-remediation for pod restart threshold"

  improvements_needed:
    - "Cache expiry alert fires every Tuesday at 03:00 — needs investigation"
    - "Payment retry logic needs circuit breaker (caused 3 alerts)"

  handoff_notes: |
    Watch payment-api p99 latency — it's been creeping up since Wednesday.
    Stripe changed their sandbox endpoints; staging may throw errors.
```
---
Phase 10: Chaos Engineering & Reliability Testing
Chaos Principles
- Start with a hypothesis: "If X fails, the system should Y"
- Run in production (start small — one instance, one AZ)
- Minimize blast radius with automatic rollback
- Build confidence incrementally: staging → canary → production
Chaos Experiment Template
```yaml
chaos_experiment:
  name: "Payment DB failover"
  hypothesis: "If the primary database becomes unavailable, traffic should failover to the replica within 30 seconds with <1% error rate spike"

  steady_state:
    - metric: "checkout_success_rate"
      expected: ">99.5%"
    - metric: "db_query_duration_p99"
      expected: "<200ms"

  injection:
    type: "network_partition"
    target: "payment-db-primary"
    duration: "5 minutes"
    blast_radius: "single AZ"

  abort_conditions:
    - "checkout_success_rate < 95% for > 60 seconds"
    - "revenue_per_minute drops > 50%"
    - "any SEV-1 incident declared"

  results:
    failover_time: "22 seconds"
    error_spike: "0.3% for 25 seconds"
    hypothesis_confirmed: true
    follow_up_actions:
      - "Document failover behavior in runbook"
      - "Add failover time as SLI (target: <30s)"
```
Chaos Engineering Maturity Levels
| Level | What You Test | Tools |
|-------|--------------|-------|
| 1: Manual | Kill a pod, see what happens | `kubectl delete pod` |
| 2: Automated | Scheduled pod kills, network delays | Chaos Monkey, Litmus |
| 3: Game Days | Multi-failure scenarios with team exercise | Custom scripts + coordination |
| 4: Continuous | Automated chaos in production with auto-rollback | Gremlin, Chaos Mesh |
---
Phase 11: Observability Cost Optimization
Cost Drivers (Ranked)
| # | Driver | Typical % of Bill | Optimization |
|---|--------|-------------------|-------------|
| 1 | Log volume | 40-60% | Reduce verbosity, drop DEBUG, sample repetitive |
| 2 | Metric cardinality | 15-25% | Drop unused metrics, limit labels |
| 3 | Trace volume | 10-20% | Sampling, tail-based sampling |
| 4 | Retention | 10-15% | Tiered storage (hot → warm → cold) |
| 5 | Query cost | 5-10% | Optimize dashboard queries, set max scan limits |
Cost Reduction Checklist
```yaml
cost_optimization:
  logs:
    - action: "Drop DEBUG/TRACE in production"
      savings: "30-50% of log volume"
    - action: "Sample health check logs (1:100)"
      savings: "5-15% of log volume"
    - action: "Deduplicate identical error bursts"
      savings: "10-20% during incidents"
    - action: "Move logs older than 7 days to S3/cold storage"
      savings: "60-80% of storage cost"
    - action: "Drop request/response body logging"
      savings: "20-40% of log volume"

  metrics:
    - action: "Audit unused metrics (no dashboard, no alert)"
      savings: "10-30% of series"
    - action: "Reduce histogram bucket count (default 11 → 8)"
      savings: "~27% of histogram series"
    - action: "Remove high-cardinality labels"
      savings: "Variable — can be massive"
    - action: "Increase scrape interval for non-critical metrics (15s → 60s)"
      savings: "75% of data points for those metrics"

  traces:
    - action: "Implement tail-based sampling"
      savings: "80-95% of trace volume"
    - action: "Drop internal health check traces"
      savings: "5-20% of trace volume"
    - action: "Reduce span attribute size (truncate long strings)"
      savings: "10-30% of trace storage"

  general:
    - action: "Review and right-size retention policies quarterly"
    - action: "Set query timeouts and result limits on dashboards"
    - action: "Use recording rules for expensive queries"
```
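As one concrete example, the "sample health check logs (1:100)" item can be a few lines at the logger call site. This Pino-based sketch uses an illustrative 1% rate and path list:

```typescript
import pino from 'pino';

const logger = pino();

// Keep roughly 1 in 100 health-check lines; log everything else as usual.
const HEALTH_PATHS = new Set(['/health', '/ready']);

function logRequest(path: string, fields: Record<string, unknown>) {
  if (HEALTH_PATHS.has(path) && Math.random() >= 0.01) return; // drop ~99% of health-check lines
  logger.info({ path, ...fields }, 'request completed');
}

logRequest('/health', { status: 200, duration_ms: 2 });          // emitted ~1% of the time
logRequest('/api/v1/orders', { status: 201, duration_ms: 142 }); // always emitted
```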
Monthly Cost Review Template
```yaml
observability_cost_review:
  month: "February 2026"
  total_cost: "$X,XXX"

  breakdown:
    logs: { volume: "X TB", cost: "$X", pct: "X%" }
    metrics: { series: "X million", cost: "$X", pct: "X%" }
    traces: { volume: "X TB", cost: "$X", pct: "X%" }
    infrastructure: { instances: X, cost: "$X", pct: "X%" }

  cost_per:
    request: "$0.000X"
    service: "$X average"
    engineer: "$X per engineer"

  optimizations_applied: []
  optimizations_planned: []
  budget_status: "on_track | over_budget | under_budget"
```
---
Phase 12: Advanced Patterns
Correlation: Connecting the Three Pillars
```
Every log line includes:   trace_id, span_id
Every trace span includes: service, operation
Every metric includes:     service label

Correlation paths:

Alert fires (metric)
  → Click → Dashboard (metric)
  → Filter by time window
  → Trace search (same service + time)
  → Find failing trace
  → Logs (filter by trace_id)
  → See exact error

Support ticket (user report)
  → Find request_id in logs
  → Extract trace_id
  → View full trace
  → Identify slow span
  → Check span's service metrics
  → Confirm pattern
```
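A minimal sketch of the first requirement (every log line carries trace_id and span_id), using Pino's `mixin` hook and the active-span accessor from `@opentelemetry/api`; verify the API against your installed SDK version:

```typescript
import { trace } from '@opentelemetry/api';
import pino from 'pino';

// Stamp every log line with the current trace context so the metric → trace → log path works.
const logger = pino({
  mixin() {
    const span = trace.getActiveSpan();
    if (!span) return {};
    const { traceId, spanId } = span.spanContext();
    return { trace_id: traceId, span_id: spanId };
  },
});

logger.info({ order_id: 'ord_012' }, 'payment processed'); // includes trace_id + span_id when inside a span
```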
Synthetic Monitoring
```yaml
synthetic_checks:
  - name: "Checkout flow"
    type: browser
    frequency: 5m
    locations: [us-east, eu-west, ap-southeast]
    steps:
      - navigate: "https://app.example.com/products"
      - click: "Add to Cart"
      - click: "Checkout"
      - assert: "Order confirmation page loads in <3s"
    alert_on: "2 consecutive failures from same location"

  - name: "API health"
    type: api
    frequency: 1m
    endpoints:
      - url: "https://api.example.com/health"
        expected_status: 200
        max_latency_ms: 500
      - url: "https://api.example.com/v1/products?limit=1"
        expected_status: 200
        max_latency_ms: 1000
```
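A standalone sketch of the "API health" check above as a small probe you could run from any scheduler or region (Node 18+ global `fetch`; the URL, latency budget, and two-failure threshold are illustrative):

```typescript
async function checkEndpoint(url: string, maxLatencyMs: number): Promise<boolean> {
  const started = Date.now();
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(maxLatencyMs) });
    return res.status === 200 && Date.now() - started <= maxLatencyMs;
  } catch {
    return false; // timeout or network error counts as a failed check
  }
}

let consecutiveFailures = 0;
async function runCheck() {
  const ok = await checkEndpoint('https://api.example.com/health', 500);
  consecutiveFailures = ok ? 0 : consecutiveFailures + 1;
  if (consecutiveFailures >= 2) {
    console.error('synthetic check failing twice in a row; page the on-call');
  }
}
setInterval(runCheck, 60_000); // frequency: 1m
```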
Feature Flag Observability
```yaml
# Correlate feature flags with metrics
feature_flag_monitoring:
  - flag: "new_checkout_flow"
    metrics_to_compare:
      - "checkout_conversion_rate"   # by flag variant
      - "checkout_error_rate"
      - "checkout_latency_p99"
    alerts:
      - "If error rate for new variant > 2x control, auto-disable flag"
```
Observability Maturity Model
| Dimension | Level 1 | Level 2 | Level 3 | Level 4 |
|-----------|---------|---------|---------|---------|
| Logging | Unstructured logs | Structured JSON, centralized | Correlated with traces | Automated log analysis |
| Metrics | Basic infra metrics | RED/USE for services | SLO-based with error budgets | Predictive (anomaly detection) |
| Tracing | No tracing | Key services instrumented | Full distributed tracing | Trace-driven testing |
| Alerting | Static thresholds | Multi-signal alerts | Burn-rate based on SLOs | Auto-remediation |
| Incident Response | Ad hoc | Defined process + roles | Post-mortems with action tracking | Chaos engineering in prod |
| Culture | "Ops team handles it" | Shared ownership (you build it, you run it) | SLO-driven development velocity | Reliability as a feature |
---
Quality Scoring Rubric (0-100)
| Dimension | Weight | 0 | 5 | 10 |
|-----------|--------|---|---|-----|
| Logging quality | 15% | Unstructured, no correlation | Structured JSON, missing fields | Full schema, trace correlation, PII scrubbing |
| Metrics coverage | 15% | No metrics | RED or USE, not both | RED + USE + business metrics + custom |
| Tracing completeness | 10% | No tracing | Key services | Full path, sampling strategy, tail-based |
| SLO maturity | 15% | No reliability targets | Informal targets | SLOs with error budgets, burn-rate alerts, weekly review |
| Alert quality | 15% | Noisy/missing | Actionable, some runbooks | SLO-based, full runbooks, low false positive |
| Incident response | 10% | Ad hoc | Defined process | Full process, roles, post-mortems, chaos engineering |
| Dashboard design | 10% | No dashboards | Basic panels | Hierarchical L1-L4, consistent, linked to alerts |
| Cost efficiency | 10% | Unknown cost | Tracked | Optimized, reviewed monthly, within budget |
- 90-100: World-class. Teach others.
- 70-89: Production-ready. Fill specific gaps.
- 50-69: Functional but fragile.
- <50: Significant reliability risk.
---
10 Observability Commandments
- Structured or it didn't happen — unstructured logs are technical debt
- Correlate everything — trace_id connects logs, traces, and metrics
- Alert on symptoms, not causes — users don't care about CPU, they care about latency
- Every alert gets a runbook — no runbook = no alert
- SLOs drive velocity — error budgets decide when to ship vs stabilize
- Dashboards have hierarchy — executives don't need pod CPU graphs
- Blameless post-mortems always — blame prevents learning
- Cost is a feature — observability that bankrupts you isn't observability
- You build it, you run it — the team that ships code owns its observability
- Practice failure — chaos engineering builds confidence
---
12 Natural Language Commands
| Command | What It Does |
|---------|-------------|
| "Audit our observability" | Run the 16-point health check, score each dimension, prioritize gaps |
| "Design logging for [service]" | Generate structured log schema with context fields for the service |
| "Set up metrics for [service]" | Create RED + USE + business metric instrumentation plan |
| "Create SLOs for [service]" | Define SLIs, targets, error budgets, and burn-rate alert rules |
| "Design alerts for [service]" | Create alert rules with severity, thresholds, and runbook templates |
| "Build dashboard for [service]" | Design L2 service overview dashboard with panel specifications |
| "Write a runbook for [alert]" | Generate structured runbook with diagnosis steps and fixes |
| "Run post-mortem for [incident]" | Generate blameless post-mortem document with timeline and action items |
| "Set up on-call for [team]" | Design rotation, escalation policy, handoff checklist |
| "Plan chaos experiment for [scenario]" | Design experiment with hypothesis, injection, abort conditions |
| "Optimize observability costs" | Audit current spend, identify top savings, create reduction plan |
| "Design tracing for [system]" | Create OpenTelemetry instrumentation plan with sampling strategy |
---
⚡ Level Up Your Observability
This skill gives you the methodology. For industry-specific implementation patterns:
- SaaS companies: AfrexAI SaaS Context Pack ($47) — includes SaaS-specific SLOs, multi-tenant monitoring, and usage-based billing observability
- Fintech: AfrexAI Fintech Context Pack ($47) — compliance audit logging, transaction monitoring, fraud detection signals
- Healthcare: AfrexAI Healthcare Context Pack ($47) — HIPAA audit trails, PHI access logging, uptime requirements
🔗 More Free Skills by AfrexAI
- `afrexai-devops-engine` — CI/CD, infrastructure, deployment strategies
- `afrexai-api-architect` — API design, security, versioning
- `afrexai-database-engineering` — Schema design, query optimization, migrations
- `afrexai-code-reviewer` — Code review methodology with SPEAR framework
- `afrexai-prompt-engineering` — System prompt design, testing, optimization
Browse all AfrexAI skills: clawhub.com | Full storefront
Use Cases
- Set up structured logging and distributed tracing for production services
- Define and implement Service Level Objectives (SLOs) for reliability engineering
- Build incident response playbooks with clear escalation procedures
- Score your current observability posture using the built-in health check
- Design alerting rules that reduce noise and catch real issues
Pros & Cons
Pros
- Comprehensive coverage from logging to tracing to SLO-driven development
- Includes a 16-point health check for quick observability posture assessment
- Covers incident response alongside monitoring — not just data collection
Cons
- Methodology and framework only — does not deploy actual monitoring infrastructure
- Requires existing familiarity with observability concepts and tooling
- No specific tool integrations (Datadog, Grafana, etc.) — platform-agnostic but generic