
Observability & Reliability Engineering

Verified

Complete observability & reliability engineering system. Use when designing monitoring, implementing structured logging, setting up distributed tracing, buil...

147 downloads
$ Add to .claude/skills/

About This Skill

# Observability & Reliability Engineering

Complete system for building observable, reliable services — from structured logging to incident response to SLO-driven development.

---

Quick Health Check (/16)

Score your current observability posture:

| Signal | Healthy (2) | Weak (1) | Missing (0) |
|--------|-------------|----------|-------------|
| Structured logging | JSON logs with trace_id correlation | Logs exist but unstructured | Console.log / print statements |
| Metrics collection | RED/USE metrics with dashboards | Some metrics, no dashboards | No metrics |
| Distributed tracing | Full request path with sampling | Partial traces, key services only | No tracing |
| Alerting | SLO-based alerts with runbooks | Threshold alerts, some runbooks | No alerts or all-noise |
| Incident response | Defined process with roles + post-mortems | Ad-hoc response, some docs | "Whoever notices fixes it" |
| SLOs defined | SLOs with error budgets tracked weekly | Informal availability targets | No reliability targets |
| On-call rotation | Structured rotation with escalation | Informal "call someone" | No on-call |
| Cost management | Observability budget tracked monthly | Some awareness of costs | No idea what you spend |

- 12-16: Production-grade. Focus on optimization.
- 8-11: Foundation exists. Fill the gaps systematically.
- 4-7: Significant risk. Prioritize alerting + incident response.
- 0-3: Flying blind. Start with Phase 1 immediately.

---

Phase 1: Structured Logging

Log Architecture

```
Application → Structured JSON → Log Router → Storage → Query Engine
                                     ↓
                               Alert Pipeline
```

Required Fields (Every Log Line)

| Field | Type | Purpose | Example |
|-------|------|---------|---------|
| `timestamp` | ISO-8601 UTC | When | `2026-02-22T18:30:00.123Z` |
| `level` | enum | Severity | `info`, `warn`, `error`, `fatal` |
| `service` | string | Which service | `payment-api` |
| `version` | string | Which deploy | `v2.3.1` |
| `environment` | string | Which env | `production` |
| `message` | string | What happened | `Payment processed successfully` |
| `trace_id` | string | Request correlation | `abc123def456` |
| `span_id` | string | Operation within trace | `span_789` |
| `duration_ms` | number | How long | `142` |

Contextual Fields (Add Per Domain)

```yaml
# HTTP request context
http:
  method: POST
  path: /api/v1/orders
  status: 201
  client_ip: 203.0.113.42      # Anonymize in logs if needed
  user_agent: "Mozilla/5.0..."
  request_id: "req_abc123"

# Business context
business:
  user_id: "usr_456"
  tenant_id: "tenant_789"
  order_id: "ord_012"
  action: "checkout"
  amount_cents: 4999
  currency: "USD"

# Error context
error:
  type: "PaymentDeclinedError"
  message: "Card declined: insufficient funds"
  code: "CARD_DECLINED"
  stack: "..."                 # Only in non-production or DEBUG level
  retry_count: 2
  retryable: true
```

Log Level Decision Tree

```
Is the process about to crash?
  → FATAL (exit after logging)

Did an operation fail that needs human attention?
  → ERROR (page someone or create ticket)

Did something unexpected happen but we recovered?
  → WARN (review in daily triage)

Is this a normal business event worth recording?
  → INFO (audit trail, business metrics)

Is this useful for debugging but noisy in production?
  → DEBUG (off in prod, on in staging)

Is this only useful when stepping through code?
  → TRACE (never in production)
```

Log Level Rules

  1. ERROR means action required — if no one needs to act on it, it's WARN
  2. INFO is for business events — not internal implementation details
  3. No logging inside tight loops — aggregate and log a summary (see the sketch after this list)
  4. Log at boundaries — API entry/exit, queue consume/publish, DB calls
  5. Never log secrets — API keys, tokens, passwords, PII (see scrubbing below)
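
A minimal sketch of rule 3, using Pino as in the logger setup later in this phase. The batch job and `handleItem` function are illustrative placeholders; the point is counting inside the loop and emitting one structured summary line at the end.

```typescript
import pino from 'pino';

const logger = pino({ level: process.env.LOG_LEVEL || 'info' });

// Placeholder for the real per-item work (assumed, not from this skill).
async function handleItem(item: string): Promise<void> {
  // ... domain logic ...
}

// One summary log line per batch instead of one line per item.
async function processBatch(items: string[]) {
  const started = Date.now();
  let succeeded = 0;
  let failed = 0;

  for (const item of items) {
    try {
      await handleItem(item);
      succeeded++;
    } catch {
      failed++; // count failures here; don't log inside the loop
    }
  }

  logger.info(
    { batch_size: items.length, succeeded, failed, duration_ms: Date.now() - started },
    'batch processed'
  );
}
```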

PII & Secret Scrubbing

```yaml
scrub_patterns:
  # Always redact
  - field_patterns: ["password", "secret", "token", "api_key", "authorization"]
    action: replace_with_redacted

  # Hash for correlation without exposure
  - field_patterns: ["email", "phone", "ssn", "national_id"]
    action: sha256_hash

  # Mask partially
  - field_patterns: ["credit_card", "card_number"]
    action: mask_last_4       # "****-****-****-1234"

  # IP anonymization
  - field_patterns: ["client_ip", "ip_address"]
    action: zero_last_octet   # 203.0.113.0
```
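
A small TypeScript sketch of two of these actions (`sha256_hash` and `mask_last_4`) plus a redact pass, using only `node:crypto`. The field names and the `scrub` function are illustrative, not a specific library's API; in practice this runs in the logger pipeline (or in Pino's `redact` option) before anything is written.

```typescript
import { createHash } from 'node:crypto';

// Hash a value so it can still be correlated across log lines without exposing it.
function sha256Hash(value: string): string {
  return createHash('sha256').update(value).digest('hex');
}

// Keep only the last four digits of a card number.
function maskLast4(value: string): string {
  const digits = value.replace(/\D/g, '');
  return `****-****-****-${digits.slice(-4)}`;
}

// Illustrative scrubber applied to a record before it reaches the logger.
function scrub(fields: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = { ...fields };
  if (typeof out.email === 'string') out.email = sha256Hash(out.email);
  if (typeof out.card_number === 'string') out.card_number = maskLast4(out.card_number);
  if ('password' in out) out.password = '[REDACTED]';
  return out;
}
```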

Logger Setup (By Language)

Node.js (Pino):

```typescript
import pino from 'pino';
import { AsyncLocalStorage } from 'node:async_hooks';

const als = new AsyncLocalStorage<Record<string, string>>();

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
  mixin: () => als.getStore() ?? {},
  redact: ['req.headers.authorization', '*.password', '*.token'],
  timestamp: pino.stdTimeFunctions.isoTime,
});

// Middleware: inject context
app.use((req, res, next) => {
  const ctx = {
    trace_id: req.headers['x-trace-id'] || crypto.randomUUID(),
    request_id: crypto.randomUUID(),
    service: 'payment-api',
    version: process.env.APP_VERSION,
  };
  als.run(ctx, () => next());
});
```

Python (structlog):

```python
import structlog

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso", utc=True),
        structlog.processors.JSONRenderer(),
    ],
)
log = structlog.get_logger()

# Bind context per-request:
structlog.contextvars.bind_contextvars(trace_id=trace_id, user_id=user_id)
```

Go (zerolog):

```go
log := zerolog.New(os.Stdout).With().
    Timestamp().
    Str("service", "payment-api").
    Str("version", version).
    Logger()

// Per-request:
reqLog := log.With().Str("trace_id", traceID).Logger()
```

Log Storage Decision

| Volume | Solution | Retention | Cost |
|--------|----------|-----------|------|
| <10 GB/day | Loki + Grafana | 30 days hot, 90 days cold | Low |
| 10-100 GB/day | Elasticsearch / OpenSearch | 14 days hot, 90 days S3 | Medium |
| 100+ GB/day | ClickHouse or Datadog | 7 days hot, 30 days archive | High |
| Budget-constrained | Loki + S3 backend | 90 days all cold | Very low |

10 Logging Anti-Patterns

| # | Anti-Pattern | Fix |
|---|--------------|-----|
| 1 | `log.error(err)` with no context | Always include: what operation, what input, what state |
| 2 | Logging request/response bodies | Log only in DEBUG; redact sensitive fields |
| 3 | String concatenation in log messages | Use structured fields: `log.info("processed", { order_id, amount })` |
| 4 | Catch-and-log-and-rethrow | Log at the boundary where you handle it, not every layer |
| 5 | Different log formats per service | Standardize schema across all services |
| 6 | No log rotation / retention policy | Set max size + TTL; archive to cold storage |
| 7 | Logging inside hot paths | Aggregate: log summary every N items or every interval |
| 8 | Missing correlation IDs | Propagate trace_id from first entry point through all services |
| 9 | Boolean log levels (`verbose: true`) | Use standard levels with configurable minimum |
| 10 | Logging PII in plain text | Implement scrubbing at the logger level |

---

Phase 2: Metrics Collection

The RED Method (Request-Driven Services)

For every service endpoint, track:

| Metric | What | Prometheus Example |
|--------|------|--------------------|
| Rate | Requests per second | `http_requests_total{method, path, status}` |
| Errors | Failed requests per second | `http_requests_total{status=~"5.."}` / total |
| Duration | Latency distribution | `http_request_duration_seconds{method, path}` (histogram) |

The USE Method (Infrastructure Resources)

For every resource (CPU, memory, disk, network):

| Metric | What | Example |
|--------|------|---------|
| Utilization | % resource busy | CPU usage 78% |
| Saturation | Queue depth / backpressure | 12 requests queued |
| Errors | Resource errors | 3 disk I/O errors |

Golden Signals (Google SRE)

| Signal | Meaning | Source |
|--------|---------|--------|
| Latency | Time to serve requests | RED Duration |
| Traffic | Demand on the system | RED Rate |
| Errors | Rate of failed requests | RED Errors |
| Saturation | How "full" the service is | USE Saturation |

Metric Types & When to Use Each

| Type | Use Case | Example |
|------|----------|---------|
| Counter | Things that only go up | Total requests, errors, bytes sent |
| Gauge | Current value that goes up/down | Active connections, queue depth, temperature |
| Histogram | Distribution of values | Request latency, response size |
| Summary | Pre-calculated percentiles | Client-side latency (when you need exact percentiles) |

Rule: Use histograms over summaries in most cases — they're aggregatable across instances.
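A sketch of RED instrumentation in a Node service using the prom-client library; the Express wiring and bucket choices are illustrative, and the metric name follows the naming conventions in the next subsection. The histogram's count gives Rate, its `status` label gives Errors, and its buckets give Duration.

```typescript
import express from 'express';
import client from 'prom-client';

const register = new client.Registry();
client.collectDefaultMetrics({ register });

// One histogram covers all three RED signals.
const httpRequestDuration = new client.Histogram({
  name: 'http_server_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'path', 'status'],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
  registers: [register],
});

const app = express();

app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  res.on('finish', () => {
    // req.route is populated once a route matched; fall back for unmatched paths.
    const path = req.route?.path ?? 'unmatched';
    end({ method: req.method, path, status: String(res.statusCode) });
  });
  next();
});

app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', register.contentType);
  res.send(await register.metrics());
});
```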

Naming Conventions

```
# Pattern: <namespace>_<subsystem>_<name>_<unit>
http_server_request_duration_seconds
http_server_requests_total
db_pool_connections_active
queue_messages_pending
cache_hit_ratio

# Rules:
# 1. Use snake_case
# 2. Include unit suffix (_seconds, _bytes, _total)
# 3. _total suffix for counters
# 4. Don't include label names in metric name
# 5. Use base units (seconds not milliseconds, bytes not kilobytes)
```

Label Design Rules

| Rule | Why | Example |
|------|-----|---------|
| Keep cardinality <100 per label | High cardinality kills performance | `status="200"` not `status="200 OK"` |
| No user IDs as labels | Unbounded cardinality | Use log correlation instead |
| No request paths with IDs | `/api/users/123` creates millions of series | Normalize: `/api/users/:id` |
| Max 5-7 labels per metric | Each combo = a time series | `{method, path, status, service}` |
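
A small sketch of the path-normalization rule: collapse ID-like segments so `/api/users/123` and `/api/users/456` land in one time series. The regexes here are illustrative; frameworks that expose a route template (as in the prom-client example above) make this unnecessary.

```typescript
// Collapse ID-like path segments into placeholders to keep label cardinality bounded.
const UUID = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
const NUMERIC = /^\d+$/;

function normalizePath(path: string): string {
  return path
    .split('/')
    .map((segment) => {
      if (NUMERIC.test(segment)) return ':id';
      if (UUID.test(segment)) return ':uuid';
      return segment;
    })
    .join('/');
}

// normalizePath('/api/users/123/orders/550e8400-e29b-41d4-a716-446655440000')
//   => '/api/users/:id/orders/:uuid'
```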

Instrumentation Checklist

```yaml
application_metrics:
  # HTTP layer
  - http_request_duration_seconds: histogram {method, path, status}
  - http_request_size_bytes: histogram {method, path}
  - http_response_size_bytes: histogram {method, path}
  - http_requests_in_flight: gauge

  # Business logic
  - orders_processed_total: counter {status, payment_method}
  - order_value_dollars: histogram {payment_method}
  - user_signups_total: counter {source}

  # Dependencies
  - db_query_duration_seconds: histogram {query_type, table}
  - db_connections_active: gauge {pool}
  - db_connections_idle: gauge {pool}
  - cache_requests_total: counter {result: hit|miss}
  - external_api_duration_seconds: histogram {service, endpoint}
  - external_api_errors_total: counter {service, error_type}

  # Queue / async
  - queue_messages_published_total: counter {queue}
  - queue_messages_consumed_total: counter {queue, status}
  - queue_processing_duration_seconds: histogram {queue}
  - queue_depth: gauge {queue}
  - queue_consumer_lag: gauge {queue, consumer_group}

infrastructure_metrics:
  # Node exporter / cAdvisor provides these automatically
  - cpu_usage_percent: gauge {instance}
  - memory_usage_bytes: gauge {instance}
  - disk_usage_bytes: gauge {instance, mount}
  - disk_io_seconds: counter {instance, device}
  - network_bytes: counter {instance, direction}
  - container_cpu_usage: gauge {pod, container}
  - container_memory_usage: gauge {pod, container}
```

Stack Recommendations

| Component | Options | Recommendation |
|-----------|---------|----------------|
| Collection | Prometheus, OTEL Collector, Datadog Agent | Prometheus (free) or OTEL Collector (vendor-neutral) |
| Storage | Prometheus, Thanos, Mimir, VictoriaMetrics | VictoriaMetrics (best cost/perf) or Mimir (Grafana ecosystem) |
| Visualization | Grafana, Datadog, New Relic | Grafana (free, extensible) |
| Alerting | Alertmanager, Grafana Alerting, PagerDuty | Alertmanager + PagerDuty routing |

---

Phase 3: Distributed Tracing

Trace Architecture

```
Client Request
  → API Gateway (root span)
      → Auth Service (child span)
      → Order Service (child span)
          → Database Query (child span)
          → Payment Service (child span)
              → Stripe API (child span)
      → Notification Service (child span)
          → Email Provider (child span)
```

OpenTelemetry Setup

Auto-instrumentation (Node.js):

```typescript
// tracing.ts — import BEFORE anything else
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4318/v1/traces',
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': { ignoreIncomingPaths: ['/health', '/ready'] },
      '@opentelemetry/instrumentation-express': { enabled: true },
    }),
  ],
  serviceName: process.env.OTEL_SERVICE_NAME || 'payment-api',
});

sdk.start();
```

Custom spans for business logic:

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('payment-service');

async function processPayment(order: Order) {
  return tracer.startActiveSpan('process-payment', async (span) => {
    span.setAttributes({
      'order.id': order.id,
      'order.amount_cents': order.amountCents,
      'payment.method': order.paymentMethod,
    });
    try {
      const result = await chargeCard(order);
      span.setAttributes({ 'payment.status': result.status });
      return result;
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      span.recordException(err);
      throw err;
    } finally {
      span.end();
    }
  });
}
```

Sampling Strategies

| Strategy | When | Config |
|----------|------|--------|
| Always On | Dev/staging, low traffic (<100 rps) | `ratio: 1.0` |
| Probabilistic | Moderate traffic (100-1000 rps) | `ratio: 0.1` (10%) |
| Rate-limited | High traffic (>1000 rps) | `max_traces_per_second: 100` |
| Tail-based | Want all errors + slow requests | Collector-side: keep if error OR duration > p99 |
| Parent-based | Respect upstream decisions | If parent sampled, child sampled |

Recommendation: Start with parent-based + probabilistic (10%). Add tail-based at the collector to capture all errors.
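
A sketch of that starting point in the Node SDK, combining a parent-based sampler with a 10% ratio; package names match the OpenTelemetry JS SDK, and the exporter/instrumentation options from the setup above are omitted here. Tail-based sampling itself is configured in the collector, not in this snippet.

```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';

const sdk = new NodeSDK({
  // Respect the upstream sampling decision; sample 10% of new root traces.
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1),
  }),
});

sdk.start();
```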

Context Propagation

| Header | Standard | Format |
|--------|----------|--------|
| `traceparent` | W3C Trace Context | `00-{trace_id}-{span_id}-{flags}` |
| `tracestate` | W3C Trace Context | Vendor-specific key-value pairs |
| `b3` | Zipkin B3 | `{trace_id}-{span_id}-{sampled}` |

Rule: Use W3C Trace Context (`traceparent`) as primary. Support B3 for legacy Zipkin systems.
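
For reference, a small sketch that builds and parses a `traceparent` header in the `00-{trace_id}-{span_id}-{flags}` format shown above. Manual propagation like this is normally handled by the OTel SDK's propagators; the helper names here are illustrative.

```typescript
import { randomBytes } from 'node:crypto';

// Build a traceparent header: version 00, 16-byte trace id, 8-byte span id, flags.
function buildTraceparent(sampled: boolean): string {
  const traceId = randomBytes(16).toString('hex');
  const spanId = randomBytes(8).toString('hex');
  const flags = sampled ? '01' : '00';
  return `00-${traceId}-${spanId}-${flags}`;
}

// Parse an incoming header so the trace id can be reused for child spans and logs.
function parseTraceparent(header: string): { traceId: string; spanId: string; sampled: boolean } | null {
  const match = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!match) return null;
  return {
    traceId: match[1],
    spanId: match[2],
    sampled: (parseInt(match[3], 16) & 1) === 1, // sampled flag is bit 0
  };
}
```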

Trace Storage

| Volume | Solution | Retention |
|--------|----------|-----------|
| <50 GB/day | Jaeger + Elasticsearch | 7 days |
| 50-500 GB/day | Tempo + S3 | 14 days |
| 500+ GB/day | Tempo + S3 with aggressive sampling | 7 days |
| Budget-constrained | Jaeger + Badger (local disk) | 3 days |

---

Phase 4: SLOs, SLIs & Error Budgets

SLI Selection by Service Type

| Service Type | Primary SLI | Secondary SLI | Measurement |
|--------------|-------------|---------------|-------------|
| API / Web | Availability + Latency | Error rate | Server-side + synthetic |
| Data pipeline | Freshness + Correctness | Throughput | Pipeline timestamps + checksums |
| Storage | Durability + Availability | Latency | Checksums + uptime monitoring |
| Streaming | Throughput + Latency | Message loss rate | Consumer lag + e2e latency |
| Batch jobs | Success rate + Freshness | Duration | Job scheduler metrics |

SLO Definition Template

```yaml
slo:
  name: "Payment API Availability"
  service: payment-api
  owner: payments-team

  sli:
    type: availability
    definition: "Proportion of non-5xx responses"
    measurement: |
      sum(rate(http_requests_total{service="payment-api",status!~"5.."}[5m]))
      /
      sum(rate(http_requests_total{service="payment-api"}[5m]))

  target: 99.95%         # 21.9 min downtime/month
  window: rolling_30d

  error_budget:
    total_minutes: 21.9  # per 30 days
    burn_rate_alerts:
      - severity: critical
        burn_rate: 14.4x   # Budget consumed in ~2 days (2% per hour)
        short_window: 5m
        long_window: 1h
      - severity: warning
        burn_rate: 6x      # Budget consumed in 5 days
        short_window: 30m
        long_window: 6h
      - severity: ticket
        burn_rate: 1x      # Budget consumed in 30 days
        short_window: 6h
        long_window: 3d

  consequences:
    budget_remaining_above_50pct: "Normal development velocity"
    budget_remaining_20_to_50pct: "Prioritize reliability work"
    budget_remaining_below_20pct: "Feature freeze; reliability only"
    budget_exhausted: "All hands on reliability until budget recovers"
```

Common SLO Targets

| Service Tier | Availability | p50 Latency | p99 Latency | Monthly Downtime |
|--------------|--------------|-------------|-------------|------------------|
| Tier 0 (payments, auth) | 99.99% | <100ms | <500ms | 4.3 min |
| Tier 1 (core API) | 99.95% | <200ms | <1s | 21.9 min |
| Tier 2 (non-critical) | 99.9% | <500ms | <2s | 43.8 min |
| Tier 3 (internal tools) | 99.5% | <1s | <5s | 3.6 hours |
| Batch / pipeline | 99% (success rate) | N/A | N/A | N/A |
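
A quick sketch of how the downtime column is derived: the error budget is `(1 - target) × window`, so 99.95% over an average-length month (~30.44 days) allows roughly 21.9 minutes of downtime. Small rounding differences versus the table come from the assumed month length.

```typescript
// Error budget in minutes for an availability target over a window of days.
// 30.44 days ≈ average month, which is what the 21.9-minute figure assumes.
function errorBudgetMinutes(availabilityTarget: number, windowDays = 30.44): number {
  return (1 - availabilityTarget) * windowDays * 24 * 60;
}

console.log(errorBudgetMinutes(0.9999).toFixed(1)); // ≈ 4.4
console.log(errorBudgetMinutes(0.9995).toFixed(1)); // ≈ 21.9
console.log(errorBudgetMinutes(0.999).toFixed(1));  // ≈ 43.8
```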

Error Budget Tracking

```yaml
# Weekly error budget review template
error_budget_review:
  week: "2026-W08"
  service: payment-api
  slo_target: 99.95%

  budget:
    total_minutes_this_period: 21.9
    consumed_minutes: 8.2
    remaining_minutes: 13.7
    remaining_percent: 62.6%

  incidents_consuming_budget:
    - date: "2026-02-18"
      duration_minutes: 5.1
      cause: "Database connection pool exhaustion"
      preventable: true
      action: "Increase pool size + add saturation alert"
    - date: "2026-02-20"
      duration_minutes: 3.1
      cause: "Upstream payment provider timeout"
      preventable: false
      action: "Add circuit breaker with fallback"

  velocity_decision: "Normal — 62.6% budget remaining"

  reliability_work_this_week:
    - "Add connection pool saturation alert"
    - "Implement circuit breaker for payment provider"
```

---

Phase 5: Alert Design

Alert Quality Principles

  1. Every alert must be actionable — if no one needs to act, it's not an alert
  2. Every alert needs a runbook — linked directly in the alert annotation
  3. Symptom-based over cause-based — alert on "users can't checkout" not "CPU high"
  4. Multi-window burn rate — not static thresholds (see SLO alerts above)
  5. Alert on absence, not just presence — "no orders in 15 min" catches silent failures

Alert Severity Levels

| Severity | Response Time | Channel | Who | Example |
|----------|---------------|---------|-----|---------|
| P0 — Critical | <5 min | Page (PagerDuty/Opsgenie) | On-call engineer | Payment system down |
| P1 — High | <30 min | Page during business hours, Slack 24/7 | On-call | Error rate >5% for 10 min |
| P2 — Medium | <4 hours | Slack channel | Team | p99 latency degraded 2x |
| P3 — Low | Next business day | Ticket auto-created | Team backlog | Disk usage >80% |
| Info | N/A | Dashboard only | No one | Deploy completed |

Alerting Anti-Patterns

| Anti-Pattern | Problem | Fix |
|--------------|---------|-----|
| Static CPU/memory thresholds | Noisy, not user-impacting | Use SLO-based burn rate alerts |
| Alert per instance | 50 instances = 50 alerts for same issue | Aggregate: alert on service-level error rate |
| No deduplication | Same alert fires 100 times | Group by service + alert name; set repeat interval |
| Missing runbook | Engineer gets paged, doesn't know what to do | Every alert links to a runbook |
| Threshold too sensitive | Fires on brief spikes | Use `for: 5m` to require sustained condition |
| Too many P0s | Alert fatigue → ignoring real incidents | Audit monthly; demote or remove noisy alerts |

Alert Template (Prometheus Alertmanager)

```yaml
groups:
  - name: payment-api-slo
    rules:
      - alert: PaymentAPIHighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{service="payment-api",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="payment-api"}[5m]))
          ) > 0.01
        for: 5m
        labels:
          severity: critical
          service: payment-api
          team: payments
        annotations:
          summary: "Payment API error rate {{ $value | humanizePercentage }} (>1%)"
          description: "5xx error rate has exceeded 1% for 5 minutes"
          runbook: "https://wiki.internal/runbooks/payment-api-errors"
          dashboard: "https://grafana.internal/d/payment-api"

      - alert: PaymentAPINoTraffic
        expr: |
          sum(rate(http_requests_total{service="payment-api"}[15m])) == 0
        for: 5m
        labels:
          severity: critical
          service: payment-api
        annotations:
          summary: "Payment API receiving zero traffic for 5 minutes"
          runbook: "https://wiki.internal/runbooks/payment-api-no-traffic"

      - alert: PaymentAPILatencyHigh
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{service="payment-api"}[5m])) by (le)
          ) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Payment API p99 latency {{ $value }}s (>2s for 10min)"
          runbook: "https://wiki.internal/runbooks/payment-api-latency"
```

Runbook Template

```markdown
# Runbook: PaymentAPIHighErrorRate

## What This Alert Means
The payment API is returning >1% 5xx errors over a 5-minute window.
Users are likely failing to complete checkouts.

## Impact
- Users cannot process payments
- Revenue loss: ~$X per minute (based on average traffic)
- SLO: Payment API availability (target: 99.95%)

## Immediate Actions
1. Check the error dashboard: [link]
2. Check recent deploys: `kubectl rollout history deployment/payment-api`
3. Check upstream dependencies:
   - Database: [dashboard link]
   - Stripe API: [status page]
   - Redis cache: [dashboard link]
4. Check application logs:
   `kubectl logs -l app=payment-api --since=10m | jq 'select(.level=="error")'`

## Common Causes & Fixes
| Cause | Diagnosis | Fix |
|-------|-----------|-----|
| Bad deploy | Errors started at deploy time | `kubectl rollout undo deployment/payment-api` |
| DB connection exhaustion | `db_connections_active` at max | Restart pods (rolling) + increase pool size |
| Stripe outage | Stripe status page red | Enable fallback payment processor |
| Memory leak | Memory climbing, OOMKilled events | Rolling restart + investigate |

## Escalation
- If unresolved after 15 min: page payment team lead
- If revenue impact >$10K: page VP Engineering
- If Stripe outage: communicate to support team for customer messaging

## Resolution
- Confirm error rate <0.1% for 10 min
- Post in #incidents: root cause + duration + impact
- Schedule post-mortem if downtime >5 min
```

---

Phase 6: Dashboard Architecture

Dashboard Hierarchy

```
L1: Executive / Business Dashboard   (non-technical stakeholders)
        ↓
L2: Service Overview Dashboard       (on-call, quick triage)
        ↓
L3: Service Deep-Dive Dashboard      (debugging specific service)
        ↓
L4: Infrastructure Dashboard         (resource-level details)
```

L1: Business Dashboard

```yaml
panels:
  - title: "Revenue per Minute"
    type: stat
    query: "sum(rate(orders_total{status='completed'}[5m])) * avg(order_value_dollars)"
  - title: "Active Users (5min)"
    type: stat
    query: "count(count by (user_id) (http_requests_total{...}[5m]))"
  - title: "Checkout Success Rate"
    type: gauge
    query: "sum(rate(checkout_total{status='success'}[1h])) / sum(rate(checkout_total[1h]))"
    thresholds: [95, 98, 99.5]
  - title: "Error Budget Remaining"
    type: gauge
    query: "1 - (error_budget_consumed / error_budget_total)"
```

L2: Service Overview Dashboard

Every service gets one of these with identical layout:

```yaml
row_1_traffic:
  - "Request Rate (rps)" — timeseries, by status code
  - "Error Rate (%)" — timeseries, threshold line at SLO
  - "Active Requests" — gauge

row_2_latency:
  - "Latency Distribution" — heatmap
  - "p50 / p95 / p99" — timeseries, threshold lines
  - "Latency by Endpoint" — table, sorted by p99

row_3_dependencies:
  - "Downstream Latency" — timeseries per dependency
  - "Downstream Error Rate" — timeseries per dependency
  - "Database Query Duration" — timeseries by query type

row_4_resources:
  - "CPU Usage" — timeseries per pod
  - "Memory Usage" — timeseries per pod
  - "Pod Restarts" — stat

row_5_business:
  - "Business Metric 1" — service-specific
  - "Business Metric 2" — service-specific
```

Dashboard Rules

  1. Time range default: last 1 hour — most debugging happens in recent time
  2. Variable selectors at top: environment, service, instance
  3. Consistent color coding: green=good, yellow=degraded, red=bad across all dashboards
  4. Link alerts to dashboards — every alert annotation includes dashboard URL
  5. No more than 15 panels per dashboard — split into L3 if needed
  6. Include "as of" timestamp — so screenshots in incidents are unambiguous
  7. Dashboard as code — store Grafana JSON in git, provision via API

---

Phase 7: Incident Response

Incident Severity Classification

| Severity | Criteria | Response | Communication |
|----------|----------|----------|---------------|
| SEV-1 | Service down, data loss risk, security breach | All hands, war room | Status page update every 15 min |
| SEV-2 | Degraded service, SLO at risk, partial outage | On-call + backup | Status page update every 30 min |
| SEV-3 | Minor degradation, workaround exists | On-call during hours | Internal Slack update |
| SEV-4 | Cosmetic, low impact | Next sprint | None |

Incident Roles

| Role | Responsibility | Who |
|------|----------------|-----|
| Incident Commander (IC) | Owns the incident. Coordinates. Makes decisions. | On-call lead |
| Technical Lead | Diagnoses and fixes. Communicates technical status to IC. | Senior engineer |
| Communications Lead | Updates status page, Slack, stakeholders. | Product/support |
| Scribe | Documents timeline, actions, decisions in real-time. | Anyone available |

Incident Response Workflow

```
1. DETECT
   - Alert fires → on-call paged
   - Customer report → support escalates
   - Internal discovery → engineer reports

2. TRIAGE (first 5 minutes)
   - Confirm the issue is real (not false alert)
   - Classify severity (SEV-1 through SEV-4)
   - Open incident channel: #inc-YYYY-MM-DD-short-description
   - Assign roles (IC, Tech Lead, Comms)

3. MITIGATE (next 5-30 minutes)
   - Goal: STOP THE BLEEDING, not find root cause
   - Options (try in order):
     a. Rollback last deploy
     b. Scale up / restart pods
     c. Toggle feature flag off
     d. Redirect traffic / enable fallback
     e. Manual data fix
   - Document every action with timestamp

4. STABILIZE
   - Confirm mitigation is working (metrics back to normal)
   - Monitor for 15-30 min for recurrence
   - Update status page: "Monitoring fix"

5. RESOLVE
   - Confirm all metrics healthy for 30+ min
   - Update status page: "Resolved"
   - Schedule post-mortem (within 48 hours for SEV-1/2)
   - Send internal summary to stakeholders
```

Incident Channel Template

```
📋 Incident: Payment API 5xx Errors
🔴 Severity: SEV-2
🕐 Started: 2026-02-22 14:23 UTC
👤 IC: @alice
🔧 Tech Lead: @bob
📢 Comms: @charlie

Status: MITIGATING
Impact: ~5% of checkout requests failing
Customer-facing: Yes

Timeline:
14:23 — Alert fired: PaymentAPIHighErrorRate
14:25 — IC assigned: @alice, confirmed real via dashboard
14:28 — Tech Lead: error logs show connection pool exhaustion post-deploy
14:31 — Rolled back deployment v2.3.1 → v2.3.0
14:35 — Error rate dropping, monitoring
14:50 — Error rate <0.1%, marking resolved
```

---

Phase 8: Post-Mortem Framework

Blameless Post-Mortem Template

```yaml
post_mortem:
  title: "Payment API Connection Pool Exhaustion"
  date: "2026-02-22"
  severity: SEV-2
  duration: 27 minutes (14:23 — 14:50 UTC)
  authors: ["@alice", "@bob"]
  reviewers: ["@engineering-leads"]
  status: action_items_in_progress

  summary: |
    A deployment at 14:15 introduced a connection leak in the payment API.
    Connection pool was exhausted by 14:23, causing 5xx errors for ~5% of
    checkout requests. Rolled back at 14:31; recovered by 14:50.

  impact:
    user_impact: "~340 users saw checkout failures over 27 minutes"
    revenue_impact: "$2,100 estimated (based on average order value × failed checkouts)"
    slo_impact: "Consumed 5.1 min of 21.9 min monthly error budget (23%)"
    data_impact: "No data loss. 12 orders failed; users could retry successfully."

  timeline:
    - time: "14:15"
      event: "Deploy v2.3.1 rolled out (3/3 pods updated)"
    - time: "14:23"
      event: "PaymentAPIHighErrorRate alert fired"
    - time: "14:25"
      event: "IC assigned, confirmed via dashboard"
    - time: "14:28"
      event: "Root cause identified: new ORM query not releasing connections"
    - time: "14:31"
      event: "Rollback initiated: v2.3.1 → v2.3.0"
    - time: "14:35"
      event: "Error rate declining"
    - time: "14:50"
      event: "Resolved: error rate <0.1% sustained"

  root_cause: |
    The v2.3.1 deploy introduced a new database query in the order validation
    path. The query used a raw connection instead of the pool's managed client,
    so connections were acquired but never released. Under load, the pool
    exhausted within 8 minutes.

  contributing_factors:
    - "No integration test for connection pool behavior under load"
    - "Connection pool saturation metric existed but had no alert"
    - "Code review didn't catch raw connection usage"

  what_went_well:
    - "Alert fired within 8 minutes of deploy"
    - "IC assigned in 2 minutes"
    - "Root cause identified in 3 minutes (clear in logs)"
    - "Rollback executed cleanly"

  what_went_wrong:
    - "8-minute detection gap after deploy"
    - "No canary deployment to catch before full rollout"
    - "Connection pool saturation had no alert"

  action_items:
    - action: "Add connection pool saturation alert (>80% for 2 min)"
      owner: "@bob"
      priority: P1
      due: "2026-02-25"
      status: in_progress
      ticket: "ENG-1234"
    - action: "Enable canary deployments for payment-api"
      owner: "@alice"
      priority: P1
      due: "2026-03-01"
      ticket: "ENG-1235"
    - action: "Add linting rule: no raw DB connections in application code"
      owner: "@charlie"
      priority: P2
      due: "2026-03-07"
      ticket: "ENG-1236"
    - action: "Load test payment-api connection pool in staging"
      owner: "@bob"
      priority: P2
      due: "2026-03-07"
      ticket: "ENG-1237"

  lessons_learned:
    - "Resource saturation metrics need alerts, not just dashboards"
    - "Canary deployments are mandatory for Tier 0 services"
    - "ORM abstractions don't guarantee connection safety — review raw queries"
```

Post-Mortem Meeting Agenda (60 minutes)

```
(5 min)  Context setting — IC reads the summary
(15 min) Timeline walkthrough — what happened, when, by whom
(15 min) Root cause deep-dive — 5 Whys exercise
(5 min)  What went well — celebrate good response
(15 min) Action items — assign owners, priorities, due dates
(5 min)  Wrap-up — review date for action item check-in
```

5 Whys Exercise

```
Problem: 5xx errors in payment API

Why 1: Database connections were exhausted
Why 2: A new query acquired connections without releasing them
Why 3: The query used a raw connection instead of the pool manager
Why 4: The ORM's raw query API doesn't auto-release (by design)
Why 5: We don't have a linting rule or code review checklist item for this

Root cause: Missing guard against raw connection usage in application code
Systemic fix: Linting rule + connection pool saturation alerting
```

---

Phase 9: On-Call Operations

On-Call Structure

```yaml
on_call:
  rotation: weekly
  handoff_day: Monday 10:00 UTC

  primary:
    response_time: 5 minutes (SEV-1/2), 30 minutes (SEV-3)
    escalation_after: 15 minutes no-ack
  secondary:
    response_time: 15 minutes (SEV-1), 1 hour (SEV-2/3)
    escalation_after: 30 minutes no-ack
  manager_escalation:
    trigger: SEV-1 unresolved after 30 minutes

  handoff_checklist:
    - Review open incidents and active alerts
    - Check error budget status for all services
    - Read post-mortems from previous week
    - Verify PagerDuty schedule and contact info
    - Test alert routing (send test page)
```

On-Call Health Metrics

| Metric | Healthy | Needs Attention | Unhealthy |
|--------|---------|-----------------|-----------|
| Pages per week | <5 | 5-15 | >15 |
| After-hours pages per week | <2 | 2-5 | >5 |
| False positive rate | <10% | 10-30% | >30% |
| Mean time to acknowledge | <5 min | 5-15 min | >15 min |
| Mean time to resolve | <30 min | 30-120 min | >120 min |
| Toil ratio (manual vs automated) | <30% | 30-60% | >60% |

Weekly On-Call Review Template

```yaml
on_call_review:
  week: "2026-W08"
  engineer: "@bob"

  incidents:
    total: 7
    sev_1: 0
    sev_2: 1
    sev_3: 4
    false_positives: 2
    after_hours: 3

  time_spent:
    incident_response: "4.5 hours"
    toil_automation: "2 hours"
    runbook_updates: "1 hour"

  improvements_made:
    - "Silenced noisy disk alert on dev servers"
    - "Added auto-remediation for pod restart threshold"

  improvements_needed:
    - "Cache expiry alert fires every Tuesday at 03:00 — needs investigation"
    - "Payment retry logic needs circuit breaker (caused 3 alerts)"

  handoff_notes: |
    Watch payment-api p99 latency — it's been creeping up since Wednesday.
    Stripe changed their sandbox endpoints; staging may throw errors.
```

---

Phase 10: Chaos Engineering & Reliability Testing

Chaos Principles

  1. Start with a hypothesis: "If X fails, the system should Y"
  2. Run in production (start small — one instance, one AZ)
  3. Minimize blast radius with automatic rollback
  4. Build confidence incrementally: staging → canary → production

Chaos Experiment Template

```yaml
chaos_experiment:
  name: "Payment DB failover"
  hypothesis: "If the primary database becomes unavailable, traffic should failover to the replica within 30 seconds with <1% error rate spike"

  steady_state:
    - metric: "checkout_success_rate"
      expected: ">99.5%"
    - metric: "db_query_duration_p99"
      expected: "<200ms"

  injection:
    type: "network_partition"
    target: "payment-db-primary"
    duration: "5 minutes"
    blast_radius: "single AZ"

  abort_conditions:
    - "checkout_success_rate < 95% for > 60 seconds"
    - "revenue_per_minute drops > 50%"
    - "any SEV-1 incident declared"

  results:
    failover_time: "22 seconds"
    error_spike: "0.3% for 25 seconds"
    hypothesis_confirmed: true

  follow_up_actions:
    - "Document failover behavior in runbook"
    - "Add failover time as SLI (target: <30s)"
```

Chaos Engineering Maturity Levels

| Level | What You Test | Tools |
|-------|---------------|-------|
| 1: Manual | Kill a pod, see what happens | `kubectl delete pod` |
| 2: Automated | Scheduled pod kills, network delays | Chaos Monkey, Litmus |
| 3: Game Days | Multi-failure scenarios with team exercise | Custom scripts + coordination |
| 4: Continuous | Automated chaos in production with auto-rollback | Gremlin, Chaos Mesh |

---

Phase 11: Observability Cost Optimization

Cost Drivers (Ranked)

| # | Driver | Typical % of Bill | Optimization |
|---|--------|-------------------|--------------|
| 1 | Log volume | 40-60% | Reduce verbosity, drop DEBUG, sample repetitive |
| 2 | Metric cardinality | 15-25% | Drop unused metrics, limit labels |
| 3 | Trace volume | 10-20% | Sampling, tail-based sampling |
| 4 | Retention | 10-15% | Tiered storage (hot → warm → cold) |
| 5 | Query cost | 5-10% | Optimize dashboard queries, set max scan limits |

Cost Reduction Checklist

```yaml
cost_optimization:
  logs:
    - action: "Drop DEBUG/TRACE in production"
      savings: "30-50% of log volume"
    - action: "Sample health check logs (1:100)"
      savings: "5-15% of log volume"
    - action: "Deduplicate identical error bursts"
      savings: "10-20% during incidents"
    - action: "Move logs older than 7 days to S3/cold storage"
      savings: "60-80% of storage cost"
    - action: "Drop request/response body logging"
      savings: "20-40% of log volume"

  metrics:
    - action: "Audit unused metrics (no dashboard, no alert)"
      savings: "10-30% of series"
    - action: "Reduce histogram bucket count (default 11 → 8)"
      savings: "~27% of histogram series"
    - action: "Remove high-cardinality labels"
      savings: "Variable — can be massive"
    - action: "Increase scrape interval for non-critical metrics (15s → 60s)"
      savings: "75% of data points for those metrics"

  traces:
    - action: "Implement tail-based sampling"
      savings: "80-95% of trace volume"
    - action: "Drop internal health check traces"
      savings: "5-20% of trace volume"
    - action: "Reduce span attribute size (truncate long strings)"
      savings: "10-30% of trace storage"

  general:
    - action: "Review and right-size retention policies quarterly"
    - action: "Set query timeouts and result limits on dashboards"
    - action: "Use recording rules for expensive queries"
```
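
As one concrete example of the log actions above, a hedged sketch of 1:100 health-check sampling done in the application before logs are shipped; the paths and counter are illustrative, and a log router or collector can apply the same policy centrally instead.

```typescript
import pino from 'pino';

const logger = pino();

let healthCheckCount = 0;

// Log roughly 1 in 100 health-check requests; everything else logs normally.
function logRequest(path: string, fields: Record<string, unknown>) {
  if (path === '/health' || path === '/ready') {
    healthCheckCount++;
    if (healthCheckCount % 100 !== 0) return;
    logger.info({ ...fields, sampled: true, sample_rate: 100 }, 'health check (sampled)');
    return;
  }
  logger.info(fields, 'request completed');
}
```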

Monthly Cost Review Template

```yaml
observability_cost_review:
  month: "February 2026"
  total_cost: "$X,XXX"

  breakdown:
    logs:           { volume: "X TB", cost: "$X", pct: "X%" }
    metrics:        { series: "X million", cost: "$X", pct: "X%" }
    traces:         { volume: "X TB", cost: "$X", pct: "X%" }
    infrastructure: { instances: X, cost: "$X", pct: "X%" }

  cost_per:
    request: "$0.000X"
    service: "$X average"
    engineer: "$X per engineer"

  optimizations_applied: []
  optimizations_planned: []
  budget_status: "on_track | over_budget | under_budget"
```

---

Phase 12: Advanced Patterns

Correlation: Connecting the Three Pillars

```
Every log line includes:   trace_id, span_id
Every trace span includes: service, operation
Every metric includes:     service label

Correlation paths:

Alert fires (metric)
  → Click → Dashboard (metric)
  → Filter by time window
  → Trace search (same service + time)
  → Find failing trace
  → Logs (filter by trace_id)
  → See exact error

Support ticket (user report)
  → Find request_id in logs
  → Extract trace_id
  → View full trace
  → Identify slow span
  → Check span's service metrics
  → Confirm pattern
```
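
A sketch of the first requirement — every log line carries the active trace and span IDs — combining the OpenTelemetry API with the Pino mixin pattern from Phase 1. It assumes the tracing SDK is already initialized; when no span is active, the mixin simply adds nothing.

```typescript
import pino from 'pino';
import { trace } from '@opentelemetry/api';

// Pino calls mixin() for every log line; pull IDs from the active span if one exists.
const logger = pino({
  mixin() {
    const span = trace.getActiveSpan();
    if (!span) return {};
    const ctx = span.spanContext();
    return { trace_id: ctx.traceId, span_id: ctx.spanId };
  },
});

logger.info({ order_id: 'ord_012' }, 'payment processed');
// => {"level":30,"trace_id":"...","span_id":"...","order_id":"ord_012","msg":"payment processed"}
```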

Synthetic Monitoring

```yaml
synthetic_checks:
  - name: "Checkout flow"
    type: browser
    frequency: 5m
    locations: [us-east, eu-west, ap-southeast]
    steps:
      - navigate: "https://app.example.com/products"
      - click: "Add to Cart"
      - click: "Checkout"
      - assert: "Order confirmation page loads in <3s"
    alert_on: "2 consecutive failures from same location"

  - name: "API health"
    type: api
    frequency: 1m
    endpoints:
      - url: "https://api.example.com/health"
        expected_status: 200
        max_latency_ms: 500
      - url: "https://api.example.com/v1/products?limit=1"
        expected_status: 200
        max_latency_ms: 1000
```
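
A minimal sketch of the API-style check using the global `fetch` available in Node 18+; the endpoints and thresholds mirror the config above, and the probe function is illustrative rather than any particular vendor's SDK. In practice the result would be pushed as a metric or alert rather than printed.

```typescript
// Probe one endpoint and report status plus latency against a budget.
async function probe(url: string, expectedStatus: number, maxLatencyMs: number) {
  const started = Date.now();
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(maxLatencyMs * 2) });
    const latencyMs = Date.now() - started;
    const ok = res.status === expectedStatus && latencyMs <= maxLatencyMs;
    return { url, ok, status: res.status, latencyMs };
  } catch {
    return { url, ok: false, status: 0, latencyMs: Date.now() - started };
  }
}

async function runChecks() {
  const results = await Promise.all([
    probe('https://api.example.com/health', 200, 500),
    probe('https://api.example.com/v1/products?limit=1', 200, 1000),
  ]);

  // Alert (or push a metric) when any probe fails.
  if (results.some((r) => !r.ok)) {
    console.error('synthetic check failed', results.filter((r) => !r.ok));
  }
}

runChecks();
```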

Feature Flag Observability

```yaml
# Correlate feature flags with metrics
feature_flag_monitoring:
  - flag: "new_checkout_flow"
    metrics_to_compare:
      - "checkout_conversion_rate"   # by flag variant
      - "checkout_error_rate"
      - "checkout_latency_p99"
    alerts:
      - "If error rate for new variant > 2x control, auto-disable flag"
```
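
A sketch of that auto-disable condition as plain comparison logic; the stats shape is illustrative and the 2x threshold mirrors the config above. The actual disable call depends on whichever flag service is in use.

```typescript
interface VariantStats {
  requests: number;
  errors: number;
}

// Disable the new variant when its error rate is more than 2x the control's.
function shouldDisableFlag(control: VariantStats, variant: VariantStats): boolean {
  if (control.requests === 0 || variant.requests === 0) return false;
  const controlRate = control.errors / control.requests;
  const variantRate = variant.errors / variant.requests;
  return controlRate > 0 ? variantRate > 2 * controlRate : variantRate > 0.01;
}

// Example: control at 0.5% errors, new variant at ~1.4% → disable.
shouldDisableFlag({ requests: 10_000, errors: 50 }, { requests: 9_800, errors: 137 }); // true
```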

Observability Maturity Model

| Dimension | Level 1 | Level 2 | Level 3 | Level 4 |
|-----------|---------|---------|---------|---------|
| Logging | Unstructured logs | Structured JSON, centralized | Correlated with traces | Automated log analysis |
| Metrics | Basic infra metrics | RED/USE for services | SLO-based with error budgets | Predictive (anomaly detection) |
| Tracing | No tracing | Key services instrumented | Full distributed tracing | Trace-driven testing |
| Alerting | Static thresholds | Multi-signal alerts | Burn-rate based on SLOs | Auto-remediation |
| Incident Response | Ad hoc | Defined process + roles | Post-mortems with action tracking | Chaos engineering in prod |
| Culture | "Ops team handles it" | Shared ownership (you build it, you run it) | SLO-driven development velocity | Reliability as a feature |

---

Quality Scoring Rubric (0-100)

| Dimension | Weight | 0 | 5 | 10 |
|-----------|--------|---|---|----|
| Logging quality | 15% | Unstructured, no correlation | Structured JSON, missing fields | Full schema, trace correlation, PII scrubbing |
| Metrics coverage | 15% | No metrics | RED or USE, not both | RED + USE + business metrics + custom |
| Tracing completeness | 10% | No tracing | Key services | Full path, sampling strategy, tail-based |
| SLO maturity | 15% | No reliability targets | Informal targets | SLOs with error budgets, burn-rate alerts, weekly review |
| Alert quality | 15% | Noisy/missing | Actionable, some runbooks | SLO-based, full runbooks, low false positive |
| Incident response | 10% | Ad hoc | Defined process | Full process, roles, post-mortems, chaos engineering |
| Dashboard design | 10% | No dashboards | Basic panels | Hierarchical L1-L4, consistent, linked to alerts |
| Cost efficiency | 10% | Unknown cost | Tracked | Optimized, reviewed monthly, within budget |

- 90-100: World-class. Teach others.
- 70-89: Production-ready. Fill specific gaps.
- 50-69: Functional but fragile.
- <50: Significant reliability risk.

---

10 Observability Commandments

  1. Structured or it didn't happen — unstructured logs are technical debt
  2. Correlate everything — trace_id connects logs, traces, and metrics
  3. Alert on symptoms, not causes — users don't care about CPU, they care about latency
  4. Every alert gets a runbook — no runbook = no alert
  5. SLOs drive velocity — error budgets decide when to ship vs stabilize
  6. Dashboards have hierarchy — executives don't need pod CPU graphs
  7. Blameless post-mortems always — blame prevents learning
  8. Cost is a feature — observability that bankrupts you isn't observability
  9. You build it, you run it — the team that ships code owns its observability
  10. Practice failure — chaos engineering builds confidence

---

12 Natural Language Commands

| Command | What It Does |
|---------|--------------|
| "Audit our observability" | Run the /16 health check, score each dimension, prioritize gaps |
| "Design logging for [service]" | Generate structured log schema with context fields for the service |
| "Set up metrics for [service]" | Create RED + USE + business metric instrumentation plan |
| "Create SLOs for [service]" | Define SLIs, targets, error budgets, and burn-rate alert rules |
| "Design alerts for [service]" | Create alert rules with severity, thresholds, and runbook templates |
| "Build dashboard for [service]" | Design L2 service overview dashboard with panel specifications |
| "Write a runbook for [alert]" | Generate structured runbook with diagnosis steps and fixes |
| "Run post-mortem for [incident]" | Generate blameless post-mortem document with timeline and action items |
| "Set up on-call for [team]" | Design rotation, escalation policy, handoff checklist |
| "Plan chaos experiment for [scenario]" | Design experiment with hypothesis, injection, abort conditions |
| "Optimize observability costs" | Audit current spend, identify top savings, create reduction plan |
| "Design tracing for [system]" | Create OpenTelemetry instrumentation plan with sampling strategy |

---

⚡ Level Up Your Observability

This skill gives you the methodology. For industry-specific implementation patterns:

🔗 More Free Skills by AfrexAI

  • `afrexai-devops-engine` — CI/CD, infrastructure, deployment strategies
  • `afrexai-api-architect` — API design, security, versioning
  • `afrexai-database-engineering` — Schema design, query optimization, migrations
  • `afrexai-code-reviewer` — Code review methodology with SPEAR framework
  • `afrexai-prompt-engineering` — System prompt design, testing, optimization

Browse all AfrexAI skills: clawhub.com | Full storefront

Use Cases

  • Set up structured logging and distributed tracing for production services
  • Define and implement Service Level Objectives (SLOs) for reliability engineering
  • Build incident response playbooks with clear escalation procedures
  • Score your current observability posture using the built-in health check
  • Design alerting rules that reduce noise and catch real issues

Pros & Cons

Pros

  • +Comprehensive coverage from logging to tracing to SLO-driven development
  • +Includes a 16-point health check for quick observability posture assessment
  • +Covers incident response alongside monitoring — not just data collection

Cons

  • -Methodology and framework only — does not deploy actual monitoring infrastructure
  • -Requires existing familiarity with observability concepts and tooling
  • -No specific tool integrations (Datadog, Grafana, etc.) — platform-agnostic but generic

FAQ

What does Observability & Reliability Engineering do?
Complete observability & reliability engineering system. Use when designing monitoring, implementing structured logging, setting up distributed tracing, buil...
What platforms support Observability & Reliability Engineering?
Observability & Reliability Engineering is available on Claude Code, OpenClaw.
What are the use cases for Observability & Reliability Engineering?
Set up structured logging and distributed tracing for production services. Define and implement Service Level Objectives (SLOs) for reliability engineering. Build incident response playbooks with clear escalation procedures.

