# Logging & Observability
Patterns for building observable systems across the three pillars: logs, metrics, and traces. Use this skill when implementing logging infrastructure, setting up distributed tracing with OpenTelemetry, designing metrics collection (RED/USE methods), configuring alerting and dashboards, or reviewing observability practices.
## Three Pillars

| Pillar | Purpose | Question It Answers | Example |
|--------|---------|---------------------|---------|
| Logs | What happened | Why did this request fail? | `{"level":"error","msg":"payment declined","user_id":"u_82"}` |
| Metrics | How much / how fast | Is latency increasing? | `http_request_duration_seconds{route="/api/orders"} 0.342` |
| Traces | Request flow | Where is the bottleneck? | Span: `api-gateway → auth → order-service → db` |
The pillars are strongest when correlated: embed `trace_id` in every log line so you can jump from a log entry to the full distributed trace.
---
## Structured Logging

Always emit logs as structured JSON — never free-text strings.
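As a minimal illustration, assuming Pino (covered in the library table below), the same event logged as free text versus structured fields:

```typescript
import pino from 'pino';

const logger = pino();
const userId = 'u_82';

// Free text: the ID is trapped inside the string and cannot be queried.
logger.info(`user ${userId} logged in`);

// Structured: user_id is a first-class field the log backend can index.
logger.info({ user_id: userId }, 'user logged in');
```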
### Required Fields

| Field | Purpose | Required |
|-------|---------|----------|
| `timestamp` | ISO-8601 with milliseconds | Yes |
| `level` | Severity (DEBUG … FATAL) | Yes |
| `service` | Originating service name | Yes |
| `message` | Human-readable description | Yes |
| `trace_id` | Distributed trace correlation | Yes |
| `span_id` | Current span within trace | Yes |
| `correlation_id` | Business-level correlation (e.g. order ID) | When applicable |
| `error` | Structured error object | On errors |
| `context` | Request-specific metadata | Recommended |
### Context Enrichment

Attach context at the middleware level so downstream logs inherit it automatically:

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';
import { randomUUID } from 'node:crypto';

const asyncLocalStorage = new AsyncLocalStorage<Record<string, unknown>>();

app.use((req, res, next) => {
  const ctx = {
    trace_id: req.headers['x-trace-id'] || randomUUID(),
    request_id: randomUUID(),
    user_id: req.user?.id,
    method: req.method,
    path: req.path,
  };
  asyncLocalStorage.run(ctx, () => next());
});
```
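Downstream loggers can then read that store on every call. With Pino, the `mixin` hook is one way to merge it in automatically (a sketch, assuming the `asyncLocalStorage` instance above):

```typescript
import pino from 'pino';

// `mixin` runs on every log call; returning the ambient request context
// makes each entry inherit trace_id, request_id, user_id, and so on.
const logger = pino({
  mixin: () => asyncLocalStorage.getStore() ?? {},
});

// Anywhere below the middleware:
logger.info({ order_id: 'ORD-1234' }, 'order placed');
// => {"level":30,...,"trace_id":"...","request_id":"...","order_id":"ORD-1234","msg":"order placed"}
```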
### Library Recommendations

| Library | Language | Strengths | Perf |
|---------|----------|-----------|------|
| Pino | Node.js | Fastest Node logger, low overhead | Excellent |
| structlog | Python | Composable processors, context binding | Good |
| zerolog | Go | Zero-allocation JSON logging | Excellent |
| zap | Go | High performance, typed fields | Excellent |
| tracing | Rust | Spans + events, async-aware | Excellent |
Choose a logger that outputs structured JSON natively. Avoid loggers requiring post-processing.
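For example, a minimal Pino setup that emits the required fields above; the service name is a placeholder:

```typescript
import pino from 'pino';

const logger = pino({
  level: process.env.LOG_LEVEL ?? 'info',      // production default: INFO
  base: { service: 'order-service' },          // stamped on every entry
  timestamp: pino.stdTimeFunctions.isoTime,    // ISO-8601 timestamps
  formatters: {
    level: (label) => ({ level: label }),      // "info" instead of numeric 30
  },
});

logger.info({ order_id: 'ORD-1234' }, 'order placed');
```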
---
## Log Levels

| Level | When to Use | Example |
|-------|-------------|---------|
| FATAL | App cannot continue, process will exit | Database connection pool exhausted |
| ERROR | Operation failed, needs attention | Payment charge failed: CARD_DECLINED |
| WARN | Unexpected but recoverable | Retry 2/3 for upstream timeout |
| INFO | Normal business events | Order ORD-1234 placed successfully |
| DEBUG | Developer troubleshooting | Cache miss for key `user:82:preferences` |
| TRACE | Very fine-grained (rarely enabled in prod) | Entering `validateAddress` with payload |
Rules:
- Production default is INFO and above.
- If you log an ERROR, someone should act on it.
- Every FATAL should trigger an alert.
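The level should also be adjustable at runtime (see the checklist below). With Pino this is a plain property assignment; the admin route here is a hypothetical sketch and would need authentication and a JSON body parser in practice:

```typescript
// PUT /admin/log-level  {"level": "debug"}  -- hypothetical endpoint
app.put('/admin/log-level', (req, res) => {
  const level = String(req.body.level);
  if (!(level in logger.levels.values)) {
    return res.status(400).json({ error: `unknown level: ${level}` });
  }
  logger.level = level; // Pino applies the change immediately
  res.json({ level: logger.level });
});
```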
---
## Distributed Tracing

### OpenTelemetry Setup

Always prefer OpenTelemetry over vendor-specific SDKs:

```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  serviceName: 'order-service',
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```
### Span Creation

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('order-service');

async function processOrder(order: Order) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    try {
      span.setAttribute('order.id', order.id);
      span.setAttribute('order.total_cents', order.totalCents);
      await validateInventory(order);
      await chargePayment(order);
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (err) {
      const e = err instanceof Error ? err : new Error(String(err));
      span.setStatus({ code: SpanStatusCode.ERROR, message: e.message });
      span.recordException(e);
      throw err;
    } finally {
      span.end();
    }
  });
}
```
### Context Propagation
- Use W3C Trace Context (`traceparent` header) — default in OTel
- Propagate across HTTP, gRPC, and message queues
- For async workers: serialise `traceparent` into the job payload, as sketched below
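A sketch of carrying trace context through a queue with the OTel propagation API (`enqueue`, `handleJob`, and `processPayload` are illustrative names, not library functions):

```typescript
import { context, propagation } from '@opentelemetry/api';

declare function processPayload(payload: object): Promise<void>; // app-specific work

// Producer: inject the active trace context into the job payload.
function enqueue(queue: { send(msg: object): void }, payload: object) {
  const carrier: Record<string, string> = {};
  propagation.inject(context.active(), carrier); // writes `traceparent` (and `tracestate`)
  queue.send({ payload, otel: carrier });
}

// Worker: extract the context and run the handler inside it, so spans
// created there become children of the producer's span.
function handleJob(job: { payload: object; otel: Record<string, string> }) {
  const parentCtx = propagation.extract(context.active(), job.otel);
  return context.with(parentCtx, () => processPayload(job.payload));
}
```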
### Trace Sampling

| Strategy | Use When |
|----------|----------|
| Always On | Low-traffic services, debugging |
| Probabilistic (N%) | General production use |
| Rate-limited (N/sec) | High-throughput services |
| Tail-based | When you need all error traces |
Always sample 100% of error traces regardless of strategy.
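A probabilistic sampler is configured on the SDK as below. Note that "100% of error traces" cannot be guaranteed by a head-based sampler, since errors are unknown at span start; that part needs tail-based sampling (e.g. the OTel Collector's `tail_sampling` processor):

```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import {
  ParentBasedSampler,
  TraceIdRatioBasedSampler,
} from '@opentelemetry/sdk-trace-base';

// Sample 10% of new traces; respect the caller's decision for propagated ones.
const sdk = new NodeSDK({
  serviceName: 'order-service',
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1),
  }),
});
```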
---
## Metrics Collection

### RED Method (Request-Driven)

Monitor these three for every service endpoint:

| Metric | What It Measures | Prometheus Example |
|--------|------------------|--------------------|
| Rate | Requests/sec | `rate(http_requests_total[5m])` |
| Errors | Failed request ratio | `rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])` |
| Duration | Response time | `histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))` |
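A sketch of exposing RED metrics from an Express service with `prom-client` (the metric and label names follow common convention but are not mandated):

```typescript
import express from 'express';
import client from 'prom-client';

const app = express();

// One histogram covers all three RED signals: rate = count of observations,
// errors = observations labelled with a 5xx status, duration = the buckets.
const httpDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
});

app.use((req, res, next) => {
  const end = httpDuration.startTimer({ method: req.method });
  res.on('finish', () => {
    end({ route: req.route?.path ?? req.path, status: String(res.statusCode) });
  });
  next();
});

app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.send(await client.register.metrics());
});
```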
### USE Method (Resource-Driven)

For infrastructure components (CPU, memory, disk, network):

| Metric | What It Measures | Example |
|--------|------------------|---------|
| Utilization | % of time the resource is busy | CPU usage at 78% |
| Saturation | Work queued or waiting | 12 requests queued in thread pool |
| Errors | Error events on the resource | 3 disk I/O errors in the last minute |
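For the Node process itself, `prom-client`'s default metrics cover several USE signals out of the box (a minimal sketch):

```typescript
import client from 'prom-client';

// Registers process_cpu_*, nodejs_heap_*, and nodejs_eventloop_lag_seconds,
// which map onto utilisation and saturation for the Node runtime.
client.collectDefaultMetrics();
```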
---
## Monitoring Stack

| Tool | Category | Best For |
|------|----------|----------|
| Prometheus | Metrics | Pull-based metrics, alerting rules |
| Grafana | Visualisation | Dashboards for metrics, logs, traces |
| Jaeger | Tracing | Distributed trace visualisation |
| Loki | Logs | Log aggregation (pairs with Grafana) |
| OpenTelemetry | Collection | Vendor-neutral telemetry collection |
Recommendation: Start with OTel Collector → Prometheus + Grafana + Loki + Jaeger. Migrate to a SaaS offering only when the operational overhead of self-hosting justifies the cost.
---
## Alert Design

### Severity Levels

| Severity | Response Time | Example |
|----------|---------------|---------|
| P1 | Immediate | Service fully down, data loss |
| P2 | < 30 min | Error rate > 5%, latency p99 > 5s |
| P3 | Business hours | Disk > 80%, cert expiring in 7 days |
| P4 | Best effort | Non-critical deprecation warning |
### Alert Fatigue Prevention
- Alert on symptoms, not causes — "error rate > 5%" not "pod restarted"
- Multi-window, multi-burn-rate — catch both sudden spikes and slow burns
- Require runbook links — every alert must link to diagnosis and remediation
- Review monthly — delete or tune alerts that never fire or always fire
- Group related alerts — use inhibition rules to suppress child alerts
- Set appropriate thresholds — if alert fires daily and is ignored, raise threshold or delete
---
## Dashboard Patterns

### Overview Dashboard ("War Room")

- Total requests/sec across all services
- Global error rate (%) with trendline
- p50 / p95 / p99 latency
- Active alert count by severity
- Deployment markers overlaid on graphs

### Service Dashboard (Per-Service)

- RED metrics for each endpoint
- Dependency health (upstream/downstream success rates)
- Resource utilisation (CPU, memory, connections)
- Top errors table with count and last seen
---
## Observability Checklist
Every service must have:
- [ ] Structured JSON logging with consistent schema
- [ ] Correlation / trace IDs propagated on all requests
- [ ] RED metrics exposed for every external endpoint
- [ ] Health check endpoints (`/healthz` and `/readyz`; see the sketch after this list)
- [ ] Distributed tracing with OpenTelemetry
- [ ] Dashboards for RED metrics and resource utilisation
- [ ] Alerts for error rate, latency, and saturation with runbook links
- [ ] Log level configurable at runtime without redeployment
- [ ] PII scrubbing verified and tested
- [ ] Retention policies defined for logs, metrics, and traces
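A minimal sketch of the two health endpoints in Express; `checkDatabase` is a placeholder for whatever readiness actually depends on:

```typescript
// Liveness: the process is up and the event loop is responsive.
app.get('/healthz', (_req, res) => res.status(200).send('ok'));

// Readiness: the service can do useful work (dependencies reachable).
app.get('/readyz', async (_req, res) => {
  try {
    await checkDatabase(); // placeholder: probe your real dependencies
    res.status(200).send('ready');
  } catch {
    res.status(503).send('not ready');
  }
});
```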
## Anti-Patterns

| Anti-Pattern | Problem | Fix |
|--------------|---------|-----|
| Logging PII | Privacy/compliance violation | Mask or exclude PII; use token references |
| Excessive logging | Storage costs balloon, signal drowns | Log business events, not data flow |
| Unstructured logs | Cannot query or alert on fields | Use structured JSON with a consistent schema |
| String interpolation | Breaks structured fields, injection risk | Pass fields as metadata, not in the message |
| Missing correlation IDs | Cannot trace across services | Generate and propagate `trace_id` everywhere |
| Alert storms | On-call fatigue, real issues buried | Use grouping, inhibition, deduplication |
| High-cardinality metrics | Prometheus OOM, dashboard timeouts | Never use user or request IDs as labels |
## NEVER Do
- NEVER log passwords, tokens, API keys, or secrets — even at DEBUG level (a redaction sketch follows this list)
- NEVER use console.log / print in production — use a structured logger
- NEVER use user IDs, emails, or request IDs as metric labels — cardinality will explode
- NEVER create alerts without a runbook link — unactionable alerts erode trust
- NEVER rely on logs alone — you need metrics and traces for full observability
- NEVER log request/response bodies by default — opt-in only, with PII redaction
- NEVER ignore log volume — set budgets and alert when a service exceeds daily quota
- NEVER skip context propagation in async flows — broken traces are worse than no traces
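One way to enforce scrubbing at the logger itself, assuming Pino (the redact paths are examples; enumerate every sensitive field your payloads can actually carry):

```typescript
import pino from 'pino';

const logger = pino({
  redact: {
    paths: [
      'password',
      'token',
      'apiKey',
      'req.headers.authorization',
      'user.email',
      '*.password', // one level deep under any top-level key
    ],
    censor: '[REDACTED]',
  },
});

logger.info({ user: { email: 'a@example.com' }, token: 'abc123' }, 'login');
// => ..."user":{"email":"[REDACTED]"},"token":"[REDACTED]","msg":"login"...
```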
## Use Cases
- Implement structured logging patterns for distributed systems
- Set up distributed tracing with OpenTelemetry for microservices
- Configure metrics collection for system observability dashboards
- Build observable systems with correlated logs, traces, and metrics
- Design logging infrastructure for debugging production issues