---
description: Structured logging, distributed tracing, and metrics collection patterns for building observable systems. Use when implementing logging infrastructure, setting up distributed tracing with OpenTelemetry, designing metrics collection (RED/USE methods), configuring alerting and dashboards, or reviewing observability practices. Covers structured JSON logging, context propagation, trace sampling, Prometheus/Grafana stack, alert design, and PII/secret scrubbing.
---
# Logging & Observability
Patterns for building observable systems across the three pillars: logs, metrics, and traces.
## Three Pillars
| Pillar | Purpose | Question It Answers | Example |
|--------|---------|---------------------|---------|
| Logs | What happened | Why did this request fail? | `{"level":"error","msg":"payment declined","user_id":"u_82"}` |
| Metrics | How much / how fast | Is latency increasing? | `http_request_duration_seconds{route="/api/orders"} 0.342` |
| Traces | Request flow | Where is the bottleneck? | Span: `api-gateway → auth → order-service → db` |
Each pillar is strongest when correlated. Embed `trace_id` in every log line to jump from a log entry to the full distributed trace.
---
## Structured Logging
Always emit logs as structured JSON — never free-text strings.
### Required Fields
| Field | Purpose | Required |
|-------|---------|----------|
| `timestamp` | ISO-8601 with milliseconds | Yes |
| `level` | Severity (DEBUG … FATAL) | Yes |
| `service` | Originating service name | Yes |
| `message` | Human-readable description | Yes |
| `trace_id` | Distributed trace correlation | Yes |
| `span_id` | Current span within trace | Yes |
| `correlation_id` | Business-level correlation (order ID) | When applicable |
| `error` | Structured error object | On errors |
| `context` | Request-specific metadata | Recommended |
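For illustration, one log entry conforming to this schema (all values are hypothetical):

```json
{
  "timestamp": "2025-01-15T09:21:07.342Z",
  "level": "error",
  "service": "order-service",
  "message": "payment charge failed",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "correlation_id": "ORD-1234",
  "error": { "type": "CardDeclinedError", "code": "CARD_DECLINED" },
  "context": { "user_id": "u_82", "route": "/api/orders" }
}
```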
### Context Enrichment
Attach context at the middleware level so downstream logs inherit automatically:
```typescript
import { AsyncLocalStorage } from 'node:async_hooks';
import crypto from 'node:crypto';

// Shared store: everything invoked downstream of the middleware can read it.
const asyncLocalStorage = new AsyncLocalStorage<Record<string, unknown>>();

app.use((req, res, next) => {
  const ctx = {
    trace_id: req.headers['x-trace-id'] || crypto.randomUUID(),
    request_id: crypto.randomUUID(),
    user_id: req.user?.id,
    method: req.method,
    path: req.path,
  };
  // All handlers called from next() inherit ctx automatically.
  asyncLocalStorage.run(ctx, () => next());
});
```
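With Pino, the stored context can then be merged into every log line via the `mixin` option; a minimal sketch, assuming the `asyncLocalStorage` instance from the middleware above:

```typescript
import pino from 'pino';

const logger = pino({
  // mixin() runs on every log call; its return value is merged into the line.
  mixin() {
    return asyncLocalStorage.getStore() ?? {};
  },
});

logger.info('inventory reserved');
// Emits trace_id, request_id, user_id, etc. without passing them explicitly.
```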
### Library Recommendations
| Library | Language | Strengths | Perf |
|---------|----------|-----------|------|
| Pino | Node.js | Fastest Node logger, low overhead | Excellent |
| structlog | Python | Composable processors, context binding | Good |
| zerolog | Go | Zero-allocation JSON logging | Excellent |
| zap | Go | High performance, typed fields | Excellent |
| tracing | Rust | Spans + events, async-aware | Excellent |
Choose a logger that outputs structured JSON natively. Avoid loggers requiring post-processing.
---
## Log Levels
| Level | When to Use | Example |
|-------|-------------|---------|
| FATAL | App cannot continue, process will exit | Database connection pool exhausted |
| ERROR | Operation failed, needs attention | Payment charge failed: CARD_DECLINED |
| WARN | Unexpected but recoverable | Retry 2/3 for upstream timeout |
| INFO | Normal business events | Order ORD-1234 placed successfully |
| DEBUG | Developer troubleshooting | Cache miss for key user:82:preferences |
| TRACE | Very fine-grained (rarely in prod) | Entering validateAddress with payload |
Rules:

- Production default is INFO and above.
- If you log an ERROR, someone should act on it.
- Every FATAL should trigger an alert.
---
## Distributed Tracing
### OpenTelemetry Setup
Always prefer OpenTelemetry over vendor-specific SDKs:
```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  serviceName: 'order-service',
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```
### Span Creation
```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('order-service');

async function processOrder(order: Order) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    try {
      span.setAttribute('order.id', order.id);
      span.setAttribute('order.total_cents', order.totalCents);
      await validateInventory(order);
      await chargePayment(order);
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (err) {
      // Record the failure on the span, then rethrow for the caller.
      span.setStatus({ code: SpanStatusCode.ERROR, message: (err as Error).message });
      span.recordException(err as Error);
      throw err;
    } finally {
      span.end();
    }
  });
}
```
### Context Propagation
- Use W3C Trace Context (`traceparent` header) — default in OTel
- Propagate across HTTP, gRPC, and message queues
- For async workers: serialise `traceparent` into the job payload
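For the async-worker case, a minimal sketch using the OpenTelemetry propagation API; the queue shape and function names here are hypothetical:

```typescript
import { context, propagation, trace } from '@opentelemetry/api';

// Producer: capture the active trace context into the job payload.
function enqueueJob(queue: { push(job: unknown): void }, payload: object) {
  const carrier: Record<string, string> = {};
  propagation.inject(context.active(), carrier); // writes `traceparent` / `tracestate`
  queue.push({ payload, otel: carrier });
}

// Worker: restore the context so the job's span joins the original trace.
async function handleJob(job: { payload: object; otel: Record<string, string> }) {
  const parent = propagation.extract(context.active(), job.otel);
  await context.with(parent, () =>
    trace.getTracer('worker').startActiveSpan('handleJob', async (span) => {
      try {
        // ... process job.payload ...
      } finally {
        span.end();
      }
    }),
  );
}
```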
### Trace Sampling
| Strategy | Use When |
|----------|----------|
| Always On | Low-traffic services, debugging |
| Probabilistic (N%) | General production use |
| Rate-limited (N/sec) | High-throughput services |
| Tail-based | When you need all error traces |
Always sample 100% of error traces regardless of strategy.
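As a sketch, probabilistic head sampling with the Node SDK looks roughly like this (the 10% ratio is illustrative). Note that guaranteeing 100% of error traces requires tail-based sampling in the Collector, because head-based samplers decide before the outcome is known:

```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import {
  ParentBasedSampler,
  TraceIdRatioBasedSampler,
} from '@opentelemetry/sdk-trace-base';

const sdk = new NodeSDK({
  serviceName: 'order-service',
  // Sample ~10% of new root traces; honour the caller's decision otherwise,
  // so a trace is never half-sampled across services.
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1),
  }),
});

sdk.start();
```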
---
## Metrics Collection
### RED Method (Request-Driven)
Monitor these three for every service endpoint:
| Metric | What It Measures | Prometheus Example |
|--------|------------------|--------------------|
| Rate | Requests/sec | `rate(http_requests_total[5m])` |
| Errors | Failed request ratio | `rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])` |
| Duration | Response time | `histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))` |
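As a sketch of how these series might be produced, a Node service using `prom-client` with Express (bucket boundaries and label names are illustrative):

```typescript
import express from 'express';
import client from 'prom-client';

const app = express();

// One histogram yields the full RED set per route: Rate (count),
// Errors (count by status), and Duration (buckets).
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
});

app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  res.on('finish', () => {
    // Use the route pattern, never the raw URL, to keep cardinality bounded.
    end({ method: req.method, route: req.route?.path ?? req.path, status: res.statusCode });
  });
  next();
});

app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.send(await client.register.metrics());
});
```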
### USE Method (Resource-Driven)
For infrastructure components (CPU, memory, disk, network):
| Metric | What It Measures | Example |
|--------|------------------|---------|
| Utilization | % resource busy | CPU usage at 78% |
| Saturation | Work queued/waiting | 12 requests queued in thread pool |
| Errors | Error events on resource | 3 disk I/O errors in last minute |
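As illustration, USE queries for CPU and network might look like this; the metric names assume node_exporter and should be adapted to your exporter:

```promql
# Utilization: share of CPU time spent non-idle over 5m
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Saturation: 1-minute load average relative to core count (>1 means queued work)
node_load1 / count by (instance) (node_cpu_seconds_total{mode="idle"})

# Errors: NIC receive errors per second
rate(node_network_receive_errs_total[5m])
```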
---
## Monitoring Stack
| Tool | Category | Best For |
|------|----------|----------|
| Prometheus | Metrics | Pull-based metrics, alerting rules |
| Grafana | Visualisation | Dashboards for metrics, logs, traces |
| Jaeger | Tracing | Distributed trace visualisation |
| Loki | Logs | Log aggregation (pairs with Grafana) |
| OpenTelemetry | Collection | Vendor-neutral telemetry collection |
Recommendation: Start with OTel Collector → Prometheus + Grafana + Loki + Jaeger. Migrate to SaaS only when operational overhead justifies cost.
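A minimal Collector pipeline wiring that stack might look like the sketch below; exporter names and endpoints vary by Collector distribution and by the Loki/Jaeger versions in use (recent versions of both accept OTLP natively):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889           # scraped by Prometheus
  otlphttp/jaeger:
    endpoint: http://jaeger:4318     # Jaeger's native OTLP ingest
  otlphttp/loki:
    endpoint: http://loki:3100/otlp  # Loki 3.x native OTLP ingest

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp/jaeger]
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      exporters: [otlphttp/loki]
```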
---
## Alert Design
### Severity Levels
| Severity | Response Time | Example |
|----------|---------------|---------|
| P1 | Immediate | Service fully down, data loss |
| P2 | < 30 min | Error rate > 5%, latency p99 > 5s |
| P3 | Business hours | Disk > 80%, cert expiring in 7 days |
| P4 | Best effort | Non-critical deprecation warning |
### Alert Fatigue Prevention
- Alert on symptoms, not causes — "error rate > 5%" not "pod restarted"
- Multi-window, multi-burn-rate — catch both sudden spikes and slow burns (see the rule sketch after this list)
- Require runbook links — every alert must link to diagnosis and remediation
- Review monthly — delete or tune alerts that never fire or always fire
- Group related alerts — use inhibition rules to suppress child alerts
- Set appropriate thresholds — if alert fires daily and is ignored, raise threshold or delete
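As a sketch, a multi-window burn-rate alert in Prometheus alerting-rule form; the 99.9% SLO, the 14.4 burn factor, and the runbook URL are illustrative assumptions:

```yaml
groups:
  - name: slo-burn-rate
    rules:
      - alert: HighErrorBudgetBurn
        # Fast burn: consuming the 30-day error budget 14.4x too fast
        # (for a 99.9% SLO, threshold = 14.4 * 0.001). Requiring both a
        # long (1h) and a short (5m) window prevents flapping.
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
              / sum(rate(http_requests_total[1h])) > 14.4 * 0.001
          )
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) > 14.4 * 0.001
          )
        labels:
          severity: P2
        annotations:
          runbook: https://runbooks.example.com/high-error-burn  # hypothetical
```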
---
## Dashboard Patterns
### Overview Dashboard ("War Room")

- Total requests/sec across all services
- Global error rate (%) with trendline
- p50 / p95 / p99 latency
- Active alerts count by severity
- Deployment markers overlaid on graphs
### Service Dashboard (Per-Service)

- RED metrics for each endpoint
- Dependency health (upstream/downstream success rates)
- Resource utilisation (CPU, memory, connections)
- Top errors table with count and last seen
---
## Observability Checklist
Every service must have:
- [ ] Structured JSON logging with consistent schema
- [ ] Correlation / trace IDs propagated on all requests
- [ ] RED metrics exposed for every external endpoint
- [ ] Health check endpoints (`/healthz` and `/readyz`)
- [ ] Distributed tracing with OpenTelemetry
- [ ] Dashboards for RED metrics and resource utilisation
- [ ] Alerts for error rate, latency, and saturation with runbook links
- [ ] Log level configurable at runtime without redeployment
- [ ] PII scrubbing verified and tested (see the redaction sketch below)
- [ ] Retention policies defined for logs, metrics, and traces
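On the PII-scrubbing item, a minimal redaction sketch using Pino's built-in `redact` option; the paths shown are illustrative and must be adapted to your own log schema:

```typescript
import pino from 'pino';

// Redact known-sensitive paths before the line is serialised.
const logger = pino({
  redact: {
    paths: ['req.headers.authorization', 'user.email', '*.password', '*.token'],
    censor: '[REDACTED]',
  },
});

logger.info({ user: { id: 'u_82', email: 'a@b.com' } }, 'login ok');
// Emits the entry with "email" replaced by "[REDACTED]".
```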
## Anti-Patterns
| Anti-Pattern | Problem | Fix |
|--------------|---------|-----|
| Logging PII | Privacy/compliance violation | Mask or exclude PII; use token references |
| Excessive logging | Storage costs balloon, signal drowns | Log business events, not data flow |
| Unstructured logs | Cannot query or alert on fields | Use structured JSON with consistent schema |
| String interpolation | Breaks structured fields, injection risk | Pass fields as metadata, not in message |
| Missing correlation IDs | Cannot trace across services | Generate and propagate trace_id everywhere |
| Alert storms | On-call fatigue, real issues buried | Use grouping, inhibition, deduplication |
| Metrics with high cardinality | Prometheus OOM, dashboard timeouts | Never use user ID or request ID as label |
## NEVER Do
- NEVER log passwords, tokens, API keys, or secrets — even at DEBUG level
- NEVER use console.log / print in production — use a structured logger
- NEVER use user IDs, emails, or request IDs as metric labels — cardinality will explode
- NEVER create alerts without a runbook link — unactionable alerts erode trust
- NEVER rely on logs alone — you need metrics and traces for full observability
- NEVER log request/response bodies by default — opt-in only, with PII redaction
- NEVER ignore log volume — set budgets and alert when a service exceeds daily quota
- NEVER skip context propagation in async flows — broken traces are worse than no traces
## Use Cases
- Implement structured logging patterns for distributed systems
- Set up distributed tracing with OpenTelemetry for microservices
- Configure metrics collection for system observability dashboards
- Build observable systems with correlated logs, traces, and metrics
- Design logging infrastructure for debugging production issues