Logs, metrics, traces, and dashboards for understanding system behavior.
Observability rests on three complementary signal types: logs (discrete events), metrics (aggregated measurements), and traces (request-scoped causal chains). Together they let operators answer 'what is happening, how much, and why' in distributed systems without deploying new code.
Distributed tracing propagates a unique trace ID through every service in a request's path, recording spans (timed operations) that form a causal DAG. It answers the question 'where did the time go?' for any individual request across a microservice architecture.
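The propagation described above can be sketched with a toy in-memory tracer. This is an illustration under stated assumptions, not a real tracing library: the `Tracer` class, the `x-trace-id` and `x-parent-span-id` header names, and the three service names are all hypothetical.

```python
import time
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str             # shared by every span in one request
    span_id: str              # unique per timed operation
    parent_id: Optional[str]  # None for the root span
    name: str
    start: float
    end: float = 0.0

    @property
    def duration(self) -> float:
        return self.end - self.start


class Tracer:
    """Hypothetical in-memory tracer; a real one would export spans to a backend."""

    def __init__(self):
        self.spans = []

    def start_span(self, name, headers=None):
        headers = headers or {}
        # Reuse the incoming trace ID if present, otherwise start a new trace.
        trace_id = headers.get("x-trace-id") or uuid.uuid4().hex
        parent_id = headers.get("x-parent-span-id")
        span = Span(trace_id, uuid.uuid4().hex[:16], parent_id, name, time.perf_counter())
        self.spans.append(span)
        return span

    def finish(self, span):
        span.end = time.perf_counter()

    @staticmethod
    def inject(span):
        # Headers a service would attach to its outgoing call (assumed names).
        return {"x-trace-id": span.trace_id, "x-parent-span-id": span.span_id}


# Simulate one request crossing three services.
tracer = Tracer()
frontend = tracer.start_span("frontend")
checkout = tracer.start_span("checkout", tracer.inject(frontend))
db = tracer.start_span("db.query", tracer.inject(checkout))
tracer.finish(db)
tracer.finish(checkout)
tracer.finish(frontend)

for s in tracer.spans:
    print(s.name, "parent:", s.parent_id, "duration:", s.duration)
```

Because every span carries the same trace ID and a pointer to its parent, a backend can reassemble the spans into the request's causal structure and show which operation consumed the time.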
Service Level Indicators (SLIs) measure system behavior, Service Level Objectives (SLOs) set targets for those indicators, and Service Level Agreements (SLAs) are contractual commitments with consequences. Together they form the reliability contract between a service and its users.
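One way to make the SLI/SLO relationship concrete is to compute an availability SLI from request counts and compare it with the objective. The sketch below is a minimal illustration; the function names, the 99.9% target, and the request counts are made-up assumptions, and the "error budget" (the failures the SLO still permits) is a common derived quantity rather than something defined in the text above.

```python
def availability_sli(good: int, total: int) -> float:
    """SLI: fraction of requests that succeeded (1.0 when there was no traffic)."""
    return good / total if total else 1.0


def error_budget_remaining(good: int, total: int, slo: float) -> float:
    """Fraction of the allowed failures not yet used up; negative means the SLO is blown."""
    allowed_bad = (1.0 - slo) * total  # failures the SLO tolerates
    actual_bad = total - good
    if allowed_bad == 0:
        return 0.0 if actual_bad else 1.0
    return 1.0 - actual_bad / allowed_bad


# Hypothetical numbers: 1,000,000 requests, 500 failures, 99.9% objective.
SLO = 0.999
good, total = 999_500, 1_000_000

sli = availability_sli(good, total)
remaining = error_budget_remaining(good, total, SLO)
print("SLI:", sli, "meets SLO:", sli >= SLO)
print("error budget remaining:", round(remaining, 3))
```

In this example the SLO allows about 1,000 failed requests, so 500 failures leave roughly half of the budget unspent. An SLA would typically be set looser than the SLO so that internal targets are breached before contractual ones.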
Alerting converts observability signals into actionable notifications. Effective alerting is symptom-based (alert on user-visible impact such as error rate or latency, not on internal causes such as CPU usage), respects severity tiers, and integrates with on-call rotation and incident management to minimize noise and maximize response speed.
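A symptom-based rule with severity tiers might look like the sketch below. It is an assumption-laden illustration: the symptom is the user-facing error ratio, severity is chosen by how fast that ratio would exhaust an SLO's allowance (a "burn rate"), and the `page_at` and `ticket_at` thresholds are arbitrary example values, not recommendations.

```python
from typing import Optional


def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than allowed the service is failing (1.0 = exactly on budget)."""
    return error_ratio / (1.0 - slo)


def classify(error_ratio: float, slo: float,
             page_at: float = 10.0, ticket_at: float = 2.0) -> Optional[str]:
    """Map a user-facing symptom to a severity tier; thresholds are illustrative."""
    rate = burn_rate(error_ratio, slo)
    if rate >= page_at:
        return "page"    # wake the on-call engineer
    if rate >= ticket_at:
        return "ticket"  # handle during working hours
    return None          # within tolerance: stay quiet


SLO = 0.999
for observed in (0.02, 0.003, 0.0005):
    print(observed, "->", classify(observed, SLO))
```

Keying the rule to a user-facing ratio rather than to a resource gauge keeps the alert silent when an internal metric looks alarming but users are unaffected, which is the main source of alert noise.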
Health checks come in three kinds. Liveness probes verify that a process is running and not deadlocked, so that an orchestrator can restart it if it is. Readiness probes verify that a service can handle traffic. Startup probes give slow-starting services time to initialize before the other probes take effect. Together they enable load balancers, orchestrators, and service meshes to route traffic only to healthy instances.
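The difference between the three probes is easiest to see when they are answered from the same service state. The sketch below is a minimal model, not a real HTTP server: the `ServiceState` fields and the `/healthz`, `/startupz`, and `/readyz` paths are assumed names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ServiceState:
    started: bool = False       # initialization (cache warm-up, migrations) finished
    db_connected: bool = False  # a dependency required to serve traffic


def probe(path: str, state: ServiceState) -> int:
    """Return the HTTP status a probe endpoint would answer with."""
    if path == "/healthz":
        # Liveness: being able to answer at all shows the process is not deadlocked.
        return 200
    if path == "/startupz":
        # Startup: fails until initialization completes.
        return 200 if state.started else 503
    if path == "/readyz":
        # Readiness: only accept traffic once started and dependencies are reachable.
        return 200 if state.started and state.db_connected else 503
    return 404


state = ServiceState()
print("booting:", probe("/healthz", state), probe("/startupz", state), probe("/readyz", state))

state.started = True
state.db_connected = True
print("serving:", probe("/healthz", state), probe("/startupz", state), probe("/readyz", state))

state.db_connected = False  # dependency outage: alive but not ready
print("degraded:", probe("/healthz", state), probe("/readyz", state))
```

The degraded case is the important one: the instance is alive, so restarting it would not help, but it is not ready, so traffic should be routed elsewhere until the dependency recovers.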
OpenTelemetry (OTel) is a CNCF project that serves as the vendor-neutral standard for observability instrumentation. It provides a unified API, SDK, and Collector for generating, processing, and exporting logs, metrics, and traces from any instrumented application to any compatible backend.
Log aggregation collects, indexes, and makes searchable the logs from hundreds or thousands of distributed service instances in a centralized platform. It transforms ephemeral per-container stdout into a durable, queryable forensic record.
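The collect, index, and search steps can be sketched with a small in-memory store. This is a toy model under stated assumptions: the `LogStore` class is hypothetical, logs are assumed to arrive as one JSON object per line, and the instance names and log fields are invented for the example. A production platform would add durable storage, retention, and full-text search.

```python
import json
from collections import defaultdict


class LogStore:
    """Hypothetical centralized store with an exact-match index on field values."""

    def __init__(self):
        self.records = []
        self.index = defaultdict(set)  # (field, value) -> positions in self.records

    def ingest(self, instance: str, line: str) -> None:
        record = json.loads(line)
        record["instance"] = instance  # remember which instance emitted the line
        pos = len(self.records)
        self.records.append(record)
        for field, value in record.items():
            if isinstance(value, str):  # index string fields only, for simplicity
                self.index[(field, value)].add(pos)

    def search(self, **filters: str) -> list:
        """Return records matching every field=value filter, in ingestion order."""
        if not filters:
            return list(self.records)
        sets = [self.index.get(item, set()) for item in filters.items()]
        return [self.records[i] for i in sorted(set.intersection(*sets))]


store = LogStore()
store.ingest("api-1", '{"level": "error", "msg": "timeout", "trace_id": "abc"}')
store.ingest("api-2", '{"level": "info", "msg": "ok", "trace_id": "abc"}')
store.ingest("api-2", '{"level": "error", "msg": "timeout", "trace_id": "def"}')

errors = store.search(level="error")
print(len(errors), "error records")
print(store.search(level="error", instance="api-2"))
```

Indexing a shared field such as a trace ID lets an operator pull every line for one request across instances, even after the containers that wrote them are gone.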