Logs, metrics, and traces are the three pillars of observability. Each captures a different dimension of system behavior, and together they give engineers the full picture needed to understand, debug, and improve distributed systems.
Logs are timestamped, immutable records of discrete events that occurred inside a system, such as a user login, an error thrown, or a database query executed. They are emitted by application code and range from unstructured free text to structured records (often JSON). Logs are the most human-readable signal and are invaluable for understanding exactly what happened at a specific point in time. The key trade-off is cost: at high volume, logs are expensive to ingest and store, and slow to query.
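As an illustration of a structured log event, here is a minimal sketch using only Python's standard `logging` module. The logger name `checkout`, the `fields` key passed through `extra=`, and the field names `user_id_hash` and `outcome` are assumptions made up for this example, not a convention from any particular library.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Anything passed as extra={"fields": {...}} becomes an attribute
        # on the record; merge it into the event (hypothetical convention).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


# Usage: emit one discrete event and inspect the resulting JSON line.
buffer = io.StringIO()
logger = build_logger(buffer)
logger.info(
    "user login",
    extra={"fields": {"user_id_hash": "ab12", "outcome": "success"}},
)
print(buffer.getvalue().strip())
```

Emitting one JSON object per line keeps the event human-readable while letting a log backend index individual fields instead of parsing free text.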
Metrics are numerical measurements collected at regular intervals that represent the state or performance of a system over time. Examples include CPU usage, request rate, error rate, and memory consumption. Each data point is compact (a name, a value, a timestamp, and optional labels), which makes metrics cheap to store and fast to query. Metrics power dashboards and alerting because they can be aggregated and graphed efficiently. However, metrics lack context: a spike in error rate tells you that something is wrong, but not why or exactly where.
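To make the shape of a metric data point concrete, the following is a rough sketch of a labeled counter and a simple aggregation over it. The `Sample` and `Counter` classes and the metric name `http_requests_total` are illustrative assumptions, not the API of any real metrics client.

```python
import time
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One data point: a name, labels, a timestamp, and a value."""
    name: str
    labels: tuple  # sorted (key, value) pairs identifying the series
    timestamp: float
    value: float


class Counter:
    """Monotonically increasing counter with one series per label set."""

    def __init__(self, name: str):
        self.name = name
        self._values = defaultdict(float)

    def inc(self, amount: float = 1.0, **labels: str) -> None:
        if amount < 0:
            raise ValueError("counters can only increase")
        self._values[tuple(sorted(labels.items()))] += amount

    def collect(self, now=None) -> list:
        """Snapshot every series as a Sample, as a scraper would."""
        ts = time.time() if now is None else now
        return [Sample(self.name, k, ts, v) for k, v in self._values.items()]


def error_ratio(samples: list) -> float:
    """Share of requests whose status label starts with '5'."""
    total = sum(s.value for s in samples)
    errors = sum(
        s.value for s in samples
        if dict(s.labels).get("status", "").startswith("5")
    )
    return errors / total if total else 0.0


# Usage: 97 successful requests and 3 server errors.
requests_total = Counter("http_requests_total")  # hypothetical metric name
requests_total.inc(97, status="200")
requests_total.inc(3, status="500")
samples = requests_total.collect()
print(error_ratio(samples))
```

The aggregate shows that 3% of requests failed, but nothing in the samples says which request failed or why, which is the missing context described above.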
A trace represents the end-to-end journey of a single request as it flows through a distributed system. It is a tree of individual units called spans. Each span records the operation name, start time, duration, and contextual metadata for one unit of work (e.g., an HTTP call, a DB query, a cache lookup), along with a reference to its parent span. Traces are uniquely powerful for pinpointing latency bottlenecks and understanding causal relationships across microservices. All spans in a trace share one trace ID, which is propagated from service to service via request headers.
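The span tree can be sketched as a small data structure. The example below is a simplified model, not the data format of any tracing system: the `Span` fields, the operation names, and the IDs are invented for illustration, and the self-time calculation assumes a span's direct children run one after another without overlapping.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str              # shared by every span in the trace
    span_id: str               # unique per span
    parent_id: Optional[str]   # None for the root span
    name: str
    start_ms: float
    duration_ms: float
    attributes: dict = field(default_factory=dict)


def self_time(span: Span, spans: list) -> float:
    """Duration minus time spent in direct children.

    Assumes the children do not overlap in time.
    """
    in_children = sum(
        s.duration_ms for s in spans if s.parent_id == span.span_id
    )
    return span.duration_ms - in_children


def slowest_operation(spans: list) -> Span:
    """Return the span that contributes the most time of its own."""
    return max(spans, key=lambda s: self_time(s, spans))


# Usage: one request fanning out to three downstream operations.
TRACE_ID = "4bf92f3577b34da6a3ce929d0e0e4736"  # illustrative value
trace = [
    Span(TRACE_ID, "a1", None, "GET /checkout", 0.0, 120.0),
    Span(TRACE_ID, "b1", "a1", "auth.verify", 2.0, 15.0),
    Span(TRACE_ID, "b2", "a1", "db.query orders", 18.0, 90.0,
         {"db.system": "postgresql"}),
    Span(TRACE_ID, "b3", "a1", "cache.get", 110.0, 5.0),
]
bottleneck = slowest_operation(trace)
print(bottleneck.name, self_time(bottleneck, trace))
```

Here the root span takes 120 ms in total, but only 10 ms of that is its own work; the database query accounts for 90 ms and is the place to look first.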
Metrics alert you that something is wrong (e.g., p99 latency spiked). Traces let you find which service or operation in the call chain is responsible. Logs then provide the granular event-level detail to understand why the failure occurred. Modern observability platforms like Grafana, Datadog, and Honeycomb correlate all three signals — for example, clicking a slow trace links directly to the relevant log lines.
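The correlation step can be sketched in a few lines, assuming every log event carries the `trace_id` of the request that produced it. The function name, the threshold, and the sample data are hypothetical; real platforms do this join inside their own query engines.

```python
def logs_for_slow_traces(trace_durations_ms: dict,
                         log_events: list,
                         threshold_ms: float) -> list:
    """Return log events that belong to traces slower than the threshold."""
    slow_ids = {
        trace_id
        for trace_id, duration in trace_durations_ms.items()
        if duration > threshold_ms
    }
    # Events without a trace_id (e.g. background jobs) never match.
    return [e for e in log_events if e.get("trace_id") in slow_ids]


# Usage: a latency alert fired, so look at traces slower than 500 ms.
trace_durations_ms = {"trace-a": 80.0, "trace-b": 950.0}
log_events = [
    {"trace_id": "trace-a", "level": "INFO", "message": "order created"},
    {"trace_id": "trace-b", "level": "ERROR",
     "message": "payment gateway timeout"},
    {"level": "INFO", "message": "healthcheck ok"},
]
suspects = logs_for_slow_traces(trace_durations_ms, log_events, 500.0)
for event in suspects:
    print(event["trace_id"], event["message"])
```

The join only works if the trace ID is written into every log event at emission time, which is why trace context and structured logging are usually set up together.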
Avoid logging sensitive data (PII, credentials): scrub or redact at the source, not after the fact. For metrics, use consistent naming conventions and limit high-cardinality label values (e.g., never use raw user IDs as labels). Every distinct label combination creates a separate time series, so unbounded values cause a 'cardinality explosion' that can exhaust memory and degrade or crash your metrics backend. For traces, always propagate the trace context (W3C Trace Context or B3 headers) across every service boundary, including async message queues, or you will break the trace chain. As a starting point, use the USE method (Utilization, Saturation, Errors) for resource metrics such as CPU, memory, and disks, and the RED method (Rate, Errors, Duration) for the health of request-driven services.
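Trace context propagation can be sketched with the W3C `traceparent` header, whose version `00` layout is `version-traceid-parentid-flags` in lowercase hex. The helper names below are hypothetical, the parser is simplified to accept only the four-field form, and starting a new trace with the sampled flag `01` is an assumption rather than a requirement. In practice a tracing SDK normally handles this for you.

```python
import re
import secrets
from typing import Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header: str) -> Optional[dict]:
    """Parse a traceparent value; return None if it is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the specification.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {"version": version, "trace_id": trace_id,
            "parent_id": parent_id, "flags": flags}


def child_headers(incoming: dict) -> dict:
    """Build metadata for an outgoing call or queued message.

    Keeps the caller's trace ID and issues a fresh span ID. Assumes the
    incoming keys are already lowercased.
    """
    ctx = parse_traceparent(incoming.get("traceparent", ""))
    if ctx is None:
        # No valid context arrived: start a new trace (assumed sampled).
        trace_id, flags = secrets.token_hex(16), "01"
    else:
        trace_id, flags = ctx["trace_id"], ctx["flags"]
    span_id = secrets.token_hex(8)
    return {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}


# Usage: forward the context from an inbound request to a queue message.
inbound = {
    "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
}
outbound = child_headers(inbound)
print(outbound["traceparent"])
```

The same `child_headers` output can be attached to message metadata when publishing to a queue, so the consumer continues the trace instead of starting a disconnected one.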
© RM Full Stack & AI Engineer