Observability is the ability to understand the internal state of a system by examining the data it produces externally. It goes beyond traditional monitoring by enabling engineers to ask arbitrary questions about system behavior without deploying new code.
More formally, observability is a measure of how well a system's internal state can be inferred from its outputs. The term was coined in control theory by Rudolf Kálmán and later adopted by software engineering to describe how understandable distributed systems are at runtime. A highly observable system lets engineers diagnose novel, unexpected failures, not just the ones they anticipated when writing alerts.
Observability is usually described in terms of three telemetry signals: logs, metrics, and traces. Logs are timestamped, immutable records of discrete events emitted by an application. Metrics are numeric measurements aggregated over time, such as request rates, error counts, and latency percentiles. Distributed traces connect a chain of operations across multiple services into a single end-to-end request timeline, making them essential for pinpointing latency bottlenecks in microservice architectures.
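To make the three signals concrete, here is a minimal sketch of one request emitting a log line, a metric increment, and a trace span. It uses only the Python standard library; the in-memory sinks (`LOGS`, `METRICS`, `SPANS`) and the `handle_checkout` function are hypothetical stand-ins for real backends and application code, not part of any particular tool.

```python
import json
import time
import uuid
from collections import Counter

# Hypothetical in-memory sinks standing in for real log, metric, and trace backends.
LOGS = []
METRICS = Counter()
SPANS = []

def handle_checkout(user_id, trace_id=None):
    """Handle one illustrative request and emit all three signal types."""
    trace_id = trace_id or uuid.uuid4().hex
    start = time.perf_counter()

    # Log: a timestamped record of a discrete event.
    LOGS.append(json.dumps({
        "ts": time.time(),
        "level": "INFO",
        "event": "checkout.started",
        "user_id": user_id,
        "trace_id": trace_id,
    }))

    # Metric: a numeric measurement aggregated over time (here, a simple counter).
    METRICS["checkout.requests"] += 1

    # Trace span: one timed operation, tied to the rest of the request by trace_id.
    SPANS.append({
        "trace_id": trace_id,
        "name": "checkout",
        "duration_ms": (time.perf_counter() - start) * 1000,
    })
    return trace_id

tid = handle_checkout("user-42")
```

Note that the log and the span share the same `trace_id`. Carrying a common identifier across signals is what lets you jump from a slow span to the log lines written during that same request.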
Instrumentation is the foundation: developers embed telemetry collection points into code using libraries like OpenTelemetry, which standardizes how signals are generated and exported. Telemetry data is sent to a backend platform (such as Grafana, Datadog, Honeycomb, or Jaeger) where it can be queried and visualized. What sets observability platforms apart from traditional monitoring tools is support for high-cardinality, high-dimensionality queries, letting you slice data by any combination of attributes like user ID, region, or build version.
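Slicing by an arbitrary combination of attributes can be sketched in a few lines. The example below assumes each request is stored as one wide event with many fields; the `EVENTS` data and the `mean_duration_by` helper are invented for illustration and do not reflect any specific platform's query API.

```python
from collections import defaultdict

# Hypothetical wide events, one per request, as a backend might store them.
EVENTS = [
    {"user_id": "u1", "region": "eu", "browser": "chrome", "build": "1.4.2", "duration_ms": 2300},
    {"user_id": "u2", "region": "eu", "browser": "firefox", "build": "1.4.2", "duration_ms": 180},
    {"user_id": "u3", "region": "us", "browser": "chrome", "build": "1.4.2", "duration_ms": 210},
    {"user_id": "u4", "region": "eu", "browser": "chrome", "build": "1.4.1", "duration_ms": 1900},
]

def mean_duration_by(events, *attributes):
    """Group events by any combination of attributes and average duration_ms."""
    groups = defaultdict(list)
    for event in events:
        key = tuple(event[a] for a in attributes)
        groups[key].append(event["duration_ms"])
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}

# The grouping attributes are chosen at query time, not at instrumentation time.
by_region_browser = mean_duration_by(EVENTS, "region", "browser")
slowest = max(by_region_browser, key=by_region_browser.get)
```

Because the grouping keys are passed in at query time, the same data can be re-sliced by `build`, `user_id`, or any other field without changing how it was collected.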
Monitoring answers predefined questions: 'Is CPU above 90%?' or 'Is the error rate above 1%?' Observability answers open-ended questions: 'Why are only users in the EU on Chrome experiencing slow checkouts?' Monitoring is still valuable and complements observability, but it assumes you know what can go wrong in advance, whereas observability is designed for unknown unknowns.
OpenTelemetry (OTel) is a CNCF project that provides a vendor-neutral SDK, API, and collector for generating and exporting logs, metrics, and traces. It replaces fragmented, vendor-specific agents and lets teams switch observability backends without re-instrumenting their code. Auto-instrumentation libraries for languages like Java, Python, and Node.js can capture telemetry with minimal code changes.
High-cardinality data (fields with millions of unique values, like user IDs or session tokens) is powerful for debugging but expensive in metric-based systems, where each unique combination of label values typically becomes its own time series. Unbounded labels can overload the metrics backend or cause bills to spike unexpectedly. The best practice is to store high-cardinality data in trace and log backends, which are built for it, and keep metric labels low-cardinality. In high-traffic systems, set a sampling strategy for traces to control data volume without losing diagnostic value.
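Both points can be illustrated with a rough sketch. `series_count` estimates the worst-case number of time series for one metric as the product of its label cardinalities, and `should_sample` shows one common approach to trace sampling: a deterministic decision derived from the trace ID. Both function names, the label sets, and the 10% rate are assumptions chosen for illustration; real backends and SDKs have their own sampler implementations.

```python
import hashlib
from math import prod

def series_count(label_cardinalities):
    """Worst-case time series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())

def should_sample(trace_id, rate):
    """Deterministically keep roughly `rate` of traces, based on a hash of the trace ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < rate

# Low-cardinality labels keep the series count small.
low = series_count({"region": 5, "status": 4, "method": 3})

# Swapping one label for a user ID multiplies the count by its cardinality.
high = series_count({"region": 5, "status": 4, "user_id": 1_000_000})

# Keep about 10% of traces; the same trace ID always gets the same decision.
kept = sum(should_sample(f"trace-{i}", 0.10) for i in range(10_000))
```

Here `low` is 60 series while `high` is 20,000,000, which is why user IDs belong on spans and log records rather than metric labels. Hashing the trace ID, instead of calling a random number generator, means every service that sees the same trace makes the same keep-or-drop decision, so sampled traces stay complete.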