Observability Explained

Observability is the ability to understand the internal state of a system by examining the data it produces externally. It goes beyond traditional monitoring by enabling engineers to ask arbitrary questions about system behavior without deploying new code.

What Is Observability?

Observability is a property of a system: the degree to which its internal state can be inferred from its outputs. The term comes from control theory, where Rudolf Kálmán introduced it, and software engineering adopted it to describe how well distributed systems can be understood at runtime. A highly observable system lets engineers diagnose novel, unexpected failures, not just the ones they anticipated when writing alerts.

The Three Pillars: Logs, Metrics, and Traces

Logs are timestamped, immutable records of discrete events emitted by an application. Metrics are numeric measurements aggregated over time, such as request rates, error counts, and latency percentiles. Distributed traces connect a chain of operations across multiple services into a single end-to-end request timeline, making them essential for pinpointing latency bottlenecks in microservice architectures.
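
To make the three signal types concrete, here is a minimal sketch that uses only the Python standard library. The in-memory sinks (LOGS, METRICS, SPANS) and the helper names are hypothetical stand-ins for real backends, not part of any observability library.

```python
import json
import time
import uuid
from collections import Counter

# Hypothetical in-memory sinks standing in for real log, metric, and trace backends.
LOGS = []
METRICS = Counter()
SPANS = []


def log_event(message, **fields):
    """Log: a timestamped record of one discrete event."""
    record = {"ts": time.time(), "message": message, **fields}
    LOGS.append(json.dumps(record))


def count(name, **labels):
    """Metric: a number aggregated over time, keyed by metric name and labels."""
    key = (name, tuple(sorted(labels.items())))
    METRICS[key] += 1


def record_span(trace_id, name, start, end, parent=None):
    """Trace: one timed operation, tied to related operations by a shared trace ID."""
    SPANS.append({
        "trace_id": trace_id,
        "name": name,
        "parent": parent,
        "duration_ms": (end - start) * 1000,
    })


def handle_checkout(user_id):
    """Emit all three signals for one (simulated) checkout request."""
    trace_id = uuid.uuid4().hex
    start = time.perf_counter()

    # Child operation; in a real system this might run in another service.
    pay_start = time.perf_counter()
    pay_end = time.perf_counter()
    record_span(trace_id, "charge_card", pay_start, pay_end, parent="checkout")

    record_span(trace_id, "checkout", start, time.perf_counter())
    count("http_requests_total", route="/checkout", status="200")
    log_event("checkout completed", user_id=user_id, trace_id=trace_id)
    return trace_id
```

Calling handle_checkout("u-1") appends one log line, increments one counter, and records two spans that share a trace ID. Putting the trace ID in the log record is what lets you jump from a log line to the matching trace.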

How Observability Works in Practice

Instrumentation is the foundation: developers embed telemetry collection points into code using libraries like OpenTelemetry, which standardizes how signals are generated and exported. Telemetry data is sent to a backend platform, such as Grafana, Datadog, Honeycomb, or Jaeger, where it can be queried and visualized. What sets observability platforms apart from traditional monitoring tools is support for high-cardinality, high-dimensionality queries, letting you slice data by any combination of attributes like user ID, region, or build version.
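
As a rough illustration of what an instrumentation point does, the sketch below wraps an operation in a context manager that records its attributes, outcome, and duration as a single wide event. It uses only the standard library. The names instrumented, EVENTS, and checkout are hypothetical and this is not the OpenTelemetry API, although a real SDK plays a similar role and also handles export to a backend.

```python
import time
from contextlib import contextmanager

# Hypothetical stand-in for an exporter that would ship events to a backend.
EVENTS = []


@contextmanager
def instrumented(operation, **attributes):
    """Record one wide event per operation: attributes, status, and duration."""
    event = {"operation": operation, **attributes}
    start = time.perf_counter()
    try:
        yield event  # the caller may attach more attributes while it runs
        event["status"] = "ok"
    except Exception as exc:
        event["status"] = "error"
        event["error_type"] = type(exc).__name__
        raise
    finally:
        event["duration_ms"] = (time.perf_counter() - start) * 1000
        EVENTS.append(event)


def checkout(user_id, region, build):
    # High-cardinality attributes such as user_id are attached to the event itself.
    with instrumented("checkout", user_id=user_id, region=region, build=build) as event:
        event["cart_items"] = 3  # placeholder for real business logic
```

Because every event carries all of its attributes, a backend can later filter or group by any of them without the code needing to know in advance which question will be asked.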

Observability vs. Monitoring

Monitoring answers predefined questions: 'Is CPU above 90%?' or 'Is the error rate above 1%?' Observability answers open-ended questions: 'Why are only users in the EU on Chrome experiencing slow checkouts?' Monitoring is still valuable and complements observability, but it assumes you know what can go wrong in advance, whereas observability is designed for unknown unknowns.
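
The difference can be sketched in a few lines. The event list below is made-up sample data, and both function names are hypothetical. The first function is a predefined threshold check of the kind monitoring relies on. The second is an ad-hoc breakdown by whichever dimensions you choose at query time.

```python
from collections import defaultdict
from statistics import median

# Made-up sample events, for illustration only.
EVENTS = [
    {"region": "EU", "browser": "chrome", "status": "ok", "duration_ms": 2400},
    {"region": "EU", "browser": "chrome", "status": "ok", "duration_ms": 2600},
    {"region": "EU", "browser": "firefox", "status": "ok", "duration_ms": 300},
    {"region": "EU", "browser": "firefox", "status": "ok", "duration_ms": 320},
    {"region": "US", "browser": "chrome", "status": "ok", "duration_ms": 310},
    {"region": "US", "browser": "chrome", "status": "ok", "duration_ms": 290},
    {"region": "US", "browser": "firefox", "status": "ok", "duration_ms": 305},
    {"region": "US", "browser": "firefox", "status": "ok", "duration_ms": 295},
]


def error_rate_alert(events, threshold=0.01):
    """Monitoring: a question fixed in advance ('is the error rate above 1%?')."""
    errors = sum(1 for e in events if e["status"] == "error")
    return errors / len(events) > threshold


def median_by(events, field, *dimensions):
    """Observability: group by any combination of attributes chosen at query time."""
    groups = defaultdict(list)
    for e in events:
        groups[tuple(e[d] for d in dimensions)].append(e[field])
    return {key: median(values) for key, values in groups.items()}


breakdown = median_by(EVENTS, "duration_ms", "region", "browser")
slowest = max(breakdown, key=breakdown.get)
```

On this sample data the error-rate alert stays quiet, because no request failed, while the breakdown singles out the ("EU", "chrome") group as the slow one. That is the kind of problem a predefined alert would miss.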

OpenTelemetry: The Industry Standard

OpenTelemetry (OTel) is a CNCF project that provides a vendor-neutral SDK, API, and collector for generating and exporting logs, metrics, and traces. It replaces fragmented, vendor-specific agents and lets teams switch observability backends without re-instrumenting their code. Auto-instrumentation libraries for languages like Java, Python, and Node.js can capture telemetry with minimal code changes.

Key Gotcha: Cardinality and Cost

High-cardinality data (fields with millions of unique values, such as user IDs or session tokens) is extremely powerful for debugging but can be very expensive in metric-based systems, where each unique combination of label values typically becomes its own time series. Unbounded labels can overload the metrics backend or cause bills to spike unexpectedly. The best practice is to store high-cardinality data in trace and log backends, which are built for it, and keep metric labels low-cardinality. In high-traffic systems, set a sampling strategy for traces to control data volume without losing diagnostic value.
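
Two small calculations make this concrete. The sketch below assumes a metrics backend that stores one time series per unique combination of label values, which is common but not universal. It also shows one simple sampling approach, deterministic sampling keyed on the trace ID. The function names and label counts are hypothetical.

```python
import hashlib
from math import prod


def series_count(label_cardinalities):
    """Worst-case time series for one metric: the product of unique values per label."""
    return prod(label_cardinalities.values())


def keep_trace(trace_id, sample_rate):
    """Deterministic sampling: hash the trace ID and keep a fixed fraction of traces.

    Hashing the ID rather than drawing a random number means every service that
    sees the same trace ID makes the same keep-or-drop decision.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform integer in [0, 2**64)
    return bucket < int(sample_rate * 2**64)


# Hypothetical label sets for a single request-count metric.
low_cardinality = {"route": 20, "status": 5, "region": 4}
with_user_id = {**low_cardinality, "user_id": 1_000_000}

print(series_count(low_cardinality))  # 400
print(series_count(with_user_id))     # 400000000
```

Adding a single user_id label multiplies the series count by the number of users, which is why that attribute belongs on traces and logs instead. The sampler keeps roughly sample_rate of all traces. Real systems often layer extra rules on top, such as always keeping traces that contain errors.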
