Over recent years, the software and cloud engineering ecosystem has moved away from monoliths and into the distributed world. Engineers realized that small, specialized services (known as microservices) could scale better and provide the decoupling needed for more effective maintenance, upgrades, and scaling.
However, developers and engineers soon realized that the number of wires necessary to connect a distributed environment could multiply alarmingly quickly.
Tooling soon emerged, promising to simplify, organize, and automate the process (e.g., Kubernetes). We were ushered into a new era of previously unheard-of solutions on a global scale — from content delivery to ordering a ride and watching it arrive in real time.
This blog will explore observability, discussing practical solutions, metrics, and various implementation challenges.
What we’ll cover:
- Observability vs. monitoring: key distinctions
- Three pillars of observability
- Implementing observability with OpenTelemetry
- Types of observability metrics monitored
- Challenges and considerations when implementing observability metrics
What is observability?
Observability is the ability to measure a system’s internal state based on the data it produces, typically through logs, metrics, and traces. It enables teams to detect, investigate, and resolve issues in complex, distributed environments.
In DevOps and cloud-native systems, observability goes beyond simple monitoring by providing contextual insights into why a system behaves a certain way, not just that something went wrong. Logs provide detailed event records, metrics offer quantifiable data over time, and traces show the flow of requests across services.
Observability is a cross-cutting concern that permeates every piece of code or infrastructure. This means it should not interfere with business logic or affect how infrastructure is architected. Yet, it must function as an omnipresent, all-seeing eye, “spying” in real time on what happens within your distributed architecture. We’ll see how this is implemented in practice in a later section.
The software industry has been working feverishly to develop tools and practices to simplify the operation of complex systems and prevent disasters. Automated CI/CD pipelines guard against broken code, infrastructure-as-code (IaC) pipelines set up repeatable, predictable environments, and self-healing mechanisms in orchestrators help maintain workload availability.
Monitoring is a practice that involves defining and tracking various metrics within a system to ensure it operates within its known parameters. It’s very useful and, together with automated alerts, can help engineers understand what goes wrong and when. Anomalies and patterns in memory usage, CPU time, response time, and overall latency can be early indicators of trouble.
However, monitoring has its limits: the metrics must be predefined, and they treat the system as a black box. This is why the industry focus is expanding to include not only the what and the when but also the why and the how, recognizing monitoring as a subset of observability.
Observability and monitoring are two related practices with subtle differences that can increase confidence in systems and allow for a proactive approach to systemic failures. Although unexpected downtime can have serious consequences, observability is often still treated as an afterthought.
Read more: Observability vs Monitoring: Key Differences Explained
Observability relies on correlating metadata from all the services in your network and presenting it in a palatable, interactive fashion so that engineers and DevOps can observe, investigate, and understand the behavior of the system even when it’s running nominally.
So, what is that metadata? And how do we correlate it? Three observability pillars are referenced consistently throughout the literature:
- Metrics
- Logs
- Traces
Some others, such as events and profiles, are referenced sporadically but can, in theory, add even more context and observability to a system.
1. Metrics
Observability metrics are quantitative data points collected from systems to monitor their performance, health, and behavior. They help engineers understand how a system operates over time and identify anomalies or failures.
Metrics can be defined at various levels, including:
- Container level (CPU, disk, and memory usage)
- Deployment level
- Application level (response time and error rates)
- Service level (availability and service status)
- and others
These metrics typically include CPU usage, memory consumption, disk I/O, network traffic, request rates, error rates, and latency.
Metrics are time-series data optimized for storage and querying over time and are usually aggregated and visualized in monitoring tools like Prometheus, Grafana, or Datadog. Unlike logs and traces, which provide detailed or contextual information, metrics offer fast, high-level insights ideal for alerting and trend analysis.
In a typical setup, three microservices each expose a /metrics endpoint, Prometheus scrapes them and forwards the samples to remote storage, and a plotting tool such as Grafana visualizes the resulting diagrams and metrics.
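As a rough sketch of the scrape side of such a setup, a minimal Prometheus configuration could look like the following (the service names, ports, and remote-storage URL are placeholders):

```yaml
# prometheus.yml (sketch): scrape three services and forward samples to remote storage
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "microservices"
    metrics_path: /metrics
    static_configs:
      - targets:
          - "orders-svc:8080"
          - "payments-svc:8080"
          - "users-svc:8080"

# Optional: ship samples to long-term remote storage that Grafana can query
remote_write:
  - url: "http://remote-storage:9009/api/v1/push"
```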
2. Logs
Logs are detailed, time-stamped records of specific system events. They can be classified into levels (e.g., debug logs enabled during development, warnings and errors always enabled, and info logs recording useful runtime or audit information).
Logs are a great way to observe a system and its health. However, logging for distributed architectures can quickly become overwhelming as a request typically travels through multiple services before a response or action can be performed.
Furthermore, because logs are normally emitted from all sorts of unrelated components and there’s no agreed-upon structure, logs aren’t natively correlated — advances in observability aim to change that.
In a typical pipeline, three services write their logs to files, and Filebeat, a lightweight log shipper, collects, parses, and forwards the log data to an indexed search engine such as Elasticsearch.
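A minimal Filebeat configuration for that flow might look like the sketch below (the log paths and the Elasticsearch host are placeholders):

```yaml
# filebeat.yml (sketch): tail service log files and ship them to Elasticsearch
filebeat.inputs:
  - type: filestream
    id: service-logs
    paths:
      - /var/log/orders-svc/*.log
      - /var/log/payments-svc/*.log
      - /var/log/users-svc/*.log

output.elasticsearch:
  hosts: ["http://elasticsearch:9200"]
```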
3. Traces
Traces are signals that follow requests end-to-end as they flow through a distributed system. They are a handy mechanism for tracking a request from service to service, as well as the path it takes inside each service, provided those services are correctly instrumented.
Traces are typically visualized as a waterfall diagram, with each span nested under the operation that triggered it.
The insight gained from correctly applied tracing is immense, but implementing correct tracing end-to-end is also extremely challenging. Some components emit traces, and some do not, which can lead to “broken” or opaque traces with limited usefulness.
Luckily, the industry is converging on open, vendor-agnostic standards for tracing. We have seen instrumentation libraries emerge for almost all modern programming languages, as well as plugins that can “enable tracing” for common services present across different layers (e.g., load balancers, databases, caches, and other services).
Implementing observability with OpenTelemetry
As systems grew, it became clear that a universal, standardized, vendor-agnostic format was needed to express metrics, logs, and traces, allowing them to be directed to the appropriate location and further processed in a streamlined manner.
This led to a massive undertaking called OpenTelemetry. Two earlier projects, OpenTracing and OpenCensus, had been competing for the same role, and their maintainers soon realized that this split was dividing the developer community in a way that was the exact opposite of what both were trying to achieve. The projects merged, and OpenTelemetry was born.
It was accepted into CNCF in 2019 and moved to the incubating maturity level in 2021. OpenTelemetry has since become the industry’s de facto standard and is interoperable with multiple observability backends (Elastic, Datadog, New Relic, Honeycomb, Prometheus, Grafana, etc).
Observability instrumentation
1. Automatic and zero-code instrumentation
Automatic instrumentation provides immediate visibility without code changes. It captures data from HTTP requests, database queries, and external API calls. This approach casts a wide net, automatically collecting telemetry from common libraries and frameworks.
While it offers broad coverage, it lacks the context specific to your business logic. Automatic instrumentation is a great way to get started, but it’s best to move to more advanced instrumentation to customize it according to your needs.
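In Go, for example, the closest you can get without touching business logic is either an eBPF-based agent or thin library instrumentation such as otelhttp, which records a span for every incoming request. A minimal sketch (the handler and route are hypothetical, and it assumes the OTel SDK and exporter are configured elsewhere):

```go
package main

import (
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

func main() {
	// Plain business-logic handler with no telemetry code inside it.
	hello := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("hello"))
	})

	// otelhttp wraps the handler and emits a server span per request,
	// capturing method, status code, and timing automatically.
	http.Handle("/hello", otelhttp.NewHandler(hello, "GET /hello"))

	http.ListenAndServe(":8080", nil)
}
```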
2. Manual and programmatic instrumentation
Manual instrumentation fills the gaps that automatic instrumentation cannot reach. It gives you more control over the implementation, lets you capture custom signals you might otherwise be missing, and lets you remove what you don’t need. This approach requires adding code snippets to generate traces, metrics, and logs.
The key question becomes: What deserves manual instrumentation? Focus on critical business workflows, error conditions, and performance bottlenecks unique to your application.
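As a minimal sketch of what manual instrumentation of such a workflow could look like with the OpenTelemetry Go SDK (the span name, attributes, and payment functions are hypothetical):

```go
package payments

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
)

// ProcessPayment wraps a critical business workflow in a custom span,
// with attributes that automatic instrumentation would not know about.
func ProcessPayment(ctx context.Context, orderID string, amount float64) error {
	tracer := otel.Tracer("payments")

	ctx, span := tracer.Start(ctx, "ProcessPayment")
	defer span.End()

	span.SetAttributes(
		attribute.String("order.id", orderID),
		attribute.Float64("order.amount", amount),
	)

	if err := chargeCard(ctx, orderID, amount); err != nil {
		// Record the failure on the span so it shows up in the trace.
		span.RecordError(err)
		span.SetStatus(codes.Error, "card charge failed")
		return err
	}
	return nil
}

func chargeCard(ctx context.Context, orderID string, amount float64) error {
	// Real payment logic would live here.
	return nil
}
```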
A third approach, programmatic instrumentation, offers a compromise between the automatic and manual approaches. It provides greater control over the instrumentation process while requiring minimal code changes.
This method works particularly well when you need specific instrumentation capabilities that automatic approaches cannot offer.
Data collection
For development and small-scale environments, direct export to backends works well and can provide quick value without additional infrastructure. Production environments benefit from collectors.
Collectors handle retries, batching, encryption, and filtering of sensitive data. They offload data processing from your applications, reducing resource consumption.
The collector acts as a central processing hub. It receives telemetry data through receivers, processes it with processors, and exports it through exporters. This architecture provides flexibility in routing data to multiple destinations.
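A bare-bones collector configuration wiring receivers, processors, and exporters into pipelines might look like this sketch (the backend endpoint is a placeholder):

```yaml
# OpenTelemetry Collector (sketch): OTLP in, batching, OTLP out
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch: {}   # groups telemetry before export to reduce network calls

exporters:
  otlphttp:
    endpoint: "https://observability-backend.example.com:4318"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
```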
As mentioned earlier, observability is a cross-cutting concern, meaning that it will be present everywhere and involve everyone. It’s not something you can do only partially. So, how do we implement it?
Let’s take a look at an implementation guide below:
- Starting from the top, managers should understand the benefits of such an undertaking, respect the effort, and explicitly allocate resources across the team, treating this as an equal-caliber project and not as an afterthought.
- Architects and product managers should map out the core business flows, lay out the enterprise architecture and deployment charts, and pinpoint:
- All custom application code
- All backend services (databases, message brokers, caching layers)
- Third-party dependencies (external or vendor-provided components, such as payment gateways, identity providers, or other APIs)
- Cluster services
- Networking and security layers
- Developers should implement telemetry through the language’s ready-made SDKs or auto-instrumentation libraries and define, through configuration, the OTel endpoint where data will be posted (see the environment-variable sketch after this list).
- DevOps and platform teams should provision the correct exporters and collector services to consume the telemetry data, deploy one or more observability backends to visualize traces and logs, and do the necessary wiring.
Engineers should also enable other relevant plugins, such as the otel module for Nginx or pgotel for PostgreSQL. Similarly, foundational components, such as the Kubernetes clusters themselves, should be configured to emit telemetry.
- Lastly, QA, development, and DevOps teams should familiarize themselves with the rich observability signals and functionality that a successful OpenTelemetry implementation provides, and enhance their workflows by customizing their preferred backend solution according to their needs.
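For the developer-facing configuration mentioned in the list above, most OTel SDKs can be pointed at the collector purely through standard environment variables, for example in a Kubernetes Deployment manifest (the service name, endpoint, and sampling ratio below are placeholders):

```yaml
# Hypothetical environment block from a Deployment manifest
env:
  - name: OTEL_SERVICE_NAME
    value: "hello-api"
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://otel-collector:4317"
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: "deployment.environment=production,service.version=v0.1.0"
  - name: OTEL_TRACES_SAMPLER
    value: "parentbased_traceidratio"
  - name: OTEL_TRACES_SAMPLER_ARG
    value: "0.25"
```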
A well-configured observability setup like the one above might seem like a lot of work, but it almost always pays off. With minimal training and user-friendly observability backends, individuals wearing various hats can explore the data.
QA or management can examine response time trends or failed transaction rates to understand a particular trend.
Developers can try to understand the happy and unhappy paths of a long request flow without separately monitoring the logs of every container. Infrastructure engineers can also monitor the CPU or memory usage of particular pods or the behavior of their auto-scaling.
A support team member can quickly pinpoint what went wrong by starting their investigation from a particular trace-id, which is the value generated for the request and propagated in the traceparent header. This enables the collector and backend to assemble and correlate all the telemetry data for that request and present it in a friendly, easily explorable way.
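For reference, a traceparent header follows the W3C Trace Context layout of version, trace-id, parent span-id, and trace flags. Using the example IDs from the span payloads shown later in this article, it would look like this:

```
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4733-00f067aa0ba902b7-01
```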
Best practices for observability implementation include:
- Start simple – Experiment with automatic, zero-code instrumentation before adding code-based instrumentation. This approach provides immediate value while you identify specific areas needing custom instrumentation.
- Utilize context and attributes – Through attributes, you can add context to your telemetry data. Context helps you understand the circumstances under which metrics are collected. Include environment details, user information, and operation specifics. However, avoid attribute bloat. Excessive attributes increase data volume and complicate analysis.
- Monitor performance impact – You should constantly monitor the performance impact of instrumentation, as OpenTelemetry could potentially add overhead to your applications. Try to use batching to reduce network overhead. Batch processors collect multiple spans before exporting the data. Consider also the cardinality of different measurement types.
Cardinality is the number of unique combinations of attributes. Higher cardinality provides more granular insights, but it also adds complexity and cost. Another area to consider is sampling, as high-throughput systems tend to generate massive amounts of telemetry data, which can be costly to manage.
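As a rough sketch of how batching and head-based sampling are typically wired together with the OpenTelemetry Go SDK (the 10% sampling ratio is an arbitrary example):

```go
package telemetry

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// SetupTracing configures an OTLP exporter with batching and sampling so that
// only a fraction of traces is exported from high-throughput services.
func SetupTracing(ctx context.Context) (*sdktrace.TracerProvider, error) {
	// The endpoint is usually taken from OTEL_EXPORTER_OTLP_ENDPOINT,
	// so no address is hard-coded here.
	exporter, err := otlptracegrpc.New(ctx)
	if err != nil {
		return nil, err
	}

	tp := sdktrace.NewTracerProvider(
		// WithBatcher wraps the exporter in a batch span processor,
		// reducing the number of export calls over the network.
		sdktrace.WithBatcher(exporter),
		// Keep roughly 10% of new traces, but respect the parent's
		// sampling decision for downstream spans.
		sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.1))),
	)
	otel.SetTracerProvider(tp)
	return tp, nil
}
```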
Types of observability metrics monitored
What metrics matter most when your system faces an unexpected load? Which signals indicate problems before users become aware of them?
Determining the right observability metrics for your specific case enables proactive system management rather than reactive firefighting.
Typical observability metrics include:
- Latency
- Throughput and request rates
- Error rates
- CPU and memory usage
- Disk I/O and storage performance
- Network performance metrics
- Service level objectives
- User experience correlation
1. Latency
Latency measures the time between request initiation and response completion. This metric directly impacts user satisfaction and business outcomes.
Two critical latency types require monitoring: request latency for individual operations and end-to-end latency for complete user workflows. High latency often signals bottlenecks in your system architecture.
2. Throughput and request rates
Throughput measures the volume of work your system completes over time. It counts the number of requests, transactions, or tasks completed successfully. Request rates specifically track incoming demand against your system’s processing capacity.
These metrics help identify capacity limits and scaling requirements. Monitoring request patterns reveals user behavior trends. Peak traffic hours require different resource allocation than quiet periods.
3. Error rates
Error rates measure the percentage of failed requests against total requests. This metric directly correlates with system reliability and user satisfaction. Different error types require different response strategies.
What constitutes an acceptable error rate? Industry standards typically target error rates of less than 0.1% for critical services. However, acceptable thresholds depend on your specific business requirements and user expectations.
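If these first three signals are scraped with Prometheus, they are often expressed with queries along these lines (assuming conventional http_requests_total and http_request_duration_seconds metrics; substitute whatever names and labels your instrumentation actually emits):

```
# Throughput: requests per second over the last 5 minutes
sum(rate(http_requests_total[5m]))

# Error rate: share of 5xx responses
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# Latency: p99 from a duration histogram
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```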
4. CPU and memory usage
CPU utilization measures the processor’s workload against its available capacity. High CPU usage indicates computational bottlenecks or inefficient code execution. Memory utilization tracks RAM consumption and potential memory leaks.
Sustained high CPU usage leads to increased latency and potential system instability. Memory exhaustion causes application crashes and service degradation. Both metrics require proactive monitoring with appropriate alerting thresholds.
With modern approaches, observability systems can also automatically adjust CPU and memory requests and limits in Kubernetes environments, for example.
5. Disk I/O and storage performance
Disk I/O measures read and write operations against storage systems. High disk utilization creates bottlenecks that affect overall system performance. Storage metrics include both throughput (bytes per second) and operation rates (IOPS).
Database-heavy applications particularly benefit from disk I/O monitoring. Slow disk performance cascades through application layers, affecting user experience.
6. Network performance metrics
Network utilization measures bandwidth consumption against available capacity. Packet loss and network errors indicate connectivity problems. Modern microservices architectures depend heavily on network performance.
Newer approaches depend on service mesh implementations that provide detailed network monitoring to identify communication bottlenecks.
7. Service level objectives
Service Level Objectives (SLOs) establish internal target performance levels and benchmarks, providing quantifiable targets that guide engineering decisions and infrastructure design.
SLOs create accountability and drive engineering priorities. For example, an availability SLO of 99.9% over 30 days leaves an error budget of roughly 43 minutes of downtime. Missing SLOs trigger investigation and improvement efforts.
8. User experience correlation
Technical metrics gain meaning through correlation with user experience. High latency correlates with user frustration and abandonment, and error rates directly impact user satisfaction and retention.
Application Performance Index (Apdex) scores quantify user satisfaction based on response times. Apdex translates technical performance into business-relevant satisfaction measurements. Low Apdex scores indicate user experience issues that require attention.
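For reference, the standard Apdex calculation over a chosen response-time threshold T is:

```
Apdex(T) = (satisfied + tolerating / 2) / total samples

# satisfied:  responses completed within T
# tolerating: responses completed within 4T
# anything slower, or failed, only counts toward the total
```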
What makes a metric truly valuable for observability? Actionable metrics enable specific improvements. Vanity metrics that do not drive decisions waste monitoring resources.
Challenges and considerations when implementing observability metrics
While all this sounds great, there are some considerations you should take into account before kickstarting an observability implementation in your organization.
Investment of time and effort
Ensuring people across the organization understand the benefits, familiarize themselves with the tooling, and invest in the implementation can be time-consuming, and it may not be worth the effort for smaller systems or where the cost of downtime is not as significant.
Complexity of tooling
As you’ve hopefully understood by now, observability is not a single service or language but a mix of components, including a specification, a protocol, semantic conventions, libraries and SDKs, and services for data ingestion and visualization. Without the right leadership, it’s easy to get lost or to lose track of who is responsible for what.
Data volume
The amount of logs and emitted tracing information can easily eclipse the amount of actual data being processed or retrieved as part of your typical business operations.
Below is a rather “slim” example of the tracing data that will be pushed to your telemetry infrastructure when a user hits your reverse proxy with a GET /hello request, which the proxy forwards to a backend that responds with a simple “hello”:
OTel-instrumented Nginx reverse proxy
{
"resourceSpans": [{
"resource": {
"attributes": [
{ "key": "service.name", "value": { "stringValue": "nginx-proxy" } },
{ "key": "service.version", "value": { "stringValue": "1.25.0-otel" } },
{ "key": "telemetry.sdk.name", "value": { "stringValue": "opentelemetry" } }
]
},
"scopeSpans": [{
"scope": { "name": "nginx-otel-module" },
"spans": [{
"traceId": "4bf92f3577b34da6a3ce929d0e0e4733",
"spanId": "00f067aa0ba902b7",
"parentSpanId": "0000000000000000", // root span
"name": "GET /hello",
"kind": "SPAN_KIND_SERVER",
"startTimeUnixNano": "1750000000000000000",
"endTimeUnixNano": "1750000000600000000",
"attributes": [
{ "key": "http.method", "value": { "stringValue": "GET" } },
{ "key": "http.target", "value": { "stringValue": "/hello" } },
{ "key": "http.status_code", "value": { "intValue": 200 } },
{ "key": "http.flavor", "value": { "stringValue": "1.1" } },
{ "key": "net.peer.ip", "value": { "stringValue": "203.0.113.17" } }
]
}]
}]
}]
}
Go backend with a /hello handler
{
"resourceSpans": [{
"resource": {
"attributes": [
{ "key": "service.name", "value": { "stringValue": "hello-api" } },
{ "key": "service.version", "value": { "stringValue": "v0.1.0" } },
{ "key": "telemetry.sdk.language", "value": { "stringValue": "go" } }
]
},
"scopeSpans": [{
"scope": { "name": "go.opentelemetry.io/otel" },
"spans": [{
"traceId": "4bf92f3577b34da6a3ce929d0e0e4733",
"spanId": "f9a3bc4d5e6f8a12",
"parentSpanId": "00f067aa0ba902b7", // proxy span
"name": "helloHandler",
"kind": "SPAN_KIND_SERVER",
"startTimeUnixNano": "1750000000200000000",
"endTimeUnixNano": "1750000000500000000",
"attributes": [
{ "key": "http.method", "value": { "stringValue": "GET" } },
{ "key": "http.route", "value": { "stringValue": "/hello" } },
{ "key": "http.status_code", "value": { "intValue": 200 } },
{ "key": "app.logic.time_ms", "value": { "doubleValue": 2.3 } }, // custom
{ "key": "net.host.name", "value": { "stringValue": "hello-svc-7c9b4" } }
]
}]
}]
}]
}
Observability can and will add overhead if the proper rules on sampling, aggregation, and filtering are not applied. Luckily, the ongoing efforts in telemetry already expose multiple configuration options and provide guidance on efficiently implementing them.
Spacelift allows you to connect to and orchestrate all of your infrastructure tooling, including infrastructure as code, version control systems, observability tools, control and governance solutions, and cloud providers.
Spacelift enables powerful CI/CD workflows for OpenTofu, Terraform, Pulumi, Kubernetes, and more. It also supports observability integrations with Prometheus and Datadog, letting you monitor the activity in your Spacelift stacks precisely.
With Spacelift you get:
- Multi-IaC workflows
- Stack dependencies: You can create dependencies between stacks and pass outputs from one to another to build an environment promotion pipeline more easily.
- Unlimited policies and integrations: Spacelift allows you to implement any type of guardrails and integrate with any tool you want. You can control how many approvals you need for a run, which resources can be created, which parameters those resources can have, what happens when a pull request is open, and where to send your notifications data.
- High flexibility: You can customize what happens before and after runner phases, bring your own image, and even modify the default workflow commands.
- Self-service infrastructure via Blueprints: You can define infrastructure templates that are easily deployed. These templates can have policies/integrations/contexts/drift detection embedded inside them for reliable deployment.
- Drift detection & remediation: Ensure the reliability of your infrastructure by detecting and remediating drift.
If you want to learn more about Spacelift, create a free account or book a demo with one of our engineers.
In this article, we explored how observability has evolved from a nice-to-have technology to a can’t-justify-not-having-it technology. We reviewed several implementation aspects, key metrics, best practices, and challenges.
Although it can be challenging to implement observability effectively, it is highly rewarding. Monitoring has evolved toward fully detailed telemetry that correlates logs, metrics, and code execution across services, anchored by key observability metrics.
And people are noticing. Most tech leaders acknowledge that modern, cloud-native, and distributed systems generate data that exceeds humans’ ability to manage it. The solution appears to be doubling down on adopting a unified platform for observability and investing in AIOps, which promises to identify issues faster based on telemetry data.
Take DevOps monitoring to the next level
Spacelift is an infrastructure orchestration platform that allows you to connect to and orchestrate all of your infrastructure tooling, including monitoring, infrastructure as code, version control systems, observability tools, control and governance solutions, and cloud providers.