
Introduction to Observability

Boost system reliability with observability! Learn how to monitor, trace and analyze applications for peak performance & fast issue resolution. Start mastering it now!
Sep 27th, 2025 5:05am
Image by Logan Voss from Unsplash.

What Is Observability?

Observability is not a process but a concept. While much of its potential remains untapped, its utility in DevOps is rapidly expanding beyond monitoring and debugging. Automating analysis, informing decisions and acting on those decisions: these are just a few examples of how observability increasingly supports your IT operations and development processes.

The concept of observability can be applied to anything. When you bake a cake, you can think of the recipe as the API and the ingredients as its inputs: you plug in the ingredients and then do the work of baking the cake.

The observability aspect helps you monitor. Part of it is monitoring the status of the cake: whether all the ingredients are prepared and whether the right quantities have been added. Once that is done, observability would ensure that the cake itself, as it bakes in the oven, is subject to the right humidity, temperature and other variables required to bake it successfully.

Other variables should be monitored in real time. When you see errors and problems occur, that is when you can take action. (“Here is what you need to do to make sure this cake is going to be successfully baked.”)

Once that cake comes out of the oven, it could still be monitored. Any kind of user problem could trigger an alert, so that the cook or baker can step in and address it. All along the way, whenever there are alerts, problems to debug or other hiccups on the path to baking that cake, the observability system should ideally provide fixes and guidance about what you need to do.

What Observability Means in IT

Let’s look at how observability is applied to IT, DevOps, and development and operations processes. Despite what you may have heard, observability is about much more than logs, metrics and traces, once considered the “three pillars of observability.” Clinging to the “three pillars” concept can arguably set back those trying to learn about observability.

True, the three standardized signals tracked through the open source framework OpenTelemetry (OTel) are metrics, logs and traces (or spans). While often referred to as the “three pillars of observability,” they are more accurately understood as signals that must be intertwined and correlated for proper observability.

Traces, a collection of spans, are particularly important in microservice environments, as they show dependencies and the flow of calls through services, aiding in troubleshooting and visualization. Logs provide detailed records and timestamps, and metrics quantify various aspects of system performance. OpenTelemetry helps connect all these pieces at the data layer for full context and troubleshooting, which is vital for understanding application behavior.
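
To make that concrete, here is a minimal sketch, using the OpenTelemetry Python SDK, of how a trace is built from a parent span and child spans that represent calls to downstream services. The service name, span names and attributes are illustrative, not taken from any particular system.

```python
# A minimal sketch of a trace as a collection of spans, using the
# OpenTelemetry Python SDK (opentelemetry-api / opentelemetry-sdk).
# Service and span names here are illustrative examples.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout.instrumentation")

# The parent span represents the incoming request; child spans represent
# downstream calls, so the trace shows the flow of calls through services.
with tracer.start_as_current_span("POST /checkout") as parent:
    parent.set_attribute("http.request.method", "POST")
    with tracer.start_as_current_span("inventory.reserve") as child:
        child.set_attribute("peer.service", "inventory")
    with tracer.start_as_current_span("payments.charge") as child:
        child.set_attribute("peer.service", "payments")
```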

In reality, logs, metrics and traces, taken individually, don’t tell us much. Users who must shift between screens to locate and read each signal separately risk creating silos, when the signals should be intertwined and used together. Proper observability involves processing these signals together as sources of information, building a coherent analysis of what is happening in a system.

Beyond logs, metrics and traces, other aspects of observability are increasingly coming into play. Meanwhile, there’s a movement toward making observability platforms simpler to use.

The definition of observability is evolving, with a primary goal to help users understand the behavior of their software without needing to invest a significant amount of time learning how to use the tools. This simplification is crucial as observability expands beyond developers and engineers to stakeholders like site reliability engineers (SREs), CTOs and product managers, who desire a more streamlined view of telemetry data.

New Relic’s Intelligent Observability Platform, for instance, is positioned as an all-encompassing observability solution designed to be easier for non-operations teams to use.

An observability platform must draw conclusions for the user and solve problems, according to New Relic CEO Ashan Willy. This involves addressing issues either interactively or automatically, and even communicating with other systems of action to automatically resolve issues. The end result is a significant reduction in mean time to detect and mean time to repair.

The goal for any ideal observability platform is to move beyond simply capturing telemetry, putting it in a database, and dashboarding or alerting on it, as that contributes to information overload. Instead, an intelligent data platform should pull everything together, letting users ask questions they didn’t know they had, provide powerful query capabilities and pair with an intelligent action platform that can anticipate the questions users need to ask.

Why Observability Matters

Observability is crucial for modern IT systems because, to give just a few examples, its processes and tools offer:

Enhanced debugging and troubleshooting. By providing detailed insights into system behavior, observability enables faster and more effective debugging. Teams can pinpoint the root causes of issues and understand the impact of changes in near real-time.

Proactive issue resolution. Observability allows teams to detect anomalies and potential problems before they escalate into major incidents. This proactive approach helps maintain system reliability and minimizes downtime.

Optimized performance. Continuous monitoring and analysis of metrics and traces help identify performance bottlenecks and optimize resource usage. This leads to improved system efficiency and user experience.

Informed decision-making. Observability provides actionable insights that inform decision-making processes. Teams can make data-driven choices about system architecture, resource allocation and feature development.

How Do We Implement Observability?

To implement observability effectively, organizations need to:

Adopt comprehensive tools. Use observability platforms that integrate logging, metrics and tracing capabilities. These tools should provide near real-time data visualization, alerting and analytics. Note that most organizations require multiple tools to achieve comprehensive coverage.

Integrate with existing systems. Ensure that observability tools integrate with the current tech stack and support modern development practices such as microservices and containerization. This integration often requires significant effort and customization.

Foster a culture of observability. Encourage cross-functional collaboration and a proactive approach to monitoring and maintaining system health. Educate teams on the importance of observability and best practices for leveraging its benefits.

What Role Does AI Play in Observability?

AI is playing a crucial role in the evolution of observability, making it more accessible and proactive. It empowers observability by making it easier for users, including junior engineers and senior leaders, to understand software behavior without extensive tool learning.

AI assistants help in querying data, a task that would otherwise take significantly longer. Grafana’s AI Assistant, for example, leverages a knowledge graph built from telemetry data and connects to a large language model (LLM), allowing users to ask open-ended questions and get actionable insights quickly. This is particularly valuable for senior leaders who may no longer have the time or familiarity to query systems themselves, and it accelerates onboarding and productivity for junior engineers.

AI and machine learning, along with generative AI, are expected to have a profound impact on observability’s development and use. New offerings will leverage AI and machine learning (ML) to analyze and process telemetry with well-trained LLMs.

In observability solutions powered by AI, the accuracy of AI interpretation and its ability to analyze metrics to communicate actionable insights (semantic observability) are crucial. Both require understanding the decisions a system is making at a deeper level. The reasoning process can be hidden deep within an LLM’s inference process, making it difficult to capture with traditional logs.

The field of observability needs to look deeper into the decision-making mechanisms inside the model to properly diagnose and debug systems.

Data Lakes and LLM Observability

Data lakes are becoming essential for observability, especially with the rise of LLMs and generative AI. An observability platform within a data lake can analyze data while ensuring data sovereignty and security, allowing for continuous training of LLMs for improved AI-assisted data analysis (more on data lakes below).

Secure data lakes deployed in a virtual private cloud are ideal for agentic workflows to enable AI-assisted troubleshooting. Agentic AI troubleshooting apps can be built that generate and execute queries without any data leaving the organization’s boundaries. AI agents rely on large, high-quality datasets that a data lake provides.
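
One way to picture such a workflow, as a hypothetical sketch rather than a reference implementation, is a loop in which a locally hosted model proposes a query, the query runs against the in-VPC data lake and only condensed findings feed the next step. Every function in the snippet below is a placeholder, not a real product API.

```python
# A hypothetical sketch of an agentic troubleshooting loop that keeps data
# inside the organization's boundary. Every function here is a placeholder.

def generate_query(context: str) -> str:
    # Placeholder: in practice a locally hosted LLM would turn the context
    # into a concrete query (PromQL, LogQL, SQL, etc.).
    return f"SELECT severity, message FROM logs WHERE note = '{context[:24]}' LIMIT 50"

def run_query(query: str) -> list[dict]:
    # Placeholder: executed against the in-VPC data lake; raw results never
    # leave the organization's boundary.
    return []

def summarize(context: str, query: str, results: list[dict]) -> str:
    # Placeholder: the local model condenses findings into the next step's context.
    return f"{context} | ran: {query} -> {len(results)} rows"

def investigate(incident_description: str, max_steps: int = 5) -> str:
    context = incident_description
    for _ in range(max_steps):
        query = generate_query(context)
        results = run_query(query)
        context = summarize(context, query, results)
    return context

print(investigate("checkout latency spike at 09:12 UTC"))
```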

Data lakes are also critical for LLM observability, especially with retrieval-augmented generation (RAG) architectures where LLM applications are composed of many calls to LLM models, external functions, databases, and knowledge bases. A data lake can connect all these pieces at the data layer for full context and troubleshooting.

LLM observability allows users to constantly and directly evaluate the model’s and RAG’s quality and reliability, as the evaluation data is ingested into the data lake, maintaining the organization’s control over it.

Datadog is an example of an observability company that is leaning heavily on AI in its solutions, viewing it as a dual strategy that presents both significant opportunities and new challenges. The company is leveraging AI to shift from a reactive approach to a proactive one, aiming for AI agents to automatically handle the majority of issues.

This strategy involves mixing and matching commercial and proprietary AI models, as the lead model for specific tasks can change frequently. Datadog is also training its own models, including a state-of-the-art time series forecasting model.

Datadog’s Internal Developer Portal (IDP) also incorporates AI agents to simplify DevOps and observability for developers, operations engineers and non-technical users. This includes using AI to crawl metadata, architecture diagrams and runbooks to assist with incident resolution. The IDP reflects customer demands to consolidate tools, providing smarter, simpler-to-interpret telemetry data and analytics for a wider range of stakeholders.

While there is much excitement around AI, there is still a lack of clarity on how generative AI might be leveraged to create low-code observability artifacts. Most current generative AI implementations are low-risk and internal.

The challenge for AI in observability includes dealing with the sheer number of tokens returned by query data APIs, which can lead to agents getting confused or hallucinating. Model Context Protocol (MCP) servers can provide an abstraction layer to simplify data structures and remove unneeded fields before returning them to an LLM.
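
As a purely hypothetical sketch of that abstraction layer, not any particular MCP server implementation, the function below trims a raw query-API response down to a handful of fields and caps the row count before anything is handed to a model; the field names and limits are assumptions.

```python
# A hypothetical sketch of the kind of abstraction layer an MCP server can
# provide: a raw query-API response is trimmed to a few stable fields before
# being handed to an LLM, reducing token count and the chance of confusion.
# The field names and the trim list are illustrative assumptions.
from typing import Any

KEEP_FIELDS = {"timestamp", "service", "severity", "message"}

def trim_for_llm(raw_results: list[dict[str, Any]], limit: int = 50) -> list[dict[str, Any]]:
    """Drop unneeded fields and cap the result size before returning to the model."""
    trimmed = []
    for row in raw_results[:limit]:
        trimmed.append({k: v for k, v in row.items() if k in KEEP_FIELDS})
    return trimmed

sample = [{"timestamp": "2025-09-27T05:05:00Z", "service": "checkout",
           "severity": "error", "message": "timeout", "k8s_pod_uid": "abc123", "labels": {}}]
print(trim_for_llm(sample))
```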

In the longer term, generative AI-enabled conversational interfaces are expected to facilitate technology commercialization, democratizing AI and other technologies.

What Is OpenTelemetry?

OpenTelemetry (OTel) provides a standardized, vendor-neutral framework and toolkit for observability, unifying telemetry data like logs, metrics and traces. It is not a tool or a platform, but rather an approach akin to best practices and standards for platform engineering or DevOps.

A key feature is its standard data format and protocol, allowing different observability tools to be used interchangeably or together without needing to re-instrument or start over when changing providers.

OpenTelemetry is a Cloud Native Computing Foundation project, similar to Kubernetes or Prometheus. OTel arose from the merger of earlier observability standards (OpenTracing and OpenCensus); it standardizes how data is collected and generated from applications or infrastructure.

Before OpenTelemetry, vendors used proprietary agents to gather observability data. But now, OTel offers an open source API and SDK, allowing developers to inject logs, metrics and traces into their code based on a standard.
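
As a rough illustration of that standards-based approach, the hedged sketch below uses the OpenTelemetry Python API and SDK to record a counter metric from application code and export it periodically; the metric name, attributes and export interval are assumptions for the example.

```python
# A minimal sketch of standards-based instrumentation with the OpenTelemetry
# Python SDK: a counter metric recorded from application code and exported
# periodically. Names and attributes are illustrative.
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)

reader = PeriodicExportingMetricReader(ConsoleMetricExporter(), export_interval_millis=5000)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("checkout.instrumentation")
request_counter = meter.create_counter(
    "http.server.requests", description="Count of handled HTTP requests"
)

def handle_request(route: str, status_code: int) -> None:
    # Each increment carries attributes that any OTLP-compatible backend
    # can aggregate and query, regardless of vendor.
    request_counter.add(1, {"http.route": route, "http.response.status_code": status_code})

handle_request("/checkout", 200)
```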

Most vendors, including Datadog, Dynatrace, Elastic and Grafana, are OpenTelemetry compliant, meaning they can ingest data instrumented with OpenTelemetry. This also enables vendors like Vercel or Cloudflare to incorporate OpenTelemetry into their serverless stacks, allowing users to obtain application data in the OpenTelemetry standard protocol (OTLP).

OpenTelemetry also provides agents for languages like Java and Node.js, which automate the instrumentation process for users who cannot or prefer not to do it manually. The OpenTelemetry Collector acts as an agent, receiving data from the API, SDK or auto-instrumentation, and then sending it to a backend. The collector is a standard, downloadable component, as are the agents, and there is a specification for the API and SDK.
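
A minimal sketch of the application side of that pipeline might look like the following, sending spans over OTLP/gRPC to a Collector listening on its default local endpoint; the endpoint and the opentelemetry-exporter-otlp package are standard defaults assumed for this example, not specifics from the article.

```python
# A sketch of exporting spans to a locally running OpenTelemetry Collector
# over OTLP/gRPC (the Collector's default gRPC listener is localhost:4317).
# Requires the opentelemetry-exporter-otlp package.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

# From here, the Collector receives the data and forwards it to whatever
# backend its own configuration points to.
```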

A crucial component is the semantic convention, which standardizes the tagging system used for data. This resolves previous issues where different vendors used varied naming conventions for tags (e.g., Kubernetes.pod.name vs. k8s.pod.name), making it difficult to combine data. This standardization allows for easier correlation of data and a better understanding of system behavior.
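
As a small sketch of what the semantic conventions look like in code, the resource below uses standardized keys such as service.name and k8s.pod.name; the attribute values are invented for illustration.

```python
# A sketch of OpenTelemetry semantic conventions in practice: resource
# attributes use standardized keys (service.name, k8s.pod.name) so that
# any backend can correlate the data. Values here are illustrative.
from opentelemetry.sdk.resources import Resource

resource = Resource.create({
    "service.name": "checkout",
    "service.version": "1.4.2",
    "k8s.namespace.name": "shop",
    "k8s.pod.name": "checkout-5f7c9d-abcde",
})

# Attached to a TracerProvider or MeterProvider, these attributes travel
# with every span and metric, replacing vendor-specific tag names.
print(resource.attributes)
```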

OpenTelemetry is still under development. But it has significantly transformed observability by offering standards and benefits across metrics, logs and traces. Its success relies on continued community support. Leading observability providers such as Datadog, Grafana, Honeycomb and Splunk support and contribute to OpenTelemetry.

The Challenges with OpenTelemetry

Despite its advantages, OpenTelemetry is not a “magic button” and its usefulness depends on the observability tools used with it. It is not designed to replace observability platforms, but rather to unify the data for deeper analysis.

There have been compatibility challenges, particularly with Prometheus, due to differences in design philosophy. OpenTelemetry is more of a push-based metrics system, while Prometheus is pull-based.

Before Prometheus 3.0 was released in November 2024, integration with OpenTelemetry was challenging. Issues included how default resource attributes were sent to a separate metric (target_info), making queries harder. Prometheus 3.0 introduced improvements, such as the “Promote Resource Attributes” configuration option, to address this.

Other compatibility problems in Prometheus prior to version 3.0 included a lack of support for dots in metric names (a character used by OpenTelemetry), and other fundamental limitations related to UTF-8 support. Prometheus natively supports only cumulative metrics, while systems like Datadog, OpenTelemetry and StatsD can push deltas; native support for deltas in Prometheus is still being figured out.
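
To illustrate the cumulative-versus-delta difference with made-up numbers, the snippet below contrasts the cumulative series Prometheus stores natively with the per-interval deltas a push-based pipeline would send.

```python
# Cumulative vs. delta temporality for the same counter, illustrated with
# made-up sample values: Prometheus natively stores the cumulative series,
# while delta-based systems push only the per-interval increments.
observations = [3, 5, 2, 4]  # events observed in each interval
cumulative = [sum(observations[: i + 1]) for i in range(len(observations))]
deltas = observations

print(cumulative)  # [3, 8, 10, 14] -> what Prometheus stores
print(deltas)      # [3, 5, 2, 4]   -> what a delta-based pipeline pushes
```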

Compatibility problems with data formats, specifically histograms and the data forwarding protocol, also remained an issue. Native histograms and a new feature for custom buckets were introduced to address these. Remote Write 2.0 also improves efficiency and supports newer Prometheus features like native histograms.

Despite ongoing issues, work is being done by the open source community to overcome them in upcoming Prometheus releases. A 2025 Grafana survey showed that the majority of enterprises use both OTel and Prometheus, indicating they are complementary rather than mutually exclusive.

In 2024, Elastic committed to further integrating Elasticsearch with OpenTelemetry to enhance search experiences, recognizing OpenTelemetry’s pivotal role in seamless data monitoring and analysis.

Why the OpenTelemetry Profiler Matters

The OpenTelemetry Profiler, expected to be finalized in 2025, will extend observability analysis to the code level, enabling deeper analysis of metrics, traces and logs for faster problem identification and resolution. Grafana Labs is heavily involved in the Profiler’s development and offers open source Beyla for tracing via eBPF, or extended Berkeley Packet Filter, a Linux kernel technology that is closely linked to profiling.

OTel’s Profiler is useful because it extends observability analysis to the code level. It enables deeper analysis of metrics, traces and logs by extending the unified telemetry stream down to the code level for applications throughout the network. This means that when a problem arises, such as a slow CPU or a long end-user request, the profile can pinpoint the code at issue.

With additional observability tools, fixes should be provided faster as users can more easily pinpoint problem code through their queries.

Austin Parker, director of open source for Honeycomb, has described how profiles offer support for bi-directional links. This allows users to dig deeper from telemetry data to the corresponding profile at a code level. For example, in a post on the OpenTelemetry blog, Parker described:

  • Metrics to profiles: Spikes in CPU or memory usage are translated into the code consuming the resources at runtime.
  • Traces to profiles: In addition to pinpointing high latency across the network, the profile attached to a trace or span reveals the code responsible for the high latency.
  • Logs to profiles: Beyond using logs for tracking issues like out-of-memory errors, the code responsible for extra memory consumption is shown for further analysis.

Elastic has also donated its Continuous Profiling Agent to the OpenTelemetry profiling community.

This donation is significant because it provides whole system profiling using eBPF, allowing users to profile not only their application processes but all running processes. This helps users tie code changes to performance degradation and identify if other elements, like third-party agents, are impacting performance.

The Elastic profiling agent can be integrated into the existing tool ecosystem, accelerating the availability of profiling to users and integrating it with existing signals.

Specific benefits of Elastic’s continuous profiling agent to the OpenTelemetry project include:

  • Continuous profiling data complements existing signals (logs, metrics and traces) by providing detailed, code-level insights on service behavior.
  • Seamless correlation with other OpenTelemetry signals, such as traces, increasing fidelity and investigatory depth.
  • Estimating environmental impact by combining profiling data with OTel’s resource information, allowing for insights into a service’s carbon footprint.
  • A detailed breakdown of service resource usage, which provides actionable information for optimization.
  • A vendor-agnostic eBPF-based profiling agent removes the need to rely on proprietary agents for profiling telemetry.

The donation from Elastic will allow users to profile all running processes on a system and understand their contribution to CPU consumption. This information can then be used to optimize CPU consumption, thereby reducing the amount of energy the CPU consumes. Datadog is also a main contributor to the development of OpenTelemetry and its Profiler.

Splunk has also begun the process of donating its .NET Profiler, which will allow OpenTelemetry to capture profiles from C#, F# and other .NET applications. The work for Splunk’s Profiler for OpenTelemetry is ongoing.

The OTel Profiler is considered a milestone for open source and observability. Continuous profiling signals could be as critical as metrics, traces, and logs data. While profiling has been available to the public in various forms for over six years, it is not yet widely known or used in the industry as much as metric, log and trace analytics. With the addition of profiling to OpenTelemetry, continuous production profiling is expected to hit the mainstream.

The roadmap for the OTel profiles project includes the following features:

  • Profiles Data Model.
  • Profiles API.
  • Profiles SDK.

What Are Data Lakes?

Data lakes have become a critical component for many organizations for business analytics, product execution and observability. They are seen as a necessity for observability, serving as a single repository for data needed for monitoring and debugging.

Data lakes also provide capabilities for inferences, deep analytics and discovering problems before they occur, which is not possible without them. Achieving maximum observability is arguably contingent on properly applying AI to the telemetric data in a data lake.

A data lake is defined by Gartner as a semantically flexible data storage repository combined with one or more processing capabilities. Most data assets are copied from diverse enterprise sources and stored in their raw and diverse formats, so they can be refined and repurposed repeatedly for multiple use cases.

A data lake should ideally store and process data of any container, latency or structure, such as binary large objects (BLOBs), documents, files, formats, messages, result sets and tables.

Organizations that do not have a data lake strategy in place are already missing out, and many may be surprised by how feasible and accessible data lakes are for observability.

How to Create a Data Lake

The creation of a data lake should not require an organization to completely re-instrument its data flows or create separate ingresses and APIs for separate data streams for telemetry data. An observability data lake should be able to accept data from the entire application stack, integrating different data sets to build context.

Without the data collection a data lake offers, there is little flexibility in bringing in data from telemetry sources across the entire application stack. These sources include Prometheus for metrics, Jaeger for traces and Loki for logs.

With a data lake, all telemetric data is combined without the need to reconfigure and manage separate data feeds. The data resides in the backend together, allowing users to access logs, metrics and traces together when running queries or using dashboards. This enables users to more directly get to the root cause when troubleshooting, instead of trying to combine telemetry data from different SQL queries.

Organizations that have invested in instrumentation for decades do not want to re-instrument their applications and infrastructure. This is why a schema-less data lake is critical, so it can accommodate all data types without a predefined structure.

Data does not have to be structured or parsed before ingestion; grok scripts are not required, and there is no need to spend hours pre-processing or tagging data before it can be used for observability.

Any data type can be channeled to and stored in a data lake, including data from containers, documents, email logs, files, Slack and spreadsheets, in addition to logs, metrics and traces. Users should be able to simply point their collector to the data lake, where it is parsed and integrated.

The integration of all observability streams involves unifying telemetry data to map and link relevant data sets together. All telemetry data is in one data lake, with one query language and one consistent UI for faster correlations and troubleshooting. Some data lakes, like Kloudfuse, allow access to data through GraphQL, LogQL, PromQL, SQL or TraceQL, which is not true for all vendors.

A data lake for observability is a centralized repository for correlating telemetry data, which an observability provider should offer. Also required is the integration of other public or private data sources to create AI agents tailored to specific use cases, such as root-cause analysis, troubleshooting, predictions and infrastructure-as-code support.

However, an off-the-shelf online analytical processing (OLAP) solution, or a data lake that does not fit these criteria, cannot accurately be described as an observability data lake.

The Benefits of Data Lakes

A data lake removes data silos. Without data silos, relationships can be drawn between entities such as logs, metrics, traces and more. Users can ask any question about interdependencies in their distributed system. They can quickly drill from user sessions (RUM/frontend observability), to services, to metrics and then pivot to logs, all while maintaining context, allowing for much faster insights during troubleshooting.

As the data in them expands, observability data lakes deployed on premises can offer demonstrable cost savings. Customers can balance the level of granularity needed for analysis and root-cause discovery against the cost they want to pay. They get fixed costs instead of per-usage or per-call charges from a vendor, with no overages or egress fees to transfer data to a Software as a Service (SaaS) observability vendor.

This level of control is critical for cost constraints, as each analysis and query is costly, especially with a pay-for-usage SaaS platform like Datadog. Many organizations trim their data before sending it to vendors for observability.

Logs and traces can account for a lot of data, often referred to as high cardinality data. Having the data lake in-house allows dynamic decisions on when to perform deep analysis with a lot of data (e.g., during troubleshooting) and when to hold back and only look at aggregates to reduce compute costs on Amazon Web Services or Google Cloud Platform usage.

A real-time OLAP design, closer to a lakehouse concept, can enable real-time analytics, monitoring, and alerting. A real-time data lake can handle large volumes of data and many concurrent queries with very low query latencies. Queries can have ultra-low latency, high query concurrency, or high data freshness (streaming data available for query immediately upon ingestion).

Unlike proprietary observability solutions, observability with a data lake allows data storage to be handled by low-cost storage. It can be built to scale as volumes grow without extra costs. There is no single point of failure; if tables are configured for replication and a node goes down, the cluster can continue processing queries. For horizontal scalability, a cluster can be scaled by adding new nodes when the workload increases.

The Challenges with Data Lakes

While data lakes can foster real-time processing and help meet scaling needs, they lack the built-in intellectual property that makes a platform purpose-built for observability. Schemaless ingest and real-time analytics must still be provided. An OLAP system or a generic data lake can be a solid starting point, but neither becomes an observability data lake without schemaless ingest from open source or vendor-specific agents that makes data readily available for the real-time monitoring and alerting observability requires.

Observability data lakes need to ensure fast query performance and ultra-low query latencies for the high-concurrency workloads typical of root cause analysis and troubleshooting. For this, proper observability data lakes should provide purpose-built indexes for queries and analysis.

Additionally, for the management of storage and high-cardinality and dimensional telemetry data, observability data lakes should provide decoupling of storage and compute, as well as aggregations, deduplication, and compression techniques to ensure proper storage of observability data volumes.

Data Lakes and LLM Observability

Data lakes are critical to LLM observability for two reasons. With a retrieval-augmented generation (RAG) architecture, LLM applications are composed of chains of calls, some to LLM models, some to external functions, databases and knowledge bases. Connecting all these pieces together at the data layer requires a strong backend data lake.

Similar to conventional observability, these calls are tracked through traces and spans, identifying latency and performance and relating failures to other telemetry data like logs and real user monitoring. A data lake is capable of relating all these data sets together for full context and troubleshooting.
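
A minimal sketch of that idea, using the OpenTelemetry Python API with placeholder retrieval and model-call functions, might instrument a RAG request so that retrieval and generation appear as spans within one trace; the span names and attributes are illustrative, not a prescribed schema.

```python
# A sketch of tracing a RAG chain with OpenTelemetry spans, so retrieval and
# LLM calls show up as one trace that a backend data lake can correlate with
# logs and other telemetry. retrieve() and call_llm() are placeholders for
# the application's own components.
from opentelemetry import trace

tracer = trace.get_tracer("rag.instrumentation")

def retrieve(question: str) -> list[str]:
    return ["doc-1", "doc-2"]  # placeholder retrieval from a knowledge base

def call_llm(question: str, context_docs: list[str]) -> str:
    return "answer"            # placeholder model call

def answer(question: str) -> str:
    with tracer.start_as_current_span("rag.request") as root:
        root.set_attribute("rag.question.length", len(question))
        with tracer.start_as_current_span("rag.retrieval") as span:
            docs = retrieve(question)
            span.set_attribute("rag.documents.count", len(docs))
        with tracer.start_as_current_span("llm.generate") as span:
            result = call_llm(question, docs)
            span.set_attribute("llm.response.length", len(result))
        return result

print(answer("Why did checkout latency spike?"))
```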

Additionally, many calls are made to augment LLM responses for accuracy or domain knowledge that a general-purpose LLM does not offer. With LLM observability, the user can constantly and directly evaluate the model’s and RAG’s quality and reliability since it is ingested in the data lake to which only the organization should have access, which is not necessarily the case without a data lake.

When adding data to fine-tune an LLM application, the organization does not want to send the evaluation of their model outside of their zero trust security layer.

Future Trends in Observability

The realm of observability is always progressing, propelled by advancements and shifting consumer demands. Companies must grasp these developments to remain at the forefront of their observability strategies. They should make the most of cutting-edge solutions to enhance system performance and dependability.

Still, observability as applied to CI/CD, DevOps, Kubernetes management and anyone with a stake in an IT environment has only begun to expand significantly beyond monitoring and debugging.

The gap between traditional observability (logs, metrics and traces) and the needs of AI systems is significant. Traditional methods look at costs, API calls, and token counts, as well as delays and output. However, the concept of observability needs to be expanded for AI and MCPs.

Learn More About Observability at The New Stack

Here at The New Stack, our primary focus is to keep you up to date on the advancements and recommended strategies in observability. As technology and software development progress, it’s crucial to stay informed about the trends and tools to ensure your systems remain robust, efficient and high-performing.

We offer articles, guides and real-life examples that delve into facets of observability. This includes assessments of tools, practical tips for implementing observability across various environments, as well as insights into emerging trends like integrating AI and serverless observability. Our content aims to help you use observability effectively to boost system performance, increase reliability and enhance user satisfaction.

Through our platform you’ll find expert insights from industry professionals who share their knowledge and experiences with observability. Learn from implementations in the field. Gain valuable advice on how to overcome common challenges for successful results.

Become part of our community of developers, DevOps experts and IT professionals who are enthusiastic about observability. Take advantage of our range of resources to improve your techniques. By staying connected with The New Stack, you’ll always be at the forefront of observability trends armed with the knowledge and tools required to navigate the intricacies of IT environments. Drop by thenewstack.io for all the updates.
