
How Adaptive Telemetry can help you combat data overload

Do you ever find yourself struggling to get to the data that matters? In the modern world, information is more abundant than ever before.

It’s no different for organisations dealing with observability data. Today, data overload is becoming a common issue across many organisations because of the sheer volume of data they are ingesting.

Grafana Labs, a leader in the observability space, has come up with an effective way to tackle this problem, using Adaptive Telemetry to reduce the volume of data that needs to be stored (by an average of 40%), all without compromising the stability of the organisation or its operations. Customers see tremendous cost benefits and, importantly, they can still solve key business problems with their observability data.

Currently, their Adaptive Telemetry suite is included in their full stack observability platform, Grafana Cloud, and is also now offered as part of BYOC deployments.

At Grafana’s ObservabilityCON last week in London, the company announced that one of their newer additions to the suite, Adaptive Traces, is now generally available, while their newest addition, Adaptive Profiles, is in private preview.

To understand more about how Adaptive Telemetry works, and how it can help organisations combat data overload to optimize costs and extract signal from noise, we spoke with Sean Porter, Distinguished Engineer at Grafana Labs and founder of TailCtrl, a company specialising in adaptive trace sampling that has since been acquired by Grafana.

Read on to find out more about how Adaptive Telemetry works, how it is making engineers’ lives easier, and what its co-evolution with AI means for the future.

How does Adaptive Telemetry work?

Adaptive Telemetry is essentially a smart form of data reduction. It uses machine learning capabilities, in combination with human inputs and best practices drawn from Grafana's industry expertise, to intelligently capture valuable data and discard the rest.

Given that there are many types of data, Grafana’s Adaptive Telemetry suite has been developed to provide unique sampling solutions for different data signals. These include: logs, metrics, traces, and profiles.

The behaviour of each type of signal and what it represents within the organisation determines the approach taken towards optimizing it. These approaches reflect the different priorities and applications of each type of signal.

For example, most organizations produce many partially or entirely unused metrics; usage-based recommendations to aggregate away unused labels are a highly successful approach for this signal.

Meanwhile, for traces, tail sampling based on key attributes like status, combined with intelligently sampling anomalous telemetry, is an effective approach.

“In our Adaptive Telemetry Suite, we essentially have one solution per signal, because they’re very tailored to the type of signal, their behaviour. We couldn’t make just one generic adaptive solution that works for all these different signal types. And that gives us a much narrower, more constrained focus of building these feature sets around the behaviours of each signal.” ~ Sean Porter, Distinguished Engineer at Grafana Labs

As Porter points out, the development of specific approaches towards each type of signal is particularly important given that the data sent to Grafana Labs to be sampled doesn’t come with any context that could provide other clues to its importance or utility.

“Somebody else is sending us the data, so we don’t know ahead of time the context of the data (which would make it a lot easier). So we have to develop approaches to achieve the results we need without knowing the nature of the data.”

So far, Grafana’s approach has had outstanding results across a variety of use-cases, even when this data comes from very different systems and industries.

Nevertheless, Adaptive Telemetry still selectively leverages user inputs and rules, allowing the humans who know their organizational constraints best to steer the system or place guardrails on it.

“Until somebody says this doesn’t work for us, we won’t know. But so far, our approaches work across everything, from regular system metrics, to application metrics, to even IoT (Internet of Things).”

Below, we consider the four different signals that can be sampled in Grafana’s Adaptive Telemetry suite, and the key differences between them.

Adaptive Logs

Logs are a traditional form of data, used even back in early human civilisations by sailors to keep a record of the events of each day. Within software, they simply refer to a record, written in natural language, of every event that happens within a given system.

As you can imagine, this means that the volume of logs is vast, even for simple applications. Furthermore, since logs are written in natural language, it takes a long time to manually analyse them to identify optimization opportunities.

“For logs, we use the DRAIN Algorithm to determine patterns in people’s logs. This allows us to identify ones which look to be of low value, so that they can drop a certain percentage of ones that are basically just noise.”

While DRAIN is used in other parts of the Grafana platform, within Adaptive Logs it is tailored specifically to identify longer-running patterns – perfect for spotting repeated, low-value log patterns that can be reduced via sampling.

Currently, this algorithm enables a drop rate of 20-30% on logs. While this rate may seem conservative, the caution is in fact for the user’s benefit, given that once logs are dropped, they are permanently lost.
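To make the idea concrete, here is a minimal sketch of Drain-style log pattern mining using the open-source drain3 library – an assumption on our part, since Grafana’s internal implementation is not public, and the thresholds below are purely illustrative. Log lines are clustered into templates, and templates that dominate the volume are flagged as candidates for partial dropping.

```python
# A minimal sketch of Drain-style log pattern mining, using the open-source
# drain3 library (an assumption; Grafana's internal implementation will differ).
# Idea: cluster raw log lines into templates, then flag high-frequency,
# repetitive templates as candidates for partial dropping.
from collections import Counter

from drain3 import TemplateMiner

miner = TemplateMiner()
template_counts = Counter()

sample_logs = [
    "connection health check passed",
    "connection health check passed",
    "connection health check passed",
    "connection health check passed",
    "user 1042 logged in from 10.0.0.7",
    "payment failed for order 456: card declined",
]

for line in sample_logs:
    result = miner.add_log_message(line)
    template_counts[result["template_mined"]] += 1

# Illustrative policy: any template accounting for >50% of volume is treated
# as noise and could be sampled at, say, 25% instead of kept in full.
total = sum(template_counts.values())
for template, count in template_counts.most_common():
    noisy = count / total > 0.5
    print(f"{count:>2}x  {'DROP-CANDIDATE' if noisy else 'KEEP'}  {template}")
```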

“With logs, we just drop the unnecessary data, so it’s fully gone. This sometimes scares people, so we are working on some interesting stuff in this space where we can archive the data in low-cost storage rather than dropping it completely.”

Adaptive Metrics

Metrics are the measurement of some numeric value for a given object. They have more utility than logs when it comes to statistical analysis, and are typically used to track the performance of a given operation, analyse a specific type of data, or assess the conditions of various fields/industries.

In this sense, they are a more ‘living’ form of data than logs, and can provide real-time insights into current dynamics of a system.

They are aggregated – grouped into smaller footprint versions of themselves – according to their usage. This is done in such a way that all of your existing operations – metrics used in dashboards, alerts, and ad hoc queries – are preserved with full fidelity.

“Adaptive metrics are usage-based, so looking at whether the metrics are being used in dashboards and alerts, or ad-hoc queries and incident investigations.”

This approach has resulted in customers reducing their metrics footprint by 50% or more without compromising their effectiveness. Many customers are also taking advantage of a feature currently in preview that continuously and automatically applies the latest set of optimizations, taking the human review burden out of the loop entirely.

The metrics that are not captured are not fully dropped; instead they are aggregated, which reduces their volume significantly, but still keeps the option open for users to disaggregate them easily if and when needed.

“In metrics, we don’t like to drop data. Instead we aggregate it. So we go from tens of thousands of series to one or two. So that means that our customers still have the data there to dis-aggregate if needed.”
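The mechanics of usage-based aggregation are easiest to see in a toy example. The sketch below is purely conceptual – the series, label names, and values are invented – but it shows how aggregating away an unused label (here a hypothetical "pod" label) collapses many series into a few while preserving the totals that dashboards and alerts actually query.

```python
# A conceptual sketch of aggregating away an unused label, collapsing many
# series into a few while preserving the totals that queries rely on.
from collections import defaultdict

# Hypothetical raw series: (metric name, frozenset of label pairs) -> value
raw_series = {
    ("http_requests_total",
     frozenset({("service", "checkout"), ("pod", "checkout-7f9d")})): 120.0,
    ("http_requests_total",
     frozenset({("service", "checkout"), ("pod", "checkout-a1b2")})): 95.0,
    ("http_requests_total",
     frozenset({("service", "cart"), ("pod", "cart-x9z8")})): 60.0,
}

def aggregate_away(series, label_to_drop):
    """Sum together series that become identical once `label_to_drop` is removed."""
    aggregated = defaultdict(float)
    for (name, labels), value in series.items():
        kept = frozenset(kv for kv in labels if kv[0] != label_to_drop)
        aggregated[(name, kept)] += value
    return dict(aggregated)

# Three per-pod series collapse into two per-service series.
print(aggregate_away(raw_series, "pod"))
```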

Currently, disaggregation is done manually, but this could change in the near future.

“Right now if somebody keeps querying, and they keep getting an aggregated set, either they can hit a button to disaggregate the data directly, or maybe it’s clear to us that they need it, and so we’ll disaggregate it for them. But we’re experimenting with capabilities to do this in real-time.”

Adaptive Traces

Traces are a more comprehensive form of data than metrics. As an event occurs and travels through the system, it leaves a trace, i.e. an internal record of that journey from start to finish.

This record consists of spans, which represent accounts of single events – essentially structured logs. Spans call other spans in what is best described as a parent-child relationship.

At the beginning of a trace, there is a root span, i.e. the one that is created initially. This span then calls its ‘child’ span, which in turn then calls its ‘child’ span, and so on until you reach the end of the trace, where you have what are known as ‘leaf spans’ because they don’t call anything else.
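The structure described above can be pictured as a small tree. The sketch below is a toy representation for illustration only – the field names are ours, not those of any particular tracing SDK.

```python
# A toy representation of a trace: a root span, child spans, and leaf spans
# at the edges. Field names are illustrative, not from any real tracing SDK.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Span:
    name: str
    parent: Optional["Span"] = None
    children: List["Span"] = field(default_factory=list)

    def child(self, name: str) -> "Span":
        span = Span(name, parent=self)
        self.children.append(span)
        return span

    @property
    def is_leaf(self) -> bool:
        return not self.children

root = Span("HTTP GET /checkout")        # root span: created at the entry point
db = root.child("SELECT orders")          # child span
cache = root.child("GET redis:cart")      # child span (also a leaf)
encode = db.child("encode response")      # leaf span: calls nothing else

print(root.is_leaf, encode.is_leaf)       # False True
```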

A single trace, then, by itself incorporates quite a lot of data, which is why Adaptive Telemetry is so critical for this signal.

“There’s tens of thousands of traces being produced every second for an application. That’s why there’s so much data, and why Adaptive Traces is really important.”

When it comes to the sampling approach, spans play a key role, essentially forming the basis for anomaly detection, which then informs the sampling policy for traces.

“For anomaly detection, we generate time series for the spans, called span metrics. Then we create RED (rate, errors, duration) metrics for each of those spans, and use forecasting models on those metrics. When there are anomalies in the RED metrics, we change our sampling policy of what traces we collect.”

In contrast to logs and metrics, the sampling approach taken to traces is an indirect one, which mirrors the more complex composition of trace data itself.

“We don’t do anomaly detection yet on the traces themselves, instead we’re doing it on the time series that we’re generating from them.”
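A simplified sketch of that flow is shown below: derive RED metrics from spans, check them for anomalies, and adjust the trace sampling rate accordingly. The span schema, thresholds, and sampling rates are invented for illustration, and a crude z-score check stands in for the forecasting models used in production.

```python
# Simplified flow: spans -> RED metrics -> anomaly check -> sampling rate.
# A crude z-score check stands in for real forecasting models.
import statistics

def red_metrics(spans):
    """spans: list of dicts with 'duration_ms' and 'error' keys (illustrative schema)."""
    return {
        "rate": len(spans),
        "errors": sum(1 for s in spans if s["error"]),
        "duration_p50_ms": statistics.median(s["duration_ms"] for s in spans),
    }

def is_anomalous(history, current, threshold=3.0):
    """Flag the current value if it sits far outside the recent history."""
    mean, stdev = statistics.mean(history), statistics.pstdev(history) or 1.0
    return abs(current - mean) / stdev > threshold

def choose_sampling_rate(error_history, current_errors,
                         baseline=0.01, boosted=0.5):
    # Keep ~1% of traces normally; jump to 50% while errors look anomalous.
    return boosted if is_anomalous(error_history, current_errors) else baseline

recent_error_counts = [2, 1, 3, 2, 2, 1]
print(choose_sampling_rate(recent_error_counts, current_errors=2))   # 0.01
print(choose_sampling_rate(recent_error_counts, current_errors=40))  # 0.5
```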

Similarly, in contrast to the other signals where reduction in data volume is the key goal, Adaptive Traces is really aimed at increasing the amount of useful trace data. This is important given that most organisations drop almost all of their traces. Thus, Adaptive Traces is actually a mechanism to introduce more observability into an organisation’s data flow.

“9 out of 10 organisations drop 99% of their traces already, just because of the sheer volume. They use head sampling, which is just making a decision of what to keep from the very beginning of a request, so from the entry point where you have no idea of what’s going on. What we’ve created with Adaptive Traces is to relax that small window, to admit and produce a lot more traces. Then we do tail sampling to be more intelligent about what we keep, using policies that align with the users’ goals.”
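The difference from head sampling is that a tail-sampling decision is made only once the whole trace has been assembled, so the policy can look at error status and end-to-end latency. The minimal policy below is our own illustration, not Grafana’s actual policy engine, and the thresholds are invented.

```python
# A minimal tail-sampling policy sketch: the decision is made after the whole
# trace is assembled, so error status and total latency are visible.
import random

def keep_trace(trace, slow_ms=1000, baseline_rate=0.05):
    """trace: dict with 'spans' (list of dicts) and 'duration_ms' (illustrative schema)."""
    if any(span.get("error") for span in trace["spans"]):
        return True                          # always keep traces containing errors
    if trace["duration_ms"] > slow_ms:
        return True                          # always keep unusually slow traces
    return random.random() < baseline_rate   # keep a small sample of the rest

trace = {"duration_ms": 42, "spans": [{"name": "GET /healthz", "error": False}]}
print(keep_trace(trace))  # usually False: healthy, fast traces are mostly dropped
```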

Sean Porter explaining Adaptive Traces and why it’s so important at Grafana’s ObservabilityCON 2025. Photo by Richard Theemling photography, courtesy of Grafana Labs.

Adaptive Profiles

Profiles are a specialised form of data that enables you to look at the resource usage of a given application. You can do this either by instrumenting your application with some code, or by running an agent on the system, via eBPF, that can see how much of the CPU or memory is being used for a given operation at high resolution (100 times per second).

Profiling data thus consists of continually emitted measurements of how many resources the various functions of an application are using.

Most importantly, profiles allow you to associate a performance issue of an application with a specific line of code that has triggered the issue, even indirectly.

The sampling approach taken towards profiles is anomaly-based, i.e. it focuses on when a profile shifts from its usual shape.

“With profiles, we’re experimenting to see if we can generate metrics like we do from traces, but one of the things we use at the moment for profiles is isolation forests as a way of observing a profile over time, and seeing if its shape changes or shifts.”
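Here is a sketch of how an isolation forest might be used to spot a profile whose shape has shifted, using scikit-learn’s IsolationForest. Representing each snapshot as the share of CPU time spent in a handful of functions is a simplifying assumption made for illustration; it is not how Grafana represents profiles internally.

```python
# Sketch: use an isolation forest to flag a profile whose "shape" has shifted.
# Each snapshot is the share of CPU time spent in a few functions
# (a simplified feature representation chosen for illustration).
from sklearn.ensemble import IsolationForest

# Columns: CPU share of [handle_request, serialize_json, query_db]
baseline_profiles = [
    [0.52, 0.18, 0.30],
    [0.50, 0.20, 0.30],
    [0.55, 0.17, 0.28],
    [0.51, 0.19, 0.30],
    [0.53, 0.18, 0.29],
    [0.54, 0.17, 0.29],
]

model = IsolationForest(contamination="auto", random_state=0).fit(baseline_profiles)

new_snapshots = [
    [0.52, 0.19, 0.29],   # looks like the usual shape
    [0.20, 0.65, 0.15],   # serialization suddenly dominates: a shifted shape
]
print(model.predict(new_snapshots))  # 1 = looks normal, -1 = flagged as anomalous
```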

Often, these shifts (especially spikes) relate to an engineer deciding to release or capture more data, which then uses up more processing power and memory. The power of profiles lies in their diagnostic ability to link these events to system performance via just one line of code.

“For a CPU profile, for example, we would be looking at how much CPU time a given application uses for a set of given operations. Then we would observe this over time to see if and when this shifts, why, and whether it was linked to a release that somebody made that is then causing the performance regression. And through profiles we can literally point to the line of code that triggered the shift.”

Adaptive Profiles helps customers retain only the data resolution needed to extract these optimization insights.

How Grafana is making engineers’ lives easier with its Adaptive Telemetry suite

The recent developments in Grafana’s Adaptive Telemetry suite are increasingly focused around the goal of making engineers’ lives easier. While the evolution of AI and automation has theoretically taken some of the strain of digital workloads off engineers, the tangible benefit of this is yet to be felt in many industries. Primarily, this is because most adopters of these technologies are in it for the cost benefit more than anything else.

This was true for Grafana too when the company started the suite with just logs and metrics. But the newer releases, Adaptive Traces and Adaptive Profiles, represent a shift in Grafana’s priorities from cost management to user experience.

“When we did adaptive metrics and logs, cost reduction was the primary value. But adaptive traces is most importantly about reducing noise and providing clarity to the developer and user of the trace, essentially reducing toil for them because there’s less stuff to wade through to find the actual value. Cost-management is always gonna be a big part of it, but I think we’ll see that as time goes on, only the minor value will be in the cost management side.”

As is to be expected, automation is a key mechanism by which Grafana (and most other SaaS providers) reduces the strain on engineers. And this is where AI and ML play a key role in the actual sampling of telemetry data.

“A lot of the sampling is automated, using machine learning. On the time series forecasting for traces RED metrics, for example, we’re using Chronos, which is Amazon’s forecasting model built on an LLM.”
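For readers curious what that looks like in practice, the sketch below shows zero-shot forecasting with Amazon’s Chronos model via the open-source chronos-forecasting package. This is an assumption about the tooling – Grafana’s integration is not public – and the request-rate history is invented; the point is that forecast quantiles from a model like this can serve as the expected band that anomalies are measured against.

```python
# Sketch: zero-shot time series forecasting with Amazon's Chronos
# (open-source chronos-forecasting package; an assumption about the tooling).
import torch
from chronos import ChronosPipeline

pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-small",
    device_map="cpu",
    torch_dtype=torch.float32,
)

# Hypothetical request-rate history for one service, one point per minute.
history = torch.tensor([120.0, 118, 125, 130, 128, 131, 127, 133, 129, 135] * 6)

# Sample possible futures for the next 15 minutes: shape [1, num_samples, 15].
forecast = pipeline.predict(context=history, prediction_length=15)

low, median, high = torch.quantile(
    forecast[0], torch.tensor([0.1, 0.5, 0.9]), dim=0
)
print(median)  # observed values falling outside [low, high] would be treated as anomalous
```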

Additionally, Grafana are automating cross-functional processes, making the whole experience of incident investigation feel a lot more seamless for the user.

For example, when an anomaly is detected, the percentage of traces being tracked is automatically increased, and a workflow is automatically triggered in Pyroscope (the workspace for profiling data) in order to establish an adaptive profile.

Nevertheless, Grafana remains mindful that over-automation can lead to a lack of control over data workflows, which can actually create more work in the longer term. For this reason, they have added several user controls to the suite, allowing users to manually customise the sampling.

“We’ve been adding more levers and controls so that our customers can really fine tune how all this stuff works in their organisation. We also have segments, where you can basically corner off parts of your organisations via labels and customise the sampling, i.e. not to touch the data at all, or be really aggressive with it.”


