
Do you ever find yourself struggling to get to the data that matters? In the modern world, information is more abundant than ever before.
It's no different for organisations dealing with observability data. Today, data overload is becoming a common issue simply because of the sheer volume of data being ingested.
Grafana Labs, a leader in the observability space, has come up with an effective way to tackle this problem: Adaptive Telemetry, which reduces the volume of data that needs to be stored (on average by 40%), all without compromising the stability of the organisation or its operations. Customers see tremendous cost benefits and, importantly, can still solve key business problems with their observability data.
Currently, the Adaptive Telemetry suite is included in Grafana Cloud, the company's full-stack observability platform, and is also offered as part of BYOC deployments.
At Grafana's ObservabilityCON last week in London, the company announced that one of the newer additions to the suite, Adaptive Traces, is now generally available, while their newest addition, Adaptive Profiles, is in private preview.
To understand more about how Adaptive Telemetry works, and how it can help organisations combat data overload to optimize costs and extract signal from noise, we spoke with Sean Porter, Distinguished Engineer at Grafana Labs and founder of TailCtrl, a company specialising in adaptive trace sampling that has since been acquired by Grafana.
Read on to find out more about how Adaptive Telemetry works, how it is making engineers' lives easier, and what its co-evolution with AI means for the future.
How does Adaptive Telemetry work?
Adaptive Telemetry is essentially a smart form of data reduction. It uses machine learning, in combination with human inputs and best practices drawn from industry expertise, to intelligently capture valuable data and discard the rest.
Given that there are many types of data, Grafana's Adaptive Telemetry suite has been developed to provide unique sampling solutions for the different data signals: logs, metrics, traces, and profiles.
The behaviour of each type of signal and what it represents within the organisation determines the approach taken towards optimizing it. These approaches reflect the different priorities and applications of each signal.
For example, most organisations produce many partially or entirely unused metrics, so usage-based recommendations to aggregate away unused labels are a highly successful approach for that signal.
For traces, meanwhile, an effective approach is tail sampling: sampling based on key attributes like status, combined with intelligently sampling anomalous telemetry.
"In our Adaptive Telemetry Suite, we essentially have one solution per signal, because they're very tailored to the type of signal, their behaviour. We couldn't make just one generic adaptive solution that works for all these different signal types. And that gives us a much narrower, more constrained focus of building these feature sets around the behaviours of each signal." ~ Sean Porter, Distinguished Engineer at Grafana Labs
As Porter points out, the development of specific approaches towards each type of signal is particularly important given that the data sent to Grafana Labs to be sampled doesn't come with any context that could provide other clues to its importance or utility.
"Somebody else is sending us the data, so we don't know ahead of time the context of the data (which would make it a lot easier). So we have to develop approaches to achieve the results we need without knowing the nature of the data."
So far, Grafanaโs approach has had outstanding results across a variety of use-cases, even when this data comes from very different systems and industries.
Nevertheless, Adaptive Telemetry still selectively leverages user inputs and rules, so that the humans who know their organisational constraints best can steer it or place guardrails on it.
"Until somebody says this doesn't work for us, we won't know. But so far, our approaches work across everything, from regular system metrics, to application metrics, to even IoT (Internet of Things)."
Below, we consider the four different signals that can be sampled in Grafana's Adaptive Telemetry suite, and the key differences between them.
Adaptive Logs
Logs are a traditional form of data, used as far back as early civilisations by sailors to keep a record of the events of each day. Within software, they simply refer to natural-language records of every event that happens within a given system.
As you can imagine, this means that the volume of logs is vast, even for simple applications. Furthermore, since logs are written in natural language, it takes a long time to manually analyse them to identify optimization opportunities.
"For logs, we use the DRAIN algorithm to determine patterns in people's logs. This allows us to identify ones which look to be of low value, so that they can drop a certain percentage of ones that are basically just noise."
While DRAIN is used in other parts of the Grafana platform, within Adaptive Logs it is tailored specifically to longer-running patterns, making it well suited to spotting repeated, low-value log patterns that can be reduced via sampling.
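DRAIN is also available as an open-source implementation (the drain3 Python package), and the sketch below illustrates the general idea: mine log templates, then probabilistically drop lines that belong to patterns flagged as low value. The flagged cluster ID and the 30% keep rate are made-up assumptions for illustration, not Grafana's actual configuration.

```python
import random
from drain3 import TemplateMiner  # open-source implementation of DRAIN

# Illustrative sketch only: mine log templates with DRAIN, then sample away
# lines belonging to patterns that have been flagged as low value.
miner = TemplateMiner()

# Hypothetical: in a real deployment these IDs would come from usage analysis
# or an operator, not be hard-coded.
LOW_VALUE_CLUSTERS = {1}
KEEP_RATE = 0.3  # keep ~30% of matching lines, drop the rest

def should_keep(log_line: str) -> bool:
    """Return True if the line should be stored, False if it is sampled away."""
    result = miner.add_log_message(log_line)
    if result["cluster_id"] in LOW_VALUE_CLUSTERS:
        return random.random() < KEEP_RATE
    return True  # unknown or high-value patterns are always kept

for line in [
    "connection established to 10.0.0.1",
    "connection established to 10.0.0.2",
    "payment failed for order 1234",
]:
    print(should_keep(line), line)
```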
Currently, this algorithm enables a drop rate of 20–30% on logs. While this rate may seem conservative, the caution is in fact for the user's benefit, given that once the logs are dropped, they are permanently lost.
"With logs, we just drop the unnecessary data, so it's fully gone. This sometimes scares people, so we are working on some interesting stuff in this space where we can archive the data in low-cost storage rather than dropping it completely."
Adaptive Metrics
Metrics are measurements of some numeric value for a given object. They have more utility than logs when it comes to statistical analysis, and are typically used to track the performance of a given operation, analyse a specific type of data, or assess conditions across various fields and industries.
In this sense, they are a more โlivingโ form of data than logs, and can provide real-time insights into current dynamics of a system.
They are aggregated – grouped into smaller footprint versions of themselves – according to their usage. This is done in such a way that all of your existing operations – metrics used in dashboards, alerts, and ad hoc queries – are preserved with full fidelity.
"Adaptive metrics are usage-based, so looking at whether the metrics are being used in dashboards and alerts, or ad-hoc queries and incident investigations."
This approach has resulted in customers reducing their metrics footprint by 50% or more without compromising their effectiveness. Many customers are also taking advantage of a preview feature that continuously and automatically applies the latest and greatest set of optimizations, taking human review out of the loop entirely.
The metrics that are not captured are not fully dropped; instead they are aggregated, which reduces their volume significantly, but still keeps the option open for users to disaggregate them easily if and when needed.
"In metrics, we don't like to drop data. Instead we aggregate it. So we go from tens of thousands of series to one or two. So that means that our customers still have the data there to disaggregate if needed."
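As a rough illustration of what aggregating away unused labels looks like, the sketch below collapses a set of per-pod series into per-service totals by summing over labels that usage analysis says nobody queries. The metric, labels, and USED_LABELS set are hypothetical; Adaptive Metrics derives this kind of rule automatically from real usage data.

```python
from collections import defaultdict

# Toy per-series samples: (labels, value). In reality this would be a
# high-cardinality metric with thousands of label combinations.
series = [
    ({"service": "checkout", "pod": "checkout-7f9c", "instance": "10.0.0.4"}, 17.0),
    ({"service": "checkout", "pod": "checkout-2b1d", "instance": "10.0.0.9"}, 12.0),
    ({"service": "cart",     "pod": "cart-55aa",     "instance": "10.0.1.2"}, 30.0),
]

# Hypothetical usage analysis: only `service` appears in dashboards, alerts,
# and ad hoc queries, so `pod` and `instance` can be aggregated away.
USED_LABELS = {"service"}

def aggregate(series):
    """Sum samples over every label that is not actually used."""
    out = defaultdict(float)
    for labels, value in series:
        key = tuple(sorted((k, v) for k, v in labels.items() if k in USED_LABELS))
        out[key] += value
    return dict(out)

print(aggregate(series))
# {(('service', 'checkout'),): 29.0, (('service', 'cart'),): 30.0}
```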
Currently, disaggregation is done manually, but this could change in the near future.
"Right now if somebody keeps querying, and they keep getting an aggregated set, either they can hit a button to disaggregate the data directly, or maybe it's clear to us that they need it, and so we'll disaggregate it for them. But we're experimenting with capabilities to do this in real-time."
Adaptive Traces
Traces are a more comprehensive form of data than metrics. As an event occurs and travels through the system, it leaves a trace, i.e. an internal record of the event's journey through the system from start to finish.
This record consists of spans, which are accounts of single events, essentially structured logs. Spans call other spans in what is best described as a parent-child relationship.
At the beginning of a trace, there is a root span, i.e. the one that is created initially. This span then calls its "child" span, which in turn then calls its "child" span, and so on until you reach the end of the trace, where you have what are known as "leaf spans" because they don't call anything else.
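To make that structure concrete, here is a minimal, hypothetical model of a trace as a tree of spans, with a root at the top and leaf spans at the bottom. The class and field names are illustrative, not an OpenTelemetry or Grafana schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Span:
    name: str
    parent: Optional["Span"] = None
    children: List["Span"] = field(default_factory=list)

    def child(self, name: str) -> "Span":
        """Create a child span, mirroring one operation calling another."""
        span = Span(name=name, parent=self)
        self.children.append(span)
        return span

    @property
    def is_root(self) -> bool:
        return self.parent is None

    @property
    def is_leaf(self) -> bool:
        return not self.children

# A request enters the system (root span), fans out to children,
# and bottoms out in leaf spans that call nothing else.
root = Span("GET /checkout")
auth = root.child("auth.verify_token")
db = root.child("db.load_cart")
db.child("postgres.query")  # leaf span

print(root.is_root, auth.is_leaf, db.is_leaf)  # True True False
```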
A single trace, then, by itself incorporates quite a lot of data, which is why Adaptive Telemetry is so critical for this signal.
"There's tens of thousands of traces being produced every second for an application; that's why there's so much data, and why Adaptive Traces is really important."
When it comes to the sampling approach, spans play a key role, essentially forming the basis for anomaly detection, which then informs the sampling policy for traces.
"For anomaly detection, we generate time series for the spans, called span metrics. Then we create RED (rate, errors, duration) metrics for each of those spans, and use forecasting models on those metrics. When there are anomalies in the RED metrics, we change our sampling policy of what traces we collect."
In contrast to logs and metrics, the sampling approach taken to traces is an indirect one, which mirrors the more complex composition of trace data itself.
"We don't do anomaly detection yet on the traces themselves, instead we're doing it on the time series that we're generating from them."
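The loop Porter describes, generating RED series from spans and adjusting the sampling policy when they look anomalous, could be sketched as follows. A simple rolling mean and standard deviation stands in for the real forecasting model, and the thresholds and sampling rates are assumptions for illustration only.

```python
import statistics

def is_anomalous(history, latest, z_threshold=3.0):
    """Flag the latest value if it deviates strongly from recent history.
    A deliberately simple stand-in for a forecasting model on RED span metrics."""
    if len(history) < 10:
        return False
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history) or 1e-9
    return abs(latest - mean) / stdev > z_threshold

def choose_sampling_rate(error_rate_history, latest_error_rate,
                         baseline=0.01, boosted=0.5):
    """Keep only ~1% of traces normally, but boost sampling when the
    error-rate series for a span looks anomalous (rates are illustrative)."""
    if is_anomalous(error_rate_history, latest_error_rate):
        return boosted
    return baseline

history = [0.02, 0.021, 0.019, 0.022, 0.018] * 6  # steady ~2% error rate
print(choose_sampling_rate(history, 0.021))  # 0.01 (normal behaviour)
print(choose_sampling_rate(history, 0.35))   # 0.5  (anomaly -> sample more)
```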
Similarly, in contrast to the other signals where reduction in data volume is the key goal, Adaptive Traces is really aimed at increasing the amount of useful trace data. This is important given that most organisations drop almost all of their traces. Thus, Adaptive Traces is actually a mechanism to introduce more observability into an organisation's data flow.
"9 out of 10 organisations drop 99% of their traces already, just because of the sheer volume. They use head sampling, which is just making a decision of what to keep from the very beginning of a request, so from the entry point where you have no idea of what's going on. What we've created with Adaptive Traces relaxes that small window, to admit and produce a lot more traces. Then we do tail sampling to be more intelligent about what we keep, using policies that align with the users' goals."
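To contrast the two approaches, here is a minimal, hypothetical tail-sampling decision. Because it runs after the trace has completed, it can look at outcomes such as errors and overall latency, which head sampling never sees; the specific rules and rates are illustrative only.

```python
import random

def tail_sample(trace, slow_ms=500, baseline_rate=0.05):
    """Decide whether to keep a completed trace.
    Unlike head sampling, this sees the finished trace, so it can keep
    everything that errored or was slow and only sample the routine rest."""
    if any(span["status"] == "error" for span in trace["spans"]):
        return True
    if trace["duration_ms"] > slow_ms:
        return True
    return random.random() < baseline_rate

trace = {
    "duration_ms": 1200,
    "spans": [
        {"name": "GET /checkout", "status": "ok"},
        {"name": "db.load_cart", "status": "ok"},
    ],
}
print(tail_sample(trace))  # True: kept because the whole trace was slow
```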

Adaptive Profiles
Profiles are a specialised form of data that enables you to look at the resource usage of a given application. You can do this either by instrumenting your application with some code, or by running an agent on the system via eBPF that can see how much CPU or memory is being used for a given operation at high resolution (100 times per second).
Thus, profiling data is a continuous stream of measurements of how many resources were used by the various functions of an application.
Most importantly, profiles allow you to associate a performance issue of an application with a specific line of code that has triggered the issue, even indirectly.
The sampling approach taken towards profiles is anomaly-based, i.e. it relates to when a profile changes its usual shape.
"With profiles, we're experimenting to see if we can generate metrics like we do from traces, but one of the things we use at the moment for profiles is isolation forests as a way of observing a profile over time, and seeing if its shape changes or shifts."
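One way this could look in practice is sketched below, using scikit-learn's IsolationForest on vectors describing a profile's shape (the share of CPU time per function). The feature layout, data, and contamination setting are assumptions for illustration, not Grafana's implementation.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Each row is one observation of a profile's "shape": the share of CPU time
# spent in a fixed set of functions (hypothetical feature layout).
normal_profiles = rng.dirichlet([8, 4, 2, 1], size=200)  # stable shape over time
shifted_profile = np.array([[0.1, 0.1, 0.1, 0.7]])       # one function suddenly dominates

model = IsolationForest(contamination=0.01, random_state=0)
model.fit(normal_profiles)

# predict() returns +1 for inliers and -1 for outliers.
print(model.predict(normal_profiles[:3]))  # mostly [1 1 1]
print(model.predict(shifted_profile))      # [-1] -> shape shift flagged
```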
Often, these shifts (especially spikes) relate to an engineer deciding to release or capture more data, which then uses up more processing power and memory. The power of profiles lies in their diagnostic ability to link these events to system performance, right down to a single line of code.
"For a CPU profile, for example, we would be looking at how much CPU time a given application uses for a set of given operations. Then we would observe this over time to see if and when this shifts, why, and whether it was linked to a release that somebody made that is then causing the performance regression. And through profiles we can literally point to the line of code that triggered the shift."
Adaptive Profiles helps customers retain only the data resolution needed to extract these optimization insights.
How Grafana is making engineers' lives easier with its Adaptive Telemetry suite
The recent developments in Grafana's Adaptive Telemetry suite are increasingly focused on the goal of making engineers' lives easier. While the evolution of AI and automation has theoretically taken some of the strain of digital workloads off engineers, the tangible benefit of this is yet to be felt in many industries. Primarily, this is because most adopters of these technologies are in it for the cost benefit more than anything else.
This was true for Grafana too when the company started the suite with just logs and metrics. But the newer releases, Adaptive Traces and Adaptive Profiles, represent a shift in Grafana's priorities from cost management to user experience.
"When we did adaptive metrics and logs, cost reduction was the primary value. But adaptive traces is most importantly about reducing noise and providing clarity to the developer and user of the trace, essentially reducing toil for them because there's less stuff to wade through to find the actual value. Cost management is always gonna be a big part of it, but I think we'll see that as time goes on, only the minor value will be in the cost management side."
As is to be expected, automation is a key mechanism by which Grafana (and most other SaaS providers) reduces the strain on engineers. And this is where AI and ML play a key role in the actual sampling of telemetry data.
"A lot of the sampling is automated, using machine learning. On the time series forecasting for trace RED metrics, for example, we're using Chronos, which is Amazon's forecasting model built on an LLM."
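For reference, the sketch below shows roughly how a Chronos forecast could be compared with an observed RED value to decide whether to boost sampling. It assumes the open-source chronos-forecasting package's ChronosPipeline API and uses made-up data; it is not Grafana's production pipeline.

```python
import numpy as np
import torch
from chronos import ChronosPipeline  # pip install chronos-forecasting

# Load a small pretrained Chronos checkpoint (assumed model name).
pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-small",
    device_map="cpu",
    torch_dtype=torch.float32,
)

# Made-up request-rate history for one span (requests/second, one point per minute).
history = torch.tensor(100 + 10 * np.sin(np.arange(120) / 10), dtype=torch.float32)

# Forecast the next 12 points; output shape is [series, samples, horizon].
forecast = pipeline.predict(context=history, prediction_length=12)
low, high = np.quantile(forecast[0].numpy(), [0.05, 0.95], axis=0)

observed_next = 260.0  # newly observed value for the next interval
if not (low[0] <= observed_next <= high[0]):
    print("RED metric outside forecast band -> boost trace sampling")
```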
Additionally, Grafana is automating cross-functional processes, making the whole experience of incident investigation feel a lot more seamless for the user.
For example, when an anomaly is detected, the percentage of traces being tracked is automatically increased, and a workflow is automatically triggered in Pyroscope (the workspace for profiling data) in order to establish an adaptive profile.
Nevertheless, Grafana remains mindful that over-automation can lead to a lack of control over data workflows, which can actually create more work in the longer term. For this reason, they have added several user controls to the suite, allowing users to manually customise the sampling.
"We've been adding more levers and controls so that our customers can really fine tune how all this stuff works in their organisation. We also have segments, where you can basically corner off parts of your organisations via labels and customise the sampling, i.e. not to touch the data at all, or be really aggressive with it."

