Evan Schwartz

autometrics-rs 0.4: Spot commits that introduce errors or slow down your application

Originally published on the Fiberplane Blog

Autometrics Logo

Autometrics-rs v0.4 introduces a feature I am especially excited about: it enables you to easily identify the commit or version of your code that introduced errors or added latency. Many issues in production are caused by new code being rolled out, and the first question many developers or SREs investigating an incident will ask is “what changed recently?” This post shows how combining function-level metrics with build information makes debugging easier, and it explains the Prometheus tricks we used to make it work.

Autometrics is an open source observability framework that helps you understand the health and performance of your system in production. It does this by making it trivial to track useful metrics for the functions in your code and by automatically writing Prometheus queries to help you understand the generated data. The version and commit feature introduced here builds on this core principle to help you get even more actionable information out of the data without any effort on your part.

use autometrics::autometrics;

#[autometrics]
pub async fn create_user() {
  // Now I have metrics!
}

Identifying problematic versions with Autometrics

autometrics-rs 0.4 attaches the version and commit as labels in the queries and graphs it generates. This makes problematic versions stick out immediately.

The example below illustrates the query and chart for an imaginary release that suddenly raised the error rate from around 2% to around 60%:

Prometheus graph
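
The query behind a chart like this is, in spirit, an error ratio grouped by version and commit. Below is a simplified sketch rather than the exact generated query: it assumes a result label for counting errors, and in the real query the version and commit labels are joined in from a separate build_info metric, as explained later in this post.

sum by (function, module, version, commit) (
    rate(function_calls_count{function="create_user", result="error"}[5m])
)
/
sum by (function, module, version, commit) (
    rate(function_calls_count{function="create_user"}[5m])
)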

Looking at a chart like this, we would immediately be able to see that the error rate spike was correlated with the release of a specific new version. To investigate the issue further, we can go look at how this particular function changed in that release. This is an extremely powerful debugging capability that few teams have at their fingertips today.

The queries and charts produced by Autometrics now include this information by default, so everyone can benefit from these types of insights without needing an in-house PromQL expert. In the future, we may also build alerts that watch for regressions at the function level to proactively warn you about potential issues.

But what about label cardinality? 🙃

Now, how does this feature work?

Metrics systems like Prometheus don’t perform well when the time series they store have labels with too much variability. This is because time series are indexed by the combination of all of their label keys and values, so highly variable labels require lots of additional storage. If you’re interested in more details on this problem and how Prometheus stores metrics, I would recommend reading this deep dive post from Cloudflare.
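
To make that concrete, imagine attaching the commit directly to the request counter itself. Every deploy would then mint a brand-new series for every existing combination of labels (the module name and commit hashes below are made up):

function_calls_count{function="create_user", module="api", commit="a1b2c3d"}
function_calls_count{function="create_user", module="api", commit="e4f5a6b"}
function_calls_count{function="create_user", module="api", commit="9d8c7b6"}

Multiply that by every function and module in your application, and the number of series Prometheus has to index keeps growing with every release.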

Thankfully, we were able to make use of the version and commit information without adding additional labels to our main metrics. Let’s look at how this works.

build_info as a separate metric

In his 2016 post Exposing the software version to Prometheus, Brian Brazil lays out why it is better to track your software’s version information in a separate metric than to attach those details as additional labels on your other metrics.

The crux of the idea is to have a metric called something like build_info that has labels for version, commit, and other such details. The value of this metric always stays at 1, so storing it adds very little overhead. Then, when querying your data, you join the metric you actually care about with build_info to effectively attach the software version labels to the real metric in the chart.
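
Concretely, the exposed metric looks something like this (the label values here are illustrative):

build_info{version="1.3.0", commit="e4f5a6b", branch="main"} 1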

autometrics-rs 0.4 adds the build_info metric automatically. By default it uses the version provided by Cargo, and it detects the environment variables set by the vergen crate to collect and attach the git commit and branch.
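
If you want the commit and branch labels populated, one way to wire that up is a small build script using vergen. Here is a minimal sketch assuming vergen 8’s EmitBuilder API with its git features enabled (check the autometrics and vergen docs for the exact setup for your versions):

// build.rs -- a sketch assuming vergen 8 with, e.g., in Cargo.toml:
//   [build-dependencies]
//   vergen = { version = "8", features = ["git", "gitcl"] }
use vergen::EmitBuilder;

fn main() {
    // Emits VERGEN_GIT_SHA and VERGEN_GIT_BRANCH at compile time,
    // which autometrics reads to fill in the commit and branch labels.
    EmitBuilder::builder()
        .git_sha(true) // short commit SHA
        .git_branch()
        .emit()
        .expect("failed to generate git build information");
}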

Writing more complex – and useful – queries for you

The whole point of Autometrics is to make it trivial to collect useful metrics and to help you actually understand and use the data.

autometrics-rs 0.4 includes the build_info metric in the queries it generates for you, so you can immediately make use of the information to spot problematic versions or commits without needing to be a PromQL expert.

If you’re interested in the nitty-gritty details of how this works, read on. Otherwise, you can skip to the conclusion and get started with Autometrics right away!

Joining build_info with other metrics in queries

Well, you asked for the details. So, here are the details.

Prior to this version, Autometrics could generate queries like the following one to help you understand the request rate of a particular function:

sum by (function, module) (
    rate(function_calls_count{function="my_function"}[5m])
)

(If you are interested in learning more about the specific metrics Autometrics uses, take a look at The Case for Function-Level Metrics.)

Now, the query looks like this:

sum by (function, module, version, commit) (
    rate(function_calls_count{function="my_function"}[5m])
    * on (instance, job) group_left(version, commit) last_over_time(build_info[1s])
)

This query adds the clause * on (instance, job) group_left(version, commit) last_over_time(build_info[1s]), which is based on the blog post from Brian Brazil mentioned above.

How exactly does this work? To be honest, even after writing the query and verifying that it worked as expected, I needed to use PromLens to understand what it’s actually doing:

  1. The * means we’re multiplying the time series on the left side by the complex mess on the right side. In our case, this means multiplying the value we’re calculating from function_calls_count by build_info. As we mentioned above, build_info always has a value of 1 so this calculation won’t change the numerical values.

  2. on (instance, job) tells Prometheus how we want to match the values in the function_calls_count series with those in the build_info series. The instance and job labels are added automatically by Prometheus when it scrapes a service, so we are using those to match the function-level metrics with the build information coming from a specific service.

  3. group_left(version, commit) means that for every data point in the time series on the left, we should add the version and commit labels coming from the series on the right. Here, this means adding the version and commit labels from build_info to the time series computed from function_calls_count.

  4. last_over_time(build_info[1s]) addresses an issue we found with Brian Brazil’s original approach. When a new version is deployed, the application will only expose the new version and commit info. However, Prometheus holds onto time series – even those that did not appear in the most recent scrape – for a few minutes and includes them in query responses. This results in two sets of build_info labels being returned. That causes the group_left to fail because Prometheus doesn’t know whether to apply the new or old set of labels. Using the last_over_time function ensures that no data is returned for the old version after that time series has stopped appearing in the scrapes.

  5. Finally, the sum by (function, module, version, commit) at the beginning means that Prometheus keeps those four labels as separate dimensions of the final time series. Any other labels (including instance and job) that are not listed in the sum by clause are summed over and dropped from the result.
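
Putting it all together, the result of the full query is a set of series keyed by those four labels, roughly like this (values and label contents are illustrative):

{function="my_function", module="api", version="1.2.0", commit="a1b2c3d"}  0.42
{function="my_function", module="api", version="1.3.0", commit="e4f5a6b"}  0.45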

Making build_info optional in cross-implementation dashboards

One final bit of fun is how we make the build_info metric optional for the Grafana dashboards, which work for all projects instrumented with one of the implementations of Autometrics.

Instead of using build_info directly, we use (build_info or on (instance, job) up). This is based on another excellent idea from Brian Brazil in Existential issues with metrics.

If the build_info metric is present (meaning the feature has been implemented by the Autometrics library you are using), the query uses it. Otherwise, it falls back to the up metric that Prometheus includes automatically. The up metric has the instance and job labels, so it can be grouped with either build_info or function_calls_count. But up does not have the version or commit labels, so those will simply be left off the final results.
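
With that substitution in place, the join in the generated query looks roughly like this (a sketch rather than the exact dashboard query):

sum by (function, module, version, commit) (
    rate(function_calls_count{function="my_function"}[5m])
    * on (instance, job) group_left(version, commit)
      (last_over_time(build_info[1s]) or on (instance, job) up)
)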

Conclusion

Implementing this feature strengthened my feeling that Prometheus is a powerful tool for understanding production systems, but useful queries are hard to write by hand.

Autometrics aims to improve the developer experience of using Prometheus metrics by making it trivial to instrument your code and by writing queries for you. The latest version makes it easy to spot when a specific version introduces errors or increases latency, which should go a long way toward helping you debug problems that come from code changes. It is also an example of the power that comes from combining code instrumentation with automatically written queries that answer human-level questions about your data.

Get involved

This feature was actually suggested by an audience member at a presentation I gave about Autometrics to the Berlin Rust and Tell Meetup group. More specifically, he asked whether this was something we already supported; the answer was no, but I was immediately excited to go add it.

If you have other feature ideas or want to get involved in the Autometrics project, come join us on Discord or contribute to the GitHub Discussions. The project is still very new, so we would love your input on anything, from features we could add to the developer experience as it stands today.

Autometrics is on GitHub and crates.io, and you can use it to add the most useful metrics to your Rust (and TypeScript/Go/Python) code in just a few minutes. Note that not all of the implementations have the build_info feature yet, but it’ll be coming to them soon.

#autometrics #fiberplane #observability #rust