Evan Schwartz

An adventure with SLOs, generic Prometheus alerting rules, and complex PromQL queries

Originally published on the Fiberplane Blog


Autometrics is a set of open source libraries that make it easy to instrument code with the most useful metrics and help developers understand metrics data by automatically writing Prometheus queries. Autometrics inserts links to live Prometheus charts directly into each instrumented function’s doc comments. We are also working on providing a Grafana dashboard and Prometheus alerting rules for autometrics projects.

This blog post is about the journey we went on to craft a good developer experience around Prometheus alerts for autometrics-instrumented projects.

Background on SLOs

Before proceeding, it’s worth giving a tiny bit of background on Service-Level Objectives (SLOs).

A naive approach to monitoring and alerting fires an alert as soon as a certain error threshold is reached. However, this doesn’t account for a momentary spike that has already subsided, or for a slower-burning problem that never crosses the predefined threshold but creates a bad experience for a growing set of users.

With SLOs, you define an objective like “99.9% of API requests should return a successful result”. This, in turn, determines the “error budget”, which in this case would allow 0.1% of requests to fail.

When building alerts for SLOs, you end up with multiple alert definitions that cover different cases, like sudden error spikes and slow-burning issues. You can read more about alerting with SLOs in the Google SRE Workbook, and Sloth is a handy tool for generating alerting rules from simple SLO definitions.
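To give a flavor of what those alerts look like, here is a sketch of a multiwindow, multi-burn-rate alert expression in the style the SRE Workbook describes. The recording rule names and job label are illustrative placeholders, and 14.4 is the workbook's suggested burn-rate factor for a fast-burn alert on a 99.9% objective:

(
  job:slo_errors_per_request:ratio_rate1h{job="api"} > (14.4 * 0.001)
and
  job:slo_errors_per_request:ratio_rate5m{job="api"} > (14.4 * 0.001)
)

Checking both a long (1 hour) and a short (5 minute) window catches fast burns quickly while ignoring spikes that have already recovered; slower burns get their own rules with longer windows and lower burn-rate factors.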

For autometrics, we wanted to leverage these best practices while making the experience even easier for developers than defining SLOs by hand in YAML.

Version 1: Generating Alerts from Code

Autometrics instruments code at the function level, so our first approach was to add another small annotation in the source code to create SLOs for specific functions. Then, the library provided a function that could be used to generate the Prometheus alerting rules.

While this approach worked, the developer experience was not great: you had to add a separate subcommand to your app, run it to generate the rules, and then configure Prometheus with them.

Could we create a set of Prometheus alerting rules that would work for any autometrics-instrumented project? Autometrics creates metrics using a specific naming and labeling scheme so we might be able to use (or abuse) labels to make this work…

Version 2: Generic Alerting Rules

The principle behind generic alerting rules is to standardize metric names and labels so that a single set of alerting rules works for any autometrics project.

In order to make this work, we need a way to use labels to specify:

  1. which function’s or group of functions’ metrics belong to a certain SLO

  2. whether the SLO relates to the success rate or latency of a function

  3. the objective (95%, 99%, 99.9%, etc.)

  4. for latency objectives, the target latency (250ms, 500ms, etc.)

An important simplifying principle is that we do not need this to work for absolutely every project. We’re just aiming for a 60-80% solution that would work in most cases. For more specialized users or use cases, there’s always the option of writing the rules by hand.

1. Identifying which functions belong to which SLO

In the first version of autometrics alerts, we generated PromQL queries for each SLO-relevant function using the function’s name (which is attached as a label to autometrics-produced metrics).

To do this in a generic way, we can simply identify the SLO by a user-specified objective_name. Then, the recording rules (which precompute parts of the queries, somewhat analogous to SQL indices) group the metrics by that SLO name.
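As a rough sketch (not the literal rule text we ship), a recording rule for success-rate SLOs could precompute a per-objective request rate like this, using the autometrics metric and label names that appear elsewhere in this post:

# request rate per SLO, grouped by the user-specified objective name
sum by (objective_name, objective_percentile) (
  rate(function_calls_count{objective_name!=""}[5m])
)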

2. Success rate or latency

This one is also easy because the two types of SLOs apply to two different metrics. Autometrics tracks the success rate with a counter called function.calls.count, which has a result label that is either ok or error, and it tracks latency with a histogram called function.calls.duration.
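In Prometheus’ flattened naming, those become two separate metric families, so the success-rate and latency rules simply select different series. The label values below are illustrative:

# success-rate SLOs read the counter
function_calls_count{function="get_user_handler", result="error"}

# latency SLOs read the histogram buckets
function_calls_duration_bucket{function="get_user_handler", le="0.25"}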

3. Predefined objectives

This is another area where a simplification comes in handy. We assume that most objective percentages will be 90%, 95%, 99%, or 99.9%. (If you need others, you can use the autometrics CLI to generate an SLO file, which you can then feed to Sloth to regenerate the Prometheus rules file.)

With this simplification, we can create rules for each of these objectives and use a label to identify which one applies (e.g. objective_percentile="99"). This means we will have a number of recording rules that simply return empty results for any objective not in use. As soon as metrics are produced with the objective_percentile label, the matching recording and alerting rules spring into action.
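To sketch the idea (again, not necessarily the exact published rule text), the rules for the 99% objective filter on that label value, so they match nothing until an application actually emits it:

# error ratio for 99% success-rate objectives; empty unless some function
# is annotated with objective_percentile="99"
sum by (objective_name) (
  rate(function_calls_count{objective_percentile="99", result="error"}[5m])
)
/
sum by (objective_name) (
  rate(function_calls_count{objective_percentile="99"}[5m])
)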

4. Specifying the target latency

This proved to be the trickiest one to achieve.

Background on Prometheus histograms

Before getting to the solution, it’s important to give a bit of background on Prometheus histograms, which is the type of metric used to track latency.

Most Prometheus client libraries require you to specify the “buckets” for a histogram. Instead of recording every individual value, the histogram keeps a counter of how many events fell at or below each bucket’s cutoff point. This is denoted using the le label, short for “less than or equal to”. The label is added automatically by the Prometheus client libraries that autometrics builds on.
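For example, a histogram configured with buckets at 0.1, 0.25, and 1 second (illustrative values) exposes cumulative counters like these, where each series counts every observation at or below its le value:

function_calls_duration_bucket{function="get_user_handler", le="0.1"}    87
function_calls_duration_bucket{function="get_user_handler", le="0.25"}   95
function_calls_duration_bucket{function="get_user_handler", le="1"}      99
function_calls_duration_bucket{function="get_user_handler", le="+Inf"}  102
function_calls_duration_count{function="get_user_handler"}              102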

Calculating the number of slow requests

As mentioned above, Prometheus client libraries use pre-configured buckets with the upper limit denoted by the le label to track latencies. Therefore, the number of requests that take longer than a given threshold is the total minus the number of requests whose latency is less than or equal to the threshold.
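If we knew the bucket boundary up front, the query would be simple. A sketch, assuming a 250ms target that happens to match a configured bucket:

# requests slower than 250ms = all requests - requests at or under 250ms
sum(rate(function_calls_duration_count[5m]))
-
sum(rate(function_calls_duration_bucket{le="0.25"}[5m]))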

The problem is that different projects will have different histogram buckets configured. Autometrics libraries provide optional default settings, but you may have a more specific idea of the latency buckets you want to track. This means that we cannot make assumptions about the values of the le label that will be present on your time series.

Thus, we needed a way to let you specify which of your bucket thresholds is the target for the SLO, without us knowing the possible bucket cutoff points in advance. The answer involves creating another label, objective_latency_threshold, but comparing two separate labels in Prometheus is not a trivial task.

Checking for Prometheus label equality

How hard could it be to check if two labels are the same?

Attempt 1: Built-in function?

PromQL has only two built-in functions for manipulating time series labels: label_join and label_replace. At first glance, neither of these seems sufficient for checking whether two labels are equal. Victoria Metrics’ MetricsQL, which is a superset of PromQL, has additional label manipulation functions, but it isn’t widely supported enough for us to use instead of stock PromQL.

Attempt 2: Regular Expressions?

Aha! PromQL supports regular expressions both in the label_replace function and in label filters. Maybe we could use a regex to check whether the objective_latency_threshold and le labels are equal? Alas, Prometheus uses RE2, which intentionally does not support lookahead and lookbehind assertions.

Attempt 3: Label renaming and set intersections

Third time’s the charm. It is indeed possible to filter time series based on two labels matching, using a set intersection and an overly complicated-looking PromQL query:

label_join(rate(function_calls_duration_bucket[5m]), "autometrics_check_label_equality", "", "objective_latency_threshold")
and
label_join(rate(function_calls_duration_bucket[5m]), "autometrics_check_label_equality", "", "le")

This query:

  1. Creates additional time series where the objective_latency_threshold label is copied over to another label called autometrics_check_label_equality using the label_join function

  2. Creates yet another series where the le label (the histogram bucket) is copied over to the label autometrics_check_label_equality (the same label name as before)

  3. Then it uses the and operator (set intersection) to return only the time series where all of the labels match. This has the effect of returning only the events where objective_latency_threshold is equal to the le label. (For a helpful illustration of how the various PromQL grouping methods work, take a look at this blog post on Prometheus Vector Matching.)

This gives us all of the events where the latency was less than or equal to the target latency specified in a label, independent of the buckets configured for the histogram. We can then subtract this count from the total number of events to determine the number of requests whose latencies exceed our target.
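Combining that intersection with the total count, a sketch of the slow-request calculation (not necessarily the exact expression in the published rules) looks like this:

# total requests minus requests at or under the target latency,
# where the target is read from the objective_latency_threshold label
sum by (objective_name, objective_percentile) (
  rate(function_calls_duration_count[5m])
)
-
sum by (objective_name, objective_percentile) (
  label_join(rate(function_calls_duration_bucket[5m]), "autometrics_check_label_equality", "", "objective_latency_threshold")
  and
  label_join(rate(function_calls_duration_bucket[5m]), "autometrics_check_label_equality", "", "le")
)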

Putting it all together: generic alerting rules and a simple API for custom SLOs

What do all these label shenanigans give us? A nice, simple, developer-first approach to alerts and SLOs.

The tricks detailed above enable us to create a single set of alerting rules that will work for any autometrics-instrumented project in any programming language. The various autometrics implementations can then “enable” recording and alerting rules by attaching specific labels to the metrics they create.
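For example, a latency SLO attached to a function would result in histogram series carrying labels along these lines (values illustrative, label names taken from the query above), which is all the generic rules need in order to pick the function up:

function_calls_duration_bucket{
  function="get_user_handler",
  objective_name="api",
  objective_percentile="99",
  objective_latency_threshold="0.25",
  le="0.25"
}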

You don’t need to hand-write any PromQL or YAML, and we don’t even need to generate alert definitions from your code.

This enables us to provide a very simple API for creating SLOs tailored to your use case. In Rust, it looks like:

use autometrics::{autometrics, objectives::*};

const API_SLO: Objective = Objective::new("api")
    .success_rate(ObjectivePercentile::P99_9);

#[autometrics(objective = API_SLO)]
pub fn create_user_handler() {
    // ...
}

#[autometrics(objective = API_SLO)]
pub fn get_user_handler() {
    // ...
}

You can see the full docs for creating SLOs in the Rust autometrics library here.

The above example would trigger an alert with the labels objective_name="api" and objective_percentile="99.9" if those instrumented API handlers return too many errors.


Conclusion

Autometrics is all about providing an approach to observability that makes it easy for application developers – not just dedicated SREs or DevOps engineers – to understand what code is doing in production. Hopefully you’ve been able to see in this blog post that we’ve put a lot of thought into enabling developers to leverage SLO and alerting best practices, without writing complex queries or loads of configuration files.

Want to add autometrics to your system or get involved in the project? Check out https://github.com/autometrics-dev.

Thanks to @hatchan for the idea that we could provide a single Grafana dashboard for all autometrics projects and to @sinkingpoint for the inspiration to think of creative ways to (ab)use Prometheus labels.

#autometrics #fiberplane #observability