Inside some complex Prometheus queries
Originally published on the Fiberplane Blog
Autometrics is an open source micro-framework for observability. It makes it trivial to instrument your code with the most useful metrics and then generates powerful Prometheus queries to help you identify and debug issues in production. This post goes into detail about how the queries it generates work.
Request rate
The first and simplest query Autometrics uses calculates the request rate for a given function.
```promql
sum by (function, module, commit, version) (
  rate({__name__=~"function_calls(_count)?(_total)?",function="my_function",module="my_module"}[5m])
  * on (instance, job) group_left (version, commit)
  last_over_time(build_info[1s])
)
```
Let’s break this query down piece by piece:
- **`sum by (function, module, commit, version)`**: this specifies that we want our results to contain the labels `function`, `module`, `commit`, and `version`, and that all other labels should be merged together. The `function` and `module` labels identify the function from our source code, while `version` and `commit` are properties of the whole binary.
- **`rate({...}[5m])`**: this calculates the request rate per second over the 5 minutes preceding a given point in time. This is necessary because Prometheus counters only ever go up. If we just summed the counter values over time, we would be counting earlier requests many times over.
- **`__name__=~"function_calls(_count)?(_total)?"`**: this complicated-looking label selector uses a regular expression to match metrics with one of the following names: `function_calls`, `function_calls_count`, `function_calls_total`, or `function_calls_count_total`. This is a temporary necessity because Prometheus/OpenMetrics and OpenTelemetry had incompatible naming conventions. That has since been sorted out, so we'll soon be able to replace this with just `function_calls_total`. Also note that the metric name is just a special label called `__name__`, so `function_calls_total{}` is equivalent to `{__name__="function_calls_total"}`.
- **`function="my_function",module="my_module"`**: this selects the time series that correspond to a particular function from your code.
- **`on (instance, job) group_left (version, commit)`**: this is Prometheus' equivalent of a left join from SQL. It uses the `instance` and `job` labels, which are automatically added by Prometheus, to join time series. It then adds the `version` and `commit` labels from the time series on the right to the matching series on the left.
- **`last_over_time(build_info[1s])`**: the `version` and `commit` labels from the previous part come from a special metric that Autometrics produces called `build_info`. We do this so that we can track your software's version and help you identify commits that introduced errors or latency. Importantly, however, we avoid increasing the cardinality of our main metrics, and improve storage efficiency, by tracking those details on a separate info metric and only joining them in at query time.
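The join is probably the least familiar part, so here is a rough Python sketch of what `on (instance, job) group_left (version, commit)` does. This illustrates the matching semantics only, not Prometheus internals, and the series data is made up:

```python
# Illustrative sketch (not Prometheus internals): how
# "* on (instance, job) group_left (version, commit)" copies labels from
# build_info onto each rate series sharing the same (instance, job) pair.

def group_left_join(left, right, on, group_left):
    """Attach the `group_left` labels from the matching right-hand series
    to every left-hand series, matching on the `on` labels."""
    # Index the right-hand side (build_info) by its (instance, job) values.
    index = {tuple(series[l] for l in on): series for series in right}
    joined = []
    for series in left:
        match = index.get(tuple(series[l] for l in on))
        if match is None:
            continue  # unmatched series drop out, as in PromQL
        joined.append({**series, **{l: match[l] for l in group_left}})
    return joined

rates = [{"instance": "10.0.0.1:8080", "job": "api", "function": "my_function"}]
build_info = [{"instance": "10.0.0.1:8080", "job": "api",
               "version": "1.2.0", "commit": "abc123"}]

print(group_left_join(rates, build_info,
                      on=("instance", "job"),
                      group_left=("version", "commit")))
# → one series, now carrying version="1.2.0" and commit="abc123"
```

Since `build_info` is a gauge whose value is always 1, the `*` multiplication in the real query leaves the rate values untouched; only the label join matters.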
That’s the request rate! Aren’t you glad you didn’t have to write that query by hand? 😉
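For intuition, here is roughly what `rate()` computes for a counter, sketched in Python. This is a simplification: the real Prometheus implementation also extrapolates to the window boundaries and corrects for counter resets, both of which this sketch skips. The sample data is made up:

```python
# Illustrative sketch of rate(...[5m]): the per-second increase of a
# counter over the window, based on the samples inside it.

def simple_rate(samples):
    """samples: list of (unix_timestamp, counter_value) inside the window."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# A counter scraped every 60s over a 5-minute window:
window = [(0, 100), (60, 130), (120, 160), (180, 190), (240, 220), (300, 250)]
print(simple_rate(window))  # → 0.5 requests per second
```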
Error rate
```promql
(
  sum by (function, module, commit, version) (
    rate(
      {__name__=~"function_calls(_count)?(_total)?",function="my_function",module="my_module",result="error"}[5m]
    )
    * on (instance, job) group_left (version, commit)
    last_over_time(build_info[1s])
  )
)
/
(
  sum by (function, module, commit, version) (
    rate({__name__=~"function_calls(_count)?(_total)?",function="my_function",module="my_module"}[5m])
    * on (instance, job) group_left (version, commit)
    last_over_time(build_info[1s])
  )
)
```
This query looks a lot more complicated, but it’s actually using many of the same elements from the request rate query.
Let’s break this down piece by piece as well:
- **`(...) / (...)`**: this is the error rate, so we're dividing the number of errors by the total number of requests. In fact, you can see that the expression in the second set of parentheses is exactly the same as the request rate we broke down above.
- **`result="error"`**: this is the only difference between the first expression and the second. This label indicates that the function call resulted in an error, as opposed to a successful result. Autometrics is able to determine this based on how each programming language handles errors. In Rust, it's based on functions returning the built-in `Result` enum. In Python, TypeScript, C#, and Java, it's based on whether a function threw an exception. And in Go, it's based on whether the function returned an error.
That gives us the error rate! It looks very complicated, but a lot of the magic comes from standardizing how we know whether a function errored, in a way that works across programming languages.
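The `result` label is the crux of the standardization. As a rough illustration of the idea (a hypothetical sketch, not the autometrics library's actual implementation; the counter and label values here are made up), a decorator can record each call as "ok" or "error" depending on whether the function raised:

```python
# Hypothetical sketch of result-label instrumentation: count each call
# with a result of "ok" or "error" based on whether the function raised.

import functools
from collections import Counter

FUNCTION_CALLS = Counter()  # stand-in for a real Prometheus counter

def instrumented(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            value = fn(*args, **kwargs)
        except Exception:
            FUNCTION_CALLS[(fn.__name__, "error")] += 1
            raise  # re-raise so callers still see the error
        FUNCTION_CALLS[(fn.__name__, "ok")] += 1
        return value
    return wrapper

@instrumented
def my_function(x):
    if x < 0:
        raise ValueError("negative input")
    return x * 2

my_function(2)
try:
    my_function(-1)
except ValueError:
    pass

print(dict(FUNCTION_CALLS))
# → {('my_function', 'ok'): 1, ('my_function', 'error'): 1}
```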
Latency
This query shows us both the 99th and 95th percentile latencies in the same graph.
```promql
label_replace(
  histogram_quantile(
    0.99,
    sum by (le, function, module, commit, version) (
      rate(function_calls_duration_bucket{function="my_function",module="my_module"}[5m])
      * on (instance, job) group_left (version, commit)
      last_over_time(build_info[1s])
    )
  ),
  "percentile_latency",
  "99",
  "",
  ""
)
or
label_replace(
  histogram_quantile(
    0.95,
    sum by (le, function, module, commit, version) (
      rate(function_calls_duration_bucket{function="my_function",module="my_module"}[5m])
      * on (instance, job) group_left (version, commit)
      last_over_time(build_info[1s])
    )
  ),
  "percentile_latency",
  "95",
  "",
  ""
)
```
Looking through the parts of this one, we have some similarities with the previous queries and a number of differences:
- **`label_replace(..., "percentile_latency", "99", "", "")`**: since we want to show both percentiles in the same graph, we need to attach different labels to the different results. Here, we're attaching the label `percentile_latency="99"`. We do this by pretending to replace an empty label, because Prometheus does not have a function to simply add a label.
- **`histogram_quantile(0.99, ...)`**: we're using the built-in function to calculate the 99th percentile from our histogram data.
- **`sum by (le, function, module, commit, version)`**: this is mostly the same as what we saw in the request and error rate queries, with the exception of the `le` label. Histograms use this label to denote that a particular bucket counts observations less than or equal to the value of the label. As in the previous queries, this `sum by` expression keeps the labels in the parentheses while merging any others.
- **`rate(...[5m])`**: just like plain counters, histogram bucket counters only go up. So if we want to look at the observations within a specific time period, we need to use the rate as opposed to the cumulative total.
- **`function_calls_duration_bucket`**: when you create a histogram, Prometheus client libraries automatically create multiple time series. For each bucket you have configured, they create a time series with the `_bucket` suffix and the appropriate `le` label. They also add time series with the `_sum` and `_count` suffixes, but we can ignore those as we're not using them in this query.
- **`or label_replace(..., "percentile_latency", "95", "", "")`**: and we're doing it all again for the 95th percentile latency.
The rest should be recognizable from our previous queries!
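To demystify `histogram_quantile` itself, here is a simplified Python sketch of how a quantile is estimated from cumulative bucket counts. This illustrates the core interpolation idea only; Prometheus's real implementation handles edge cases this skips, and the bucket data is made up:

```python
# Illustrative sketch of histogram_quantile: given cumulative bucket
# counts keyed by their "le" upper bounds, find the bucket the requested
# quantile falls into and interpolate linearly within it.

def histogram_quantile(q, buckets):
    """buckets: sorted list of (le, cumulative_count); last le is +Inf."""
    total = buckets[-1][1]
    rank = q * total  # the observation rank we are looking for
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if le == float("inf"):
                return prev_le  # quantile falls in the open-ended bucket
            # linear interpolation between the bucket's bounds
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count

# 100 observed calls: 50 took <= 1s, 90 took <= 2s, all 100 took <= 4s.
buckets = [(1.0, 50), (2.0, 90), (4.0, 100), (float("inf"), 100)]
print(histogram_quantile(0.5, buckets))   # → 1.0 (seconds)
print(histogram_quantile(0.75, buckets))  # → 1.625
```

This is also why bucket boundaries matter: within a bucket, the estimate is a linear interpolation, so coarse buckets give coarse quantiles.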
Alerts and SLOs
As an advanced feature, Autometrics also enables you to produce alerts based on Service-Level Objectives (SLOs) that are defined in your source code.
We’ve created a single Prometheus recording and alerting rules file that uses even more fun PromQL tricks to work without customization for most Autometrics-instrumented projects. The rules are dormant by default and only enabled when Prometheus finds metrics with specific labels attached. When objectives are defined in the code, the libraries produce metrics with the right labels to activate the recording and alerting rules.
To learn more about the PromQL behind this feature, take a look at An Adventure with SLOs, Generic Prometheus Alerting Rules, and Complex PromQL Queries.
Queries are hard! Don’t write them by hand!
Autometrics was designed to give you the debugging powers of Prometheus without the pain of writing queries by hand. It is of course possible to write such queries yourself, but I wouldn’t recommend trying it during a stressful incident!
Autometrics standardizes function-level metrics, and then couples this standardization with details extracted from your source code to build powerful queries for you. You can also use these queries as a jumping off point, because it’s a lot easier to tweak an existing query than to build a whole new one from scratch.
Importantly, when using the queries provided, you know that the chart you’re looking at shows you what it purports to show you. In contrast, when writing queries by hand, there’s always the possibility that the query is syntactically valid but statistically incorrect or doesn’t actually answer the question you think it should. That can be a costly mistake if it sends you down the wrong investigative rabbit hole.
Want to add Autometrics to your project? It’s available today for Rust, Go, Python, TypeScript, and C#/.NET.
Get involved
Interested in helping us write more useful and complex PromQL so others don’t have to? Come get involved in the project! You can join us on Discord, take a look at the project roadmap, and chip in to our Github Discussions.