Evan Schwartz

Inside some complex Prometheus queries

Originally published on the Fiberplane Blog


Autometrics is an open source micro-framework for observability. It makes it trivial to instrument your code with the most useful metrics and then generates powerful Prometheus queries to help you identify and debug issues in production. This post goes into detail about how the queries it generates work.

Request rate

The first and simplest query Autometrics uses calculates the request rate for a given function.

sum by (function, module, commit, version) (
    rate({__name__=~"function_calls(_count)?(_total)?",function="my_function",module="my_module"}[5m])
  * on (instance, job) group_left (version, commit)
    last_over_time(build_info[1s])
)

Let’s break this query down piece by piece:

  1. **sum by (function, module, commit, version)** - this specifies that we want our results to contain the labels function, module, commit, and version; series that differ only in other labels are summed together. The function and module labels identify the function in our source code, while version and commit are properties of the binary as a whole.

  2. **rate({...}[5m])** - this calculates the per-second request rate, averaged over the 5 minutes preceding each point in time. We need rate because Prometheus counters only ever go up (they reset only when the process restarts). If we simply summed the raw counter values over time, we would count earlier requests many times over.

  3. **__name__=~"function_calls(_count)?(_total)?"** - this complicated-looking label selector uses a regular expression to match metrics with one of the following names: function_calls, function_calls_count, function_calls_total, or function_calls_count_total. This is a temporary necessity because Prometheus/OpenMetrics and OpenTelemetry had incompatible naming conventions. That’s been sorted out though, so we’ll soon be able to replace this with just function_calls_total. Also note that the metric name is just a special label called __name__, so function_calls_total{} is equivalent to {__name__="function_calls_total"}.

  4. **function="my_function",module="my_module"** - this selects the time series that correspond to a particular function in your code.

  5. **on (instance, job) group_left (version, commit)** - this is Prometheus’ equivalent of a left join from SQL. It matches time series using the instance and job labels, which Prometheus attaches automatically, and then copies the version and commit labels from the series on the right onto the matching series on the left. The group_left modifier allows many series on the left to match a single series on the right (see the sketch after this list).

  6. **last_over_time(build_info[1s])** - the version and commit labels from the previous part come from a special metric that Autometrics produces called build_info; last_over_time simply takes its most recent sample. We do this so that we can track your software’s version and help you identify commits that introduced errors or latency. Importantly, however, we avoid increasing the cardinality of our main metrics and improve storage efficiency by tracking those details on a separate info metric and only joining them in at query time.
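
To make the join easier to picture, here it is in isolation, as a minimal sketch (assuming the metric has settled on the plain function_calls_total name):

# build_info has a constant value of 1, so multiplying by it only
# attaches the version and commit labels to the rate series.
  rate(function_calls_total{function="my_function"}[5m])
* on (instance, job) group_left (version, commit)
  last_over_time(build_info[1s])

Because build_info always has the value 1, the multiplication leaves the request rate unchanged; it exists purely to carry the version and commit labels across.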

That’s the request rate! Aren’t you glad you didn’t have to write that query by hand? 😉

Error rate

(
    sum by (function, module, commit, version) (
        rate(
          {__name__=~"function_calls(_count)?(_total)?",function="my_function",module="my_module",result="error"}[5m]
        )
      * on (instance, job) group_left (version, commit)
        last_over_time(build_info[1s])
    )
  )
/
  (
    sum by (function, module, commit, version) (
        rate({__name__=~"function_calls(_count)?(_total)?",function="my_function",module="my_module"}[5m])
      * on (instance, job) group_left (version, commit)
        last_over_time(build_info[1s])
    )
  )

This query looks a lot more complicated, but it’s actually built from many of the same elements as the request rate query.

Let’s break this down piece by piece as well:

  1. **The numerator** - this is the request rate query from before, with one addition: the label selector result="error". Autometrics attaches a result label to every function call, so this selects only the calls that errored.

  2. **The denominator** - this is exactly the request rate query from before, which counts all calls regardless of their result.

  3. **/** - dividing the two gives us the fraction of calls that errored over the preceding 5 minutes. Because both sides are aggregated down to the same labels (function, module, commit, version), Prometheus divides each series in the numerator by its matching series in the denominator.

That gives us the error rate! It looks very complicated, but a lot of the magic comes from standardizing how we know whether a function errored, in a way that works across programming languages.
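
If we strip away the metric-name regex and the build_info join, the skeleton of the query is easier to see (a simplified sketch, again assuming the plain function_calls_total name):

  sum by (function, module) (
    rate(function_calls_total{function="my_function",module="my_module",result="error"}[5m])
  )
/
  sum by (function, module) (
    rate(function_calls_total{function="my_function",module="my_module"}[5m])
  )

A value of 0.02 here would mean that 2% of calls to my_function ended in an error over the trailing 5 minutes.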

Latency

This query shows us both the 99th and 95th percentile latencies in the same graph.

label_replace(
    histogram_quantile(
      0.99,
      sum by (le, function, module, commit, version) (
          rate(function_calls_duration_bucket{function="my_function",module="my_module"}[5m])
        * on (instance, job) group_left (version, commit)
          last_over_time(build_info[1s])
      )
    ),
    "percentile_latency",
    "99",
    "",
    ""
  )
or
  label_replace(
    histogram_quantile(
      0.95,
      sum by (le, function, module, commit, version) (
          rate(function_calls_duration_bucket{function="my_function",module="my_module"}[5m])
        * on (instance, job) group_left (version, commit)
          last_over_time(build_info[1s])
      )
    ),
    "percentile_latency",
    "95",
    "",
    ""
  )

Looking through the parts of this one, we have some similarities with the previous queries and a number of differences:

  1. **histogram_quantile(0.99, ...)** - Autometrics records function latency as a Prometheus histogram, which counts observations in a set of buckets. This function estimates the 99th percentile latency from those bucket counts.

  2. **sum by (le, ...)** - the le label ("less than or equal") marks each histogram bucket’s upper bound. Unlike in the earlier queries, we must keep it when aggregating, because histogram_quantile needs the per-bucket counts to estimate a quantile.

  3. **function_calls_duration_bucket** - the bucket series of the latency histogram, selected for our particular function.

  4. **label_replace(..., "percentile_latency", "99", "", "")** - this attaches a static label percentile_latency="99" to the results. Without it, the 99th and 95th percentile series would have identical labels and we couldn’t tell them apart.

  5. **or** - this combines the two result sets so that both percentiles can be plotted on the same graph.

The rest should be recognizable from our previous queries!
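
One piece worth demystifying is label_replace. Its arguments are the input vector, a destination label, a replacement value, a source label, and a regex; passing an empty source label and an empty regex is a standard trick for attaching a constant label. A minimal sketch, using the built-in up metric purely for illustration:

# The empty regex "" matches the (empty) value of the "" label, so
# percentile_latency="99" gets added to every series that up returns.
label_replace(up, "percentile_latency", "99", "", "")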

Alerts and SLOs

As an advanced feature, Autometrics also enables you to produce alerts based on Service-Level Objectives (SLOs) that are defined in your source code.

We’ve created a single Prometheus recording and alerting rules file that uses even more fun PromQL tricks to work without customization for most Autometrics-instrumented projects. The rules are dormant by default and only enabled when Prometheus finds metrics with specific labels attached. When objectives are defined in the code, the libraries produce metrics with the right labels to activate the recording and alerting rules.
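
The “dormant by default” behavior boils down to rule expressions whose selectors only match series carrying the objective labels. As a rough sketch (the exact label names, objective_name and objective_percentile, are based on Autometrics conventions and may differ in the shipped rules file):

# This matches nothing, so no alert can fire, until a function defines an
# objective; only then do its counters carry a non-empty objective_name label.
sum by (objective_name, objective_percentile) (
  rate({__name__=~"function_calls(_count)?(_total)?",objective_name!=""}[5m])
)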

To learn more about the PromQL behind this feature, take a look at An Adventure with SLOs, Generic Prometheus Alerting Rules, and Complex PromQL Queries.

Queries are hard! Don’t write them by hand!

Autometrics was designed to give you the debugging powers of Prometheus without the pain of writing queries by hand. It is of course possible to write such queries yourself, but I wouldn’t recommend trying it during a stressful incident!

Autometrics standardizes function-level metrics, and then couples this standardization with details extracted from your source code to build powerful queries for you. You can also use these queries as a jumping-off point, because it’s a lot easier to tweak an existing query than to build a whole new one from scratch.

Importantly, when using the queries provided, you know that the chart you’re looking at shows you what it purports to show you. In contrast, when writing queries by hand, there’s always the possibility that the query is syntactically valid but statistically incorrect or doesn’t actually answer the question you think it should. That can be a costly mistake if it sends you down the wrong investigative rabbit hole.

Want to add Autometrics to your project? It’s available today for Rust, Go, Python, TypeScript, and C#/.NET.

Get involved

Interested in helping us write more useful and complex PromQL so others don’t have to? Come get involved in the project! You can join us on Discord, take a look at the project roadmap, and chip in to our GitHub Discussions.

#autometrics #fiberplane #observability