Inside some complex Prometheus queries
Originally published on the Fiberplane Blog
Autometrics is an open source micro-framework for observability. It makes it trivial to instrument your code with the most useful metrics and then generates powerful Prometheus queries to help you identify and debug issues in production. This post goes into detail about how the queries it generates work.
Request rate
The first and simplest query Autometrics uses calculates the request rate for a given function.
```promql
sum by (function, module, commit, version) (
  rate({__name__=~"function_calls(_count)?(_total)?",function="my_function",module="my_module"}[5m])
  * on (instance, job) group_left (version, commit)
  last_over_time(build_info[1s])
)
```
Let’s break this query down piece by piece:
- **`sum by (function, module, commit, version)`**: this specifies that we want our results to contain the labels `function`, `module`, `commit`, and `version`, and that all other labels should be merged together. The `function` and `module` labels identify the function from our source code, while `version` and `commit` are properties of the whole binary.
- **`rate({...}[5m])`**: this calculates the request rate per second over the 5 minutes preceding a given point in time. This is necessary because Prometheus counters only ever go up. If we just summed the counter values over time, we would be counting earlier requests many times over.
- **`__name__=~"function_calls(_count)?(_total)?"`**: this complicated-looking label selector uses a regular expression to match metrics with one of the following names: `function_calls`, `function_calls_count`, `function_calls_total`, or `function_calls_count_total`. This is a temporary necessity because Prometheus/OpenMetrics and OpenTelemetry had incompatible naming conventions. That has since been sorted out, so we'll soon be able to replace this with just `function_calls_total`. Also note that the metric name is just a special label called `__name__`, so `function_calls_total{}` is equivalent to `{__name__="function_calls_total"}`.
- **`function="my_function",module="my_module"`**: this selects the time series that correspond to a particular function from your code.
- **`on (instance, job) group_left (version, commit)`**: this is Prometheus' equivalent of a left join from SQL. It uses the `instance` and `job` labels, which are automatically added by Prometheus, to join time series. It then adds the `version` and `commit` labels from the time series on the right to the matching series on the left.
- **`last_over_time(build_info[1s])`**: the `version` and `commit` labels from the previous part come from a special metric that Autometrics produces called `build_info`. We do this so that we can track your software's version and help you identify commits that introduced errors or latency. Importantly, however, we avoid increasing the cardinality of our main metrics, and improve storage efficiency, by tracking those details on a separate info metric and only joining them in at query time.
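The join is probably the least familiar part, so here is a rough Python sketch of what `on (instance, job) group_left (version, commit)` does. This illustrates the matching semantics only, not Prometheus internals, and the series data is made up:

```python
# Illustrative sketch (not Prometheus internals): how
# "* on (instance, job) group_left (version, commit)" copies labels from
# build_info onto each rate series sharing the same (instance, job) pair.

def group_left_join(left, right, on, group_left):
    """Attach the `group_left` labels from the matching right-hand series
    to every left-hand series, matching on the `on` labels."""
    # Index the right-hand side (build_info) by its (instance, job) values.
    index = {tuple(series[l] for l in on): series for series in right}
    joined = []
    for series in left:
        match = index.get(tuple(series[l] for l in on))
        if match is None:
            continue  # unmatched series drop out, as in PromQL
        joined.append({**series, **{l: match[l] for l in group_left}})
    return joined

rates = [{"instance": "10.0.0.1:8080", "job": "api", "function": "my_function"}]
build_info = [{"instance": "10.0.0.1:8080", "job": "api",
               "version": "1.2.0", "commit": "abc123"}]

print(group_left_join(rates, build_info,
                      on=("instance", "job"),
                      group_left=("version", "commit")))
# → one series, now carrying version="1.2.0" and commit="abc123"
```

Since `build_info` is a gauge whose value is always 1, the `*` multiplication in the real query leaves the rate values untouched; only the label join matters.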
That’s the request rate! Aren’t you glad you didn’t have to write that query by hand? 😉
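For intuition, here is roughly what `rate()` computes for a counter, sketched in Python. This is a simplification: the real Prometheus implementation also extrapolates to the window boundaries and corrects for counter resets, both of which this sketch skips. The sample data is made up:

```python
# Illustrative sketch of rate(...[5m]): the per-second increase of a
# counter over the window, based on the samples inside it.

def simple_rate(samples):
    """samples: list of (unix_timestamp, counter_value) inside the window."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# A counter scraped every 60s over a 5-minute window:
window = [(0, 100), (60, 130), (120, 160), (180, 190), (240, 220), (300, 250)]
print(simple_rate(window))  # → 0.5 requests per second
```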
Error rate
```promql
(
  sum by (function, module, commit, version) (
    rate(
      {__name__=~"function_calls(_count)?(_total)?",function="my_function",module="my_module",result="error"}[5m]
    )
    * on (instance, job) group_left (version, commit)
    last_over_time(build_info[1s])
  )
)
/
(
  sum by (function, module, commit, version) (
    rate({__name__=~"function_calls(_count)?(_total)?",function="my_function",module="my_module"}[5m])
    * on (instance, job) group_left (version, commit)
    last_over_time(build_info[1s])
  )
)
```
This query looks a lot more complicated, but it’s actually using many of the same elements from the request rate query.
Let’s break this down piece by piece as well:
- **`(...) / (...)`**: this is the error rate, so we're dividing the number of errors by the total number of requests. In fact, you can see that the expression in the second set of parentheses is exactly the same as the request rate we broke down above.
- **`result="error"`**: this is the only difference between the first expression and the second. This label indicates that the function call resulted in an error, as opposed to a successful result. Autometrics is able to determine this based on how each programming language handles errors. In Rust, it's based on functions returning the built-in `Result` enum. In Python, TypeScript, C#, and Java, it's based on whether a function threw an exception. And in Go, it's based on whether the function returned an error.
That gives us the error rate! It looks very complicated, but a lot of the magic comes from standardizing how we know whether a function errored, in a way that works across programming languages.
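The `result` label is the crux of the standardization. As a rough illustration of the idea (a hypothetical sketch, not the autometrics library's actual implementation; the counter and label values here are made up), a decorator can record each call as "ok" or "error" depending on whether the function raised:

```python
# Hypothetical sketch of result-label instrumentation: count each call
# with a result of "ok" or "error" based on whether the function raised.

import functools
from collections import Counter

FUNCTION_CALLS = Counter()  # stand-in for a real Prometheus counter

def instrumented(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            value = fn(*args, **kwargs)
        except Exception:
            FUNCTION_CALLS[(fn.__name__, "error")] += 1
            raise  # re-raise so callers still see the error
        FUNCTION_CALLS[(fn.__name__, "ok")] += 1
        return value
    return wrapper

@instrumented
def my_function(x):
    if x < 0:
        raise ValueError("negative input")
    return x * 2

my_function(2)
try:
    my_function(-1)
except ValueError:
    pass

print(dict(FUNCTION_CALLS))
# → {('my_function', 'ok'): 1, ('my_function', 'error'): 1}
```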
Latency
This query shows us both the 99th and 95th percentile latencies in the same graph.
```promql
label_replace(
  histogram_quantile(
    0.99,
    sum by (le, function, module, commit, version) (
      rate(function_calls_duration_bucket{function="my_function",module="my_module"}[5m])
      * on (instance, job) group_left (version, commit)
      last_over_time(build_info[1s])
    )
  ),
  "percentile_latency",
  "99",
  "",
  ""
)
or
label_replace(
  histogram_quantile(
    0.95,
    sum by (le, function, module, commit, version) (
      rate(function_calls_duration_bucket{function="my_function",module="my_module"}[5m])
      * on (instance, job) group_left (version, commit)
      last_over_time(build_info[1s])
    )
  ),
  "percentile_latency",
  "95",
  "",
  ""
)
```
Looking through the parts of this one, we have some similarities with the previous queries and a number of differences:
- **`label_replace(..., "percentile_latency", "99", "", "")`**: since we want to show both percentiles in the same graph, we need to attach different labels to the different results. Here, we're attaching the label `percentile_latency="99"`. We do this by pretending to replace an empty label, because Prometheus does not have a function to simply add a label.
- **`histogram_quantile(0.99, ...)`**: we're using the built-in function to calculate the 99th percentile from our histogram data.
- **`sum by (le, function, module, commit, version)`**: this is mostly the same as what we saw in the request and error rate queries, with the exception of the `le` label. Histograms use this label to denote that a particular bucket counts observations less than or equal to the value of the label. As in the previous queries, this `sum by` expression keeps the labels in the parentheses while merging any others.
- **`rate(...[5m])`**: just like plain counters, histogram bucket counters only go up. So if we want to look at the observations within a specific time period, we need to use the rate as opposed to the cumulative total.
- **`function_calls_duration_bucket`**: when you create a histogram, Prometheus client libraries automatically create multiple time series. For each bucket you have configured, they create a time series with the `_bucket` suffix and the appropriate `le` label. They also add time series with the `_sum` and `_count` suffixes, but we can ignore those as we're not using them in this query.
- **`or label_replace(..., "percentile_latency", "95", "", "")`**: and we're doing it all again for the 95th percentile latency.
The rest should be recognizable from our previous queries!
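To demystify `histogram_quantile` itself, here is a simplified Python sketch of how a quantile is estimated from cumulative bucket counts. This illustrates the core interpolation idea only; Prometheus's real implementation handles edge cases this skips, and the bucket data is made up:

```python
# Illustrative sketch of histogram_quantile: given cumulative bucket
# counts keyed by their "le" upper bounds, find the bucket the requested
# quantile falls into and interpolate linearly within it.

def histogram_quantile(q, buckets):
    """buckets: sorted list of (le, cumulative_count); last le is +Inf."""
    total = buckets[-1][1]
    rank = q * total  # the observation rank we are looking for
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if le == float("inf"):
                return prev_le  # quantile falls in the open-ended bucket
            # linear interpolation between the bucket's bounds
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count

# 100 observed calls: 50 took <= 1s, 90 took <= 2s, all 100 took <= 4s.
buckets = [(1.0, 50), (2.0, 90), (4.0, 100), (float("inf"), 100)]
print(histogram_quantile(0.5, buckets))   # → 1.0 (seconds)
print(histogram_quantile(0.75, buckets))  # → 1.625
```

This is also why bucket boundaries matter: within a bucket, the estimate is a linear interpolation, so coarse buckets give coarse quantiles.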
Alerts and SLOs
As an advanced feature, Autometrics also enables you to produce alerts based on Service-Level Objectives (SLOs) that are defined in your source code.
We’ve created a single Prometheus recording and alerting rules file that uses even more fun PromQL tricks to work without customization for most Autometrics-instrumented projects. The rules are dormant by default and only enabled when Prometheus finds metrics with specific labels attached. When objectives are defined in the code, the libraries produce metrics with the right labels to activate the recording and alerting rules.
To learn more about the PromQL behind this feature, take a look at An Adventure with SLOs, Generic Prometheus Alerting Rules, and Complex PromQL Queries.
Queries are hard! Don’t write them by hand!
Autometrics was designed to give you the debugging powers of Prometheus without the pain of writing queries by hand. It is of course possible to write such queries yourself, but I wouldn’t recommend trying it during a stressful incident!
Autometrics standardizes function-level metrics, and then couples this standardization with details extracted from your source code to build powerful queries for you. You can also use these queries as a jumping off point, because it’s a lot easier to tweak an existing query than to build a whole new one from scratch.
Importantly, when using the queries provided, you know that the chart you’re looking at shows you what it purports to show you. In contrast, when writing queries by hand, there’s always the possibility that the query is syntactically valid but statistically incorrect or doesn’t actually answer the question you think it should. That can be a costly mistake if it sends you down the wrong investigative rabbit hole.
Want to add Autometrics to your project? It’s available today for Rust, Go, Python, TypeScript, and C#/.NET.
Get involved
Interested in helping us write more useful and complex PromQL so others don’t have to? Come get involved in the project! You can join us on Discord, take a look at the project roadmap, and chip in to our Github Discussions.