When Can You Add Prometheus Labels Without Increasing Cardinality?

Jul 11, 2023

Originally published on the Fiberplane Blog


Prometheus and other time series databases (TSDBs) don’t work well when your labels take on too many distinct values. However, there are a few specific cases where adding labels is fine. This post explains when adding a label does not increase cardinality.

How Prometheus stores time series

To understand the root of the problem we’re trying to avoid, it’s important to have a mental model of how Prometheus stores its data.

In Prometheus, every time series is identified by a unique set of label values. For example, we might have data like the following, which tracks the number of HTTP requests made to an API. The two labels are path and status. (Technically, there is a third: the metric name is stored as a special label called __name__.)

[Figure: Prometheus data points]

For every label set, Prometheus stores a sequence of data points, each holding the value of the metric it observed at a particular point in time.
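To make this mental model concrete, here is a minimal Python sketch. It is not how Prometheus is actually implemented; it only shows the idea that each series is keyed by its full label set (including __name__) and that samples are appended to that series. The metric name http_requests_total and the label values are hypothetical.

```python
from collections import defaultdict


def series_key(name, labels):
    # The metric name is just another label (__name__). A frozenset makes the
    # label set hashable and independent of the order labels are written in.
    return frozenset({"__name__": name, **labels}.items())


# Hypothetical in-memory store: label set -> list of (timestamp, value) samples.
store = defaultdict(list)


def append(name, labels, timestamp, value):
    store[series_key(name, labels)].append((timestamp, value))


append("http_requests_total", {"path": "/users", "status": "200"}, 1000, 1.0)
append("http_requests_total", {"path": "/users", "status": "200"}, 1015, 3.0)
append("http_requests_total", {"path": "/users", "status": "500"}, 1015, 1.0)

print(len(store))  # 2 series: two samples share the same label set
```

Adding a sample to an existing label set only grows that series; only a new combination of label values creates a new series.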

Cardinality explosions, or the label cardinality problem

What would happen if we added a request ID or trace ID as a label in our example above? We would suddenly be storing a separate time series for every single request we ever see, which would quickly overwhelm our Prometheus instance. This is especially wasteful because each of those time series would belong to a single, ephemeral request.
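One way to see the difference is to count distinct label sets. The Python sketch below uses 1,000 hypothetical requests spread over two paths and three statuses, with made-up sequential trace IDs standing in for real ones.

```python
paths = ["/users", "/orders"]
statuses = ["200", "404", "500"]

# Hypothetical requests: each gets its own unique trace ID.
requests = [
    {"path": paths[i % 2], "status": statuses[i % 3], "trace_id": f"{i:032x}"}
    for i in range(1000)
]


def count_series(reqs, label_names):
    # Each distinct tuple of values for the chosen labels is one time series.
    return len({tuple(r[name] for name in label_names) for r in reqs})


without_trace = count_series(requests, ["path", "status"])
with_trace = count_series(requests, ["path", "status", "trace_id"])
print(without_trace, with_trace)  # 6 1000
```

Without the trace ID, the requests collapse into 6 series; with it, there is one series per request.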

Using trace IDs as labels is an extreme example, but we can run into problems even with more benign-seeming labels. What about something like an organization ID? It depends on how many organizations your API serves, but this can also cause issues because the cardinality of different labels compounds.

If we already have the labels path and status and then add organization_id, we’re going to need a new time series for every status returned from every path to the users of every organization.

This is the cardinality explosion. The number of possible combinations of label values, and therefore the worst-case number of time series, is the product of the number of distinct values for each label. We therefore need to ensure both that no single label has too many values and that the labels, taken together, don’t produce too many combinations.
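To put numbers on this, here is a small Python sketch of the worst-case calculation. The value counts (10 paths, 5 statuses, 1,000 organizations) are hypothetical; in practice the real series count is the number of combinations actually observed, which can be lower than this bound.

```python
from math import prod

# Hypothetical number of distinct values for each label.
distinct_values = {"path": 10, "status": 5, "organization_id": 1000}


def max_series(label_names):
    # Worst case: every combination of label values appears at least once,
    # so the number of series is the product of the per-label counts.
    return prod(distinct_values[name] for name in label_names)


print(max_series(["path", "status"]))                     # 50
print(max_series(["path", "status", "organization_id"]))  # 50000
```

A single extra label multiplies the bound by its own number of values, which is why a label that looks harmless on its own can still cause trouble.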

When can we add labels without increasing cardinality?

Despite the general advice not to add too many labels, we can fearlessly add a label when it is perfectly correlated with existing labels, meaning each combination of existing label values always maps to the same value of the new label.

Let’s illustrate this using the example above. Say we wanted to include the HTTP status text alongside the status code. (This is not a particularly useful idea, but it illustrates the point.) Our table would look something like this:

[Figure: Labels, showing the same time series with an added status_text label]

In this example, each time series has more labels but, crucially, the number of time series is exactly the same. This is because every status_text value always maps one-to-one to a specific status_code value.
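Here is a minimal Python sketch of that argument, using hypothetical label sets. It derives status_text from status_code with the standard library’s http.HTTPStatus and counts series before and after adding the label.

```python
from http import HTTPStatus

# Hypothetical label sets currently stored for the request counter.
observed = [
    {"path": "/users", "status_code": "200"},
    {"path": "/users", "status_code": "404"},
    {"path": "/orders", "status_code": "200"},
    {"path": "/orders", "status_code": "500"},
]


def with_status_text(labels):
    # status_text is computed from status_code, so it can never vary
    # independently of it.
    text = HTTPStatus(int(labels["status_code"])).phrase
    return {**labels, "status_text": text}


def count_series(label_sets):
    return len({frozenset(ls.items()) for ls in label_sets})


before = count_series(observed)
after = count_series([with_status_text(ls) for ls in observed])
print(before, after)  # 4 4
```

Because the new label is a pure function of an existing one, no two previously identical label sets can be split apart, and no previously distinct ones merge.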

This example is contrived, and adding such a label is probably inadvisable: it costs a little extra storage for the label text while providing little value, since the text can always be derived from the code.

However, this came up in a real discussion related to the Autometrics project. Autometrics instruments functions with the most useful metrics and then generates Prometheus queries using the instrumented function names.

The discussion was about adding a service_name label to all of the metrics produced by a given service, to distinguish the metrics of different deployments of the same code base that act as different logical services. Prometheus already attaches the job and instance labels whenever it scrapes a target. As long as service_name always takes the same value for a given combination of job and instance, adding it won’t increase cardinality.
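That condition can be checked mechanically. The Python sketch below is a hypothetical helper, not part of Autometrics or Prometheus: it tests whether a candidate label is fully determined by a set of existing labels. The job names, instance addresses, and service names are made up.

```python
from collections import defaultdict


def is_determined_by(label_sets, new_label, existing_labels):
    """Return True if every combination of existing_labels maps to exactly one
    value of new_label, i.e. adding new_label cannot create extra series."""
    seen = defaultdict(set)
    for ls in label_sets:
        key = tuple(ls[name] for name in existing_labels)
        seen[key].add(ls[new_label])
    return all(len(values) == 1 for values in seen.values())


# Hypothetical scrape targets: the same code base deployed as two services.
targets = [
    {"job": "api", "instance": "10.0.0.1:9464", "service_name": "billing"},
    {"job": "api", "instance": "10.0.0.2:9464", "service_name": "billing"},
    {"job": "api-search", "instance": "10.0.0.3:9464", "service_name": "search"},
]

print(is_determined_by(targets, "service_name", ["job", "instance"]))  # True
```

If the same job and instance pair ever reported two different service_name values, the function would return False, and those series would be split in two.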

Conclusion

When you’re using Prometheus to scrape, store, and analyze metrics, it’s important to avoid adding too many labels. Specifically, you need to keep the cardinality of your label sets low or you’ll overwhelm Prometheus. In the special case where a new label’s value is fully determined by a specific set of existing label values, however, you can add it without increasing cardinality and keep Prometheus happy.

A developer-friendly metrics experience with Autometrics

Autometrics is an open source micro-framework for observability built on Prometheus and OpenTelemetry. It makes it easy to instrument code with the most useful metrics, standardizes these metrics, and then generates powerful PromQL queries to help you identify and debug issues in production.

If you’re using Prometheus, or looking to get started with it, and want to track useful metrics without the hassle of figuring out what to track yourself, add Autometrics to your project today! It’s available now in Rust, Go, Python, TypeScript, and C#, with more languages on the way.
