Evan Schwartz

The Case for Function-Level Metrics

Originally published on the Fiberplane Blog


Cloudflare recently released an excellent blog post about how they run Prometheus at scale to monitor the health and performance of their systems. In it, Łukasz Mierzwa mainly focuses on understanding the problems with high cardinality metrics, but he also mentions the difficulties of “managing the entire lifecycle of a metric from an engineering perspective”. The challenges he goes on to list are exactly what motivated us to start the open source autometrics project. In this post, I’ll try to show how the simple idea of function-level metrics, which autometrics is built upon, neatly addresses five of the key challenges for setting up and effectively using metrics today.

Function-level metrics

So, why function-level metrics? Functions are already the building blocks of all code! Each one is a named unit of work, and if we’ve structured our code reasonably well, our application logic is probably already divided into sensible units.

As such, functions are an excellent building block for observability. They are often used for spans in distributed tracing. And, as we’ll see here, they are also very useful units for tracking aggregated metrics.

If we want to understand the health and performance of specific functions, what exactly should we measure? The RED method provides a good starting point: Request rate, Error rate, and Duration. It’s worth noting that the Four Golden Signals from the Google SRE book include these three as well as saturation.

Throughout this post, I’ll use the term “function-level metrics” to refer to the idea of tracking the RED metrics for functions in source code. This is a simple idea that turns out to have a profound impact on the setup, usability, and developer-friendliness of metrics. If you are interested, you can find the specific metric names and labels we use in the Appendix.

Metrics lifecycle challenges and the benefits of function-level metrics

The Cloudflare post summarizes the challenges of setting up and using metrics today as follows (emphasis and numbers mine):

Managing the entire lifecycle of a metric from an engineering perspective is a complex process.

You must define your metrics (1) in your application, with names and labels (2) that will allow you to work with resulting time series (3) easily. Then you must configure Prometheus scrapes in the correct way and deploy that to the right Prometheus server. Next you will likely need to create recording and/or alerting rules (4) to make use of your time series. Finally you will want to create a dashboard to visualize all your metrics (5) and be able to spot trends.

Let’s talk about each of these issues in turn and see how function-level metrics address them.

Note that while configuring Prometheus is an important challenge, it is out of scope for this blog post. That is more of an infrastructure concern and not something that function-level metrics can help with.

Challenges 1 & 2: Defining and naming metrics

It’s hard to figure out what to track. The more you instrument and measure, the more insight you’ll get into your application’s state. But to add custom metrics, you need to answer difficult, almost existential questions: how do you know if your application is working “well” or “well enough”? What does “well enough” even mean, and what information might be helpful for debugging issues in the future? These questions can be hard to answer up front, and many people never get past this point.

Once you have an idea of what you want to track, you also need to come up with the metric names and labels. Naming is one of the two hard problems in computer science. And this is further complicated by different standards’ conflicting approaches to semantic conventions. Should metric names include their units? Should counters be suffixed with “count” or “total”? Should you use dots or underscores as namespace separators? All of these are necessary but distracting questions when naming metrics.

Using function-level metrics side-steps the problems of deciding what to track and figuring out the names for metrics and labels. As mentioned above, we are tracking the request rate, error rate, and duration of function calls. The function name is attached to the metrics as a label. We can avoid further discussion and bikeshedding.

The autometrics libraries generate RED metrics for functions in your source code using the metaprogramming techniques offered by each programming language. This makes it trivial to add useful metrics to HTTP/RPC handlers, database methods, or any other bit of application logic. You get a useful level of granularity without the endless discussions about naming or what to track.
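For illustration, here is a minimal sketch of what that instrumentation looks like with the Rust library. The User type and list_users function are made up for the example; only the #[autometrics] attribute comes from the library.

```rust
use autometrics::autometrics;

// Hypothetical domain type, stubbed out for the sketch.
pub struct User {
    pub id: u64,
}

// Adding the attribute is all that is needed: the library generates
// request rate, error rate, and duration metrics for this function,
// labeled with the function's name.
#[autometrics]
pub async fn list_users() -> Vec<User> {
    // ...real application logic goes here
    vec![User { id: 1 }]
}
```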

Challenge 3: Working with (querying) time series

Once you have some metrics data, you need to query it to make use of it. Query languages like PromQL require time and effort to get comfortable with, and they often have subtle gotchas that are particularly challenging for newcomers (like always needing to take a rate() before aggregating with sum()).

An especially pernicious problem is the lingering uncertainty about whether your query returns the right data to answer a particular question. You may have written a syntactically valid query. But was it statistically meaningful? Does the chart show you what you think it shows you? It’s hard for non-experts to know for sure and it is particularly painful to discover that you’re looking at the wrong data during a high-stress incident. Systems like Prometheus cannot really help you craft the right query because they have no understanding of the semantics of your metrics data.

Defining a standard for function-level metrics enables us to automatically write queries to answer specific questions about the state of our system. For example, we use a label called result to indicate whether a function call was successful or errored. We can then automatically create a PromQL query based on this metric to calculate the success or error rate of a given function. The same applies for queries related to the request rate and latency of specific functions.
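As a rough sketch of how this works in practice with the Rust library: when an instrumented function returns a Result, the outcome of each call can be recorded in the result label. The get_user function below is hypothetical, and the metric and label names in the comments are illustrative (see the Appendix note about naming conventions).

```rust
use autometrics::autometrics;

// Hypothetical error type for the example.
#[derive(Debug)]
pub struct DbError;

// Because this function returns a Result, each call's outcome can be
// labeled: Ok is recorded as a success and Err as an error.
#[autometrics]
pub async fn get_user(id: u64) -> Result<u64, DbError> {
    if id == 0 {
        Err(DbError)
    } else {
        Ok(id)
    }
}

// An auto-generated error-ratio query for this function would look
// roughly like the following PromQL (metric names are illustrative):
//
//   sum(rate(function_calls_count{function="get_user", result="error"}[5m]))
//     / sum(rate(function_calls_count{function="get_user"}[5m]))
```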

The most fun feature of the autometrics libraries is that they write queries for you and insert links to the live Prometheus charts directly into each function’s doc comments. This means you can hover over a function name while exploring your code and instantly jump to the live data for that specific function. If you want to explore further, you can use the generated queries as a starting point, since it’s far easier to modify a query than to write one from scratch.

You can see what the experience looks like in this video demoing the Rust implementation.

Challenge 4: Alerting rules

Once you’ve set up metrics, you’ll eventually want to create alerts to let you know when the code is not performing as expected. This is a complex topic in and of itself. Many teams put off defining alerts because it is difficult to configure alerts that will fire when there is a real problem but won’t accidentally page sleeping developers with false alarms.

By using function-level metrics generated from the source code, we can infer enough about what “good” and “bad” cases look like to make it easy to build powerful alerts. We can leverage best practices around Service-Level Objectives (SLOs) and define objectives related to the success rate and/or latency of specific groups of functions. The various functions’ metrics can be grouped into objectives using labels. A powerful implication of defining SLOs based on groups of functions is that debugging alerts becomes easier, as you can quickly look at a graph of all of the functions that comprise a certain objective to pinpoint the source of an alert.

The autometrics libraries enable you to define useful alerts based on SLOs directly in your source code. The libraries use a single Prometheus recording and alerting rules file that will work without customization for most autometrics-instrumented projects (using some fun tricks with labels). You can read the full details of how this works in: An adventure with SLOs, generic Prometheus alerting rules, and complex PromQL queries. It is worth emphasizing that this type of feature is made possible by standardizing the mapping from functions’ code to the metrics they generate.
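As a sketch of what that can look like in the Rust library (the objective name and targets here are made up, and the exact type and method names may differ between versions):

```rust
use autometrics::autometrics;
use autometrics::objectives::{Objective, ObjectiveLatency, ObjectivePercentile};

// A hypothetical SLO: 99.9% of calls in this group succeed, and 99% of
// them complete within 250ms. The objective is attached to functions'
// metrics as labels, which is what the shared alerting rules key off of.
const API_SLO: Objective = Objective::new("api")
    .success_rate(ObjectivePercentile::P99_9)
    .latency(ObjectiveLatency::Ms250, ObjectivePercentile::P99);

// Every function tagged with this objective is grouped into the same SLO.
#[autometrics(objective = API_SLO)]
pub async fn create_user() {
    // ...application logic
}
```

Because every function tagged with the same objective contributes to the same success-rate and latency targets, the single, generic alerting rules file mentioned above can work across projects without customization.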

Challenge 5: Visualizing data

Finally, creating good dashboards to visualize your data is an art of its own. It requires writing good queries as well as deciding what information might be helpful to get an overview of the system or debug a specific issue.

Standardizing function-level metrics also enables us to build dashboards that will work out-of-the-box for any code base producing metrics with the same conventions. By default, we can include charts showing how the system is performing against its objectives, an overview of the instrumented functions, and a way to dig into the metrics for specific functions.

We’re currently working on Grafana dashboards that show various levels of details for any autometrics-instrumented project. These will be available soon through the Grafana dashboard marketplace.

Conclusion: Developer-friendly observability with function-level metrics

Tying metrics to functions is a simple idea that addresses many of the challenges with setting up and using metrics today. Standardizing function-level metrics makes it easy to generate useful metrics directly from source code; it then enables queries to be written automatically and alerting rules and dashboards to be shared across projects, making the data actionable.

Linking observability data to functions in source code also makes it far more understandable to developers. Labels match the actual function names developers are familiar with, and autometrics features like inserting links to live charts directly into doc comments make production data explorable from the IDE.

The advantages exist in reverse as well: linking metrics to code also makes it easier to go straight from an alert or a chart to debugging the specific functions that are not performing as expected. Even without deep knowledge of the code base, function-level metrics can help pinpoint the actual function causing a problem. In some cases, the first responder might already be able to spot the issue by looking at that function’s code. Or, if they need to turn the issue over to the service’s developers, pointing the developer to a particular function provides a more specific starting point for debugging and may speed up the resolution time.

Get involved!

If you’re interested in adding metrics to your functions, check out the open source autometrics libraries.

Also, if you’d like to help bring this to more programming languages or contribute to standardizing function-level metrics, please come get involved in the project – we would love to hear from you!

Appendix: Metrics and labels used by autometrics

The autometrics libraries use the following metrics and labels to track the request rate, error rate, and duration of function calls:

Note that OpenTelemetry and OpenMetrics/Prometheus use slightly different conventions with regard to separators, counter names, and units. We have an ongoing discussion about what to do about these differences.

Also, while these metrics are not yet standardized, they could theoretically be included in a standard such as the OpenTelemetry Semantic Conventions for Metrics.

#autometrics #fiberplane #observability