OpenAPM

OpenAPM is an Application Performance Monitoring toolkit based on Prometheus, and Grafana.

What is OpenAPM

The OpenAPM project was created to make monitoring the applications using Open Standards such as Prometheus and Grafana easier. It consists of auto-instrumentation libraries for popular languages, standard Grafana dashboards for visualization and alerting rules.

Instrumentation

OpenAPM supports auto-instrumentation of Node.js, Ruby, Golang and Python applications.

Once the OpenAPM instrumentation package is set up in your application, you can scrape the metrics via the Prometheus agent and push it to Prometheus or a Prometheus-compatible long-term storage like Last9.

info

The metrics can be sent to Prometheus using Prometheus Remote Write Protocol.

Advanced Capabilities

The OpenAPM instrumentation libraries support following additional capabilities over instrumenting RED metrics.

Multi-Tenancy support

Extract tenant information from the URL and emit as a label
Additional labels and metadata can be extracted by using Regular expressions

Default labels

Add any number of constant default labels apart from the request metrics

Track application lifecycle events

Support for tracking change events such as application_started.

Querying

OpenAPM data is stored in Prometheus compatible systems, so one can use PromQL to query the data.

Visualization

OpenAPM collection ships with ready to import Grafana dashboards for applications to understand key performance indicators such as Apdex score, slow endpoints and slow database queries among many others.

Steps to import default dashboard

Import this dashboard into Grafana
Set up the data source to the one where metrics from OpenAPM are getting sent
Save the dashboard

APM Dashboard - RED Metrics APM Dashboard - DB Metrics APM Dashboard - Infra Metrics

Alerting

OpenAPM comes with standard alert definitions that can be used to monitor the applications.

Apdex score calculation

Apdex stands for Application Performance Index. It is a measure of application performance with respect to user satisfaction.

Why is the Apdex score a better measure of application performance?

The definition of application performance varies as per different personas. For the engineer, it's response time; for SRE, it's uptime; for the product manager, it is user retention on the product. These definitions vary as per the roles or priorities of each role. In such cases, we have to find a uniform way to measure the application performance across multiple services, applications, and teams. How to do that?

Apdex solves that problem by quantifying user experience to a number, as at the end of any business all roles are trying to find user experience by using different measurements.

To define Apdex in terms of user experiences, we take three different categories of users.

Satisfied: One who enjoys the application experience without hiccups or slowness
Tolerating: One who faces a lag or slowness but keeps using the application without complaining
Frustrated: One who abandoned the application due to lag or slowness

Based on application response time, the Apdex score measures satisfied, tolerating, and frustrated users.

How to find an Apdex score for an HTTP application?

Define a satisfied user response time threshold, for example, a happy customer would get a response within 1 second
4 times the satisfied user threshold defines the threshold for tolerating users. So, using the above example, anything from 1 second to 4 seconds defines the tolerating user response time. Anything above that defines frustrated user response time

Find the Apdex score by:

Apdex score = (
                  No. of satisfied users +
                  (0.5) * No. of tolerating users
                   + 0 * No. of frustrated users
              ) / Total users

For any application, the number of users can be quantified by the number of requests, which is throughput.

Assumptions:

Application received 1000 requests in total
The satisfied user threshold is 1 second
No. of requests finished within 1 second = 600
No. of requests finished within 1 to 4 seconds = 200
No. of requests finished greater than 4 seconds = 200

Apdex Score will be:

Satisfied users: (600 * 1 = 600)
Tolerating users: (200 * 0.5 = 100)

Frustrating users: (200 * 0 = 0)

Apdex score = (600 + 100 + 0 ) / 1000 = 0.7

PromQL to calculate Apdex score

For calculating the Apdex score, OpenAPM utilizes following metrics:

http_requests_duration_milliseconds_bucket which denotes the latency of the application
http_requests_duration_milliseconds_count which denotes the throughput of the application
OpenAPM uses P50 as the measure of satisfied users. This is dynamic threshold which measures user experience in real time
```
histogram_quantile(0.50, sum by (le) (rate(http_requests_duration_milliseconds_bucket{}[4m])*60))
```

Satisfied Users as satisfied_users

 sum(
   topk(1, sum by (le) (rate(http_requests_duration_milliseconds_bucket{}[4m]) * 60)
   and
   label_value(sum by (le)(http_requests_duration_milliseconds_bucket{}), "le") <  threshold))

Tolerating Users as tolerating_users

sum(
   topk(1, sum by (le) (rate(http_requests_duration_milliseconds_bucket{}[4m]) * 60)
   and
   label_value(sum by (le)(http_requests_duration_milliseconds_bucket{}), "le") <  4 * threshold))

-

sum(
   topk(1, sum by (le) (rate(http_requests_duration_milliseconds_bucket{}[4m]) * 60)
   and
   label_value(sum by (le)(http_requests_duration_milliseconds_bucket{}), "le") <  threshold))

Total Users as total_users which considers total requests

sum (rate(http_requests_duration_milliseconds_count{}[4m]) * 60))

Final PromQL

Apdex score = ( satisfied_users + ( tolerating_users / 2 ) ) / total_users

The final PromQL to calculate the Apdex score is as follows.

((sum(topk(1,sum by (le)(rate(http_requests_duration_milliseconds_bucket{program=~"$program", version=~"$version", environment=~"$environment"}[4m])*60) and label_value(sum by (le)(http_requests_duration_milliseconds_bucket{program=~"$program", version=~"$version", environment=~"$environment"}), "le") <  histogram_quantile(0.50, sum by (le)(rate(http_requests_duration_milliseconds_bucket{program=~"$program", version=~"$version", environment=~"$environment"}[4m])*60)))) +

(sum(topk(1,sum by (le)(rate(http_requests_duration_milliseconds_bucket{program=~"$program", version=~"$version", environment=~"$environment"}[4m])*60) and label_value(sum by (le)(http_requests_duration_milliseconds_bucket{program=~"$program", version=~"$version", environment=~"$environment"}), "le") <  histogram_quantile(0.90, sum by (le)(rate(http_requests_duration_milliseconds_bucket{program=~"$program", version=~"$version", environment=~"$environment"}[4m])*60)))) - sum(topk(1,sum by (le)(rate(http_requests_duration_milliseconds_bucket{program=~"$program", version=~"$version", environment=~"$environment"}[4m])*60) and label_value(sum by (le)(http_requests_duration_milliseconds_bucket{program=~"$program", version=~"$version", environment=~"$environment"}), "le") <  histogram_quantile(0.50, sum by (le)(rate(http_requests_duration_milliseconds_bucket{program=~"$program", version=~"$version", environment=~"$environment"}[4m])*60)))))/2)
/
sum (rate(http_requests_duration_milliseconds_count{program=~"$program", version=~"$version", environment=~"$environment"}[4m])*60))

tip

You don't need to remember this complex PromQL. Last9 clusters are equipped with PromQL macros and you just have to use

apdex_score(0.5, "<service_name>", http_requests_duration_milliseconds_bucket, http_requests_duration_milliseconds_count)

Explaining Functions used by the PromQL

`label_value`

label_value(sum by (le)(http_requests_duration_milliseconds_bucket{}), "le") <  threshold

label_value is used to get any label value from a query. Whatever labels are coming as part of output series from query, we can select which label values we are interested in by providing it's label as 2nd argument to label_value Here threshold is latency (P50) so it must be either in seconds or milliseconds. On receiving label value le which gives time buckets, we are filtering all buckets which has value less than threshold.

`and` operator

sum by (le) (rate(http_requests_duration_milliseconds_bucket{}[4m]) * 60)
     and
label_value(sum by (le)(http_requests_duration_milliseconds_bucket{}), "le")

and operator is used to finding intersection between first query and second query. Here it will give all those timeseries where ouput timeseries of first query is intersecting with output timeseries of second query.

`topk`

topk(1, sum by (le) (rate(http_requests_duration_milliseconds_bucket{}[4m]) * 60)
     and
     label_value(sum by (le)(http_requests_duration_milliseconds_bucket{}), "le") <  threshold)

topk is used to get top k number of series with highest value from query.

Understaing reason behind [4m]

The Counter is one of the metric types supported by Prometheus-compatible systems. The nature of these metrics is monotonically increasing. Instant values of any such metric are barely helpful. So, to find meaningful information, we have to find the rate of change of these metrics. Prometheus-compatible systems provide three functions to find the rate of change of counters:

rate
increase
irate

To use any of the above functions, Prometheus needs at least 2 data points; otherwise, it will return the single point itself. The recommended duration that specifies a sufficient number of data points to be received is four times the scrape interval so that we have at least >= 2 data points. rate is per second average of metric value over the duration specified (for the example, it is [4m], Prometheus finds this value by finding the slope from the start value to the end value for the duration. As Last9 is Prometheus compatible TDSB, same duration is used in the Apdex score calculation.

Getting Started

Follow the quickstart guide for nodejs applications to know more.

What is OpenAPM​

Instrumentation​

Advanced Capabilities​

Multi-Tenancy support​

Default labels​

Track application lifecycle events​

Querying​

Visualization​

Steps to import default dashboard​

Alerting​

Apdex score calculation​

Why is the Apdex score a better measure of application performance?​

How to find an Apdex score for an HTTP application?​

PromQL to calculate Apdex score​

Explaining Functions used by the PromQL​

label_value​

and operator​

topk​

Understaing reason behind [4m]​

Getting Started​