OpenAPM

OpenAPM is an Application Performance Monitoring toolkit based on Prometheus and Grafana.

What is OpenAPM

The OpenAPM project was created to make it easier to monitor applications using open standards such as Prometheus and Grafana. It consists of auto-instrumentation libraries for popular languages, standard Grafana dashboards for visualization, and alerting rules.

Instrumentation

OpenAPM supports auto-instrumentation of Node.js, Ruby, Golang and Python applications.

Once the OpenAPM instrumentation package is set up in your application, you can scrape the metrics via the Prometheus agent and push them to Prometheus or to a Prometheus-compatible long-term storage like Last9.

info

The metrics can be sent to Prometheus using the Prometheus Remote Write protocol.

Advanced Capabilities

In addition to instrumenting RED metrics, the OpenAPM instrumentation libraries support the following capabilities.

Multi-Tenancy support

  • Extract tenant information from the URL and emit as a label
  • Extract additional labels and metadata using regular expressions
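To make the idea of extracting a tenant from the URL concrete, here is a minimal Python sketch. It does not use the OpenAPM API; the URL layout `/tenants/<tenant>/...`, the pattern, and the `tenant_label` helper are all hypothetical and only illustrate how a regular expression can turn part of a path into a label value.

```python
import re

# Assumption: tenants appear as the second path segment, e.g. /tenants/acme/orders.
# This pattern is illustrative and not taken from OpenAPM.
TENANT_PATTERN = re.compile(r"^/tenants/(?P<tenant>[^/]+)")


def tenant_label(path: str, default: str = "unknown") -> dict:
    """Return a label set holding the tenant extracted from a request path."""
    match = TENANT_PATTERN.match(path)
    return {"tenant": match.group("tenant") if match else default}


print(tenant_label("/tenants/acme/orders/42"))
print(tenant_label("/health"))
```

Falling back to a fixed default keeps the label present on every series, which avoids mixing labelled and unlabelled series for the same metric.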

Default labels

  • Add any number of constant default labels in addition to the labels on the request metrics

Track application lifecycle events

  • Support for tracking change events such as application_started.

Querying

OpenAPM data is stored in Prometheus-compatible systems, so you can use PromQL to query the data.

Visualization

OpenAPM ships with ready-to-import Grafana dashboards that show key performance indicators for your applications, such as Apdex score, slow endpoints, and slow database queries, among many others.

Steps to import default dashboard

  1. Import this dashboard into Grafana
  2. Set the data source to the one that receives the metrics from OpenAPM
  3. Save the dashboard

[Figures: APM Dashboard - RED Metrics; APM Dashboard - DB Metrics; APM Dashboard - Infra Metrics]

Alerting

OpenAPM comes with standard alert definitions that you can use to monitor your applications.

Apdex score calculation

Apdex stands for Application Performance Index. It is a measure of application performance with respect to user satisfaction.

Why is the Apdex score a better measure of application performance?

The definition of application performance depends on who you ask. For an engineer, it is response time; for an SRE, it is uptime; for a product manager, it is user retention. Each role defines performance according to its own priorities, so we need a uniform way to measure application performance across multiple services, applications, and teams.

Apdex solves that problem by reducing user experience to a single number. Ultimately, every role is trying to measure user experience, just with different metrics.

To define Apdex in terms of user experiences, we take three different categories of users.

  1. Satisfied: One who enjoys the application experience without hiccups or slowness
  2. Tolerating: One who faces lag or slowness but keeps using the application without complaining
  3. Frustrated: One who abandons the application due to lag or slowness

Based on application response time, the Apdex score measures satisfied, tolerating, and frustrated users.

How to find an Apdex score for an HTTP application?

  • Define a response time threshold for satisfied users. For example, a satisfied user gets a response within 1 second

  • The threshold for tolerating users is 4 times the satisfied user threshold. In the example above, a response time between 1 and 4 seconds counts as tolerating, and anything above 4 seconds counts as frustrated

  • Find the Apdex score by:

    Apdex score = (
    No. of satisfied users +
    (0.5) * No. of tolerating users
    + 0 * No. of frustrated users
    ) / Total users

For any application, the number of users can be quantified by the number of requests, which is throughput.
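The classification described above can be sketched in a few lines of Python. This is only an illustration of the thresholds, not OpenAPM code; the function names are made up for the example, and it assumes that a response time exactly equal to a threshold falls into the better category.

```python
from collections import Counter


def classify(duration_s: float, threshold_s: float) -> str:
    """Map one request's response time to an Apdex category."""
    if duration_s <= threshold_s:
        return "satisfied"
    if duration_s <= 4 * threshold_s:
        return "tolerating"
    return "frustrated"


def category_counts(durations_s, threshold_s: float) -> Counter:
    """Count requests per category, treating each request as one user."""
    return Counter(classify(d, threshold_s) for d in durations_s)


# Four requests against a 1-second satisfied threshold.
print(category_counts([0.2, 0.8, 2.0, 5.0], threshold_s=1.0))
```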

Assumptions:

  • Application received 1000 requests in total
  • The satisfied user threshold is 1 second
  • No. of requests finished within 1 second = 600
  • No. of requests finished within 1 to 4 seconds = 200
  • No. of requests finished greater than 4 seconds = 200

The Apdex score will be:

  • Satisfied users: (600 * 1 = 600)

  • Tolerating users: (200 * 0.5 = 100)

  • Frustrated users: (200 * 0 = 0)

    Apdex score = (600 + 100 + 0 ) / 1000 = 0.7
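The same arithmetic can be checked with a short Python sketch. The helper name `apdex_from_counts` is hypothetical; the weights (1, 0.5, 0) and the request counts come from the example above.

```python
def apdex_from_counts(satisfied: int, tolerating: int, frustrated: int) -> float:
    """Apdex = (satisfied + 0.5 * tolerating + 0 * frustrated) / total."""
    total = satisfied + tolerating + frustrated
    if total == 0:
        raise ValueError("Apdex is undefined when there are no requests")
    return (satisfied * 1.0 + tolerating * 0.5 + frustrated * 0.0) / total


score = apdex_from_counts(satisfied=600, tolerating=200, frustrated=200)
print(f"{score:.2f}")  # prints 0.70
```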

PromQL to calculate Apdex score

To calculate the Apdex score, OpenAPM uses the following metrics:

  • http_requests_duration_milliseconds_bucket which denotes the latency of the application

  • http_requests_duration_milliseconds_count which denotes the throughput of the application

  • OpenAPM uses P50 as the threshold for satisfied users. This is a dynamic threshold that measures user experience in real time

    histogram_quantile(0.50, sum by (le) (rate(http_requests_duration_milliseconds_bucket{}[4m])*60))
  • Satisfied Users as satisfied_users

    sum(
    topk(1, sum by (le) (rate(http_requests_duration_milliseconds_bucket{}[4m]) * 60)
    and
    label_value(sum by (le)(http_requests_duration_milliseconds_bucket{}), "le") < threshold))
  • Tolerating Users as tolerating_users

    sum(
    topk(1, sum by (le) (rate(http_requests_duration_milliseconds_bucket{}[4m]) * 60)
    and
    label_value(sum by (le)(http_requests_duration_milliseconds_bucket{}), "le") < 4 * threshold))

    -

    sum(
    topk(1, sum by (le) (rate(http_requests_duration_milliseconds_bucket{}[4m]) * 60)
    and
    label_value(sum by (le)(http_requests_duration_milliseconds_bucket{}), "le") < threshold))

  • Total Users as total_users which considers total requests

    sum(rate(http_requests_duration_milliseconds_count{}[4m]) * 60)
  • Final PromQL

    Apdex score = ( satisfied_users + ( tolerating_users / 2 ) ) / total_users

    The final PromQL to calculate the Apdex score is as follows.

    ((sum(topk(1,sum by (le)(rate(http_requests_duration_milliseconds_bucket{program=~"$program", version=~"$version", environment=~"$environment"}[4m])*60) and label_value(sum by (le)(http_requests_duration_milliseconds_bucket{program=~"$program", version=~"$version", environment=~"$environment"}), "le") <  histogram_quantile(0.50, sum by (le)(rate(http_requests_duration_milliseconds_bucket{program=~"$program", version=~"$version", environment=~"$environment"}[4m])*60)))) +

    (sum(topk(1,sum by (le)(rate(http_requests_duration_milliseconds_bucket{program=~"$program", version=~"$version", environment=~"$environment"}[4m])*60) and label_value(sum by (le)(http_requests_duration_milliseconds_bucket{program=~"$program", version=~"$version", environment=~"$environment"}), "le") < histogram_quantile(0.90, sum by (le)(rate(http_requests_duration_milliseconds_bucket{program=~"$program", version=~"$version", environment=~"$environment"}[4m])*60)))) - sum(topk(1,sum by (le)(rate(http_requests_duration_milliseconds_bucket{program=~"$program", version=~"$version", environment=~"$environment"}[4m])*60) and label_value(sum by (le)(http_requests_duration_milliseconds_bucket{program=~"$program", version=~"$version", environment=~"$environment"}), "le") < histogram_quantile(0.50, sum by (le)(rate(http_requests_duration_milliseconds_bucket{program=~"$program", version=~"$version", environment=~"$environment"}[4m])*60)))))/2)
    /
    sum (rate(http_requests_duration_milliseconds_count{program=~"$program", version=~"$version", environment=~"$environment"}[4m])*60))
tip

You don't need to remember this complex PromQL. Last9 clusters are equipped with PromQL macros, so you only have to use:

apdex_score(0.5, "<service_name>", http_requests_duration_milliseconds_bucket, http_requests_duration_milliseconds_count)

Explaining Functions used by the PromQL

label_value

label_value(sum by (le)(http_requests_duration_milliseconds_bucket{}), "le") <  threshold

label_value returns the value of a label from the series produced by a query. The first argument is the query, and the second argument names the label whose value you are interested in. Here, threshold is a latency (P50), so it must be in the same unit as the le buckets (milliseconds for this metric). label_value returns the le value of each time bucket, and the comparison keeps only the buckets whose value is less than threshold.
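The threshold that le is compared against is the P50 latency, computed with histogram_quantile. As a rough illustration of how a quantile can be estimated from cumulative buckets, the Python sketch below interpolates linearly inside the bucket that contains the target rank. It is a simplification written for this page, not the Prometheus implementation; it assumes 0 < q <= 1, buckets sorted by le, a final +Inf bucket, and a lower bound of 0 for the first bucket. The bucket values are invented.

```python
import math


def estimate_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from (le, cumulative_count) pairs sorted by le.

    Assumes 0 < q <= 1 and that the last bucket has le = +Inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                # The rank falls in the +Inf bucket: fall back to the
                # highest finite upper bound.
                return buckets[-2][0]
            # Linear interpolation between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_le + (le - prev_le) * fraction
        prev_le, prev_count = le, count
    return math.nan


# Invented cumulative bucket counts, with le in milliseconds.
example = [(100.0, 300.0), (250.0, 600.0), (500.0, 800.0), (math.inf, 1000.0)]
print(estimate_quantile(0.50, example))  # prints 200.0
```

Because the result is interpolated from bucket boundaries, its precision depends on how finely the le buckets are spaced around the quantile.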

and operator

sum by (le) (rate(http_requests_duration_milliseconds_bucket{}[4m]) * 60)
and
label_value(sum by (le)(http_requests_duration_milliseconds_bucket{}), "le")

The and operator returns the intersection of the first and second queries. Here, it keeps only those time series from the first query that have a matching time series in the output of the second query.

topk

topk(1, sum by (le) (rate(http_requests_duration_milliseconds_bucket{}[4m]) * 60)
and
label_value(sum by (le)(http_requests_duration_milliseconds_bucket{}), "le") < threshold)

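topk returns the k series with the highest values from a query. Because histogram buckets are cumulative, topk(1, ...) here selects the largest bucket below threshold, which counts all requests that completed within that bucket's le.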
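Putting the three pieces together, the Python sketch below imitates what the PromQL does with the bucket series: filter buckets by le (the label_value comparison combined with and), keep the highest remaining value (topk(1, ...)), and combine the results into a score. It is an emulation under stated assumptions, not the query engine's behavior; the function names and bucket values are invented, with le in milliseconds and values as cumulative per-minute request rates.

```python
import math


def requests_below(buckets: dict, threshold: float) -> float:
    """Highest cumulative bucket value among buckets with le < threshold."""
    eligible = [value for le, value in buckets.items() if le < threshold]
    # Buckets are cumulative, so the maximum is the largest qualifying bucket.
    return max(eligible, default=0.0)


def apdex_from_buckets(buckets: dict, total: float, threshold: float) -> float:
    """(satisfied + tolerating / 2) / total, with tolerating up to 4x threshold."""
    satisfied = requests_below(buckets, threshold)
    tolerating = requests_below(buckets, 4 * threshold) - satisfied
    return (satisfied + tolerating / 2) / total


# Invented data: le (ms) -> cumulative requests per minute.
buckets = {100.0: 300.0, 250.0: 600.0, 500.0: 800.0, 1000.0: 900.0, math.inf: 1000.0}
print(apdex_from_buckets(buckets, total=1000.0, threshold=200.0))  # prints 0.55
```

With a threshold of 200 ms, only the 100 ms bucket qualifies as satisfied (300 requests), and the 500 ms bucket is the largest one below 800 ms (800 requests), leaving 500 tolerating requests.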

Understanding the reason behind [4m]

Counter is one of the metric types supported by Prometheus-compatible systems. Counters increase monotonically, so their instantaneous values are rarely useful on their own. To get meaningful information, we have to find the rate of change of these metrics. Prometheus-compatible systems provide three functions to find the rate of change of counters:

  • rate
  • increase
  • irate

Each of these functions needs at least 2 data points within the specified duration; with fewer, it returns no result. The recommended duration is four times the scrape interval, so that at least 2 data points fall inside the window. rate is the per-second average rate of increase of the metric over the specified duration (in this example, [4m]). Prometheus derives it from the slope between the first and last values in that duration. As Last9 is a Prometheus-compatible TSDB, the same duration is used in the Apdex score calculation.
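As a simplified illustration of the slope calculation, the Python sketch below computes a per-second rate from the first and last samples in a window and returns nothing when fewer than 2 samples are available. It is not how Prometheus implements rate, which also handles counter resets and extrapolates to the window boundaries. The function name and the sample values are invented, and the data assumes a 60-second scrape interval.

```python
from typing import Optional


def simple_rate(samples: list) -> Optional[float]:
    """Per-second slope between the first and last (timestamp_s, value) samples.

    Returns None when fewer than 2 samples fall inside the window.
    """
    if len(samples) < 2:
        return None
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    if t_last == t_first:
        return None
    return (v_last - v_first) / (t_last - t_first)


# Counter samples scraped every 60 seconds inside a 4-minute window.
window = [(0, 100), (60, 160), (120, 220), (180, 280)]
per_second = simple_rate(window)
print(per_second)       # prints 1.0
print(per_second * 60)  # prints 60.0, the per-minute value used in the queries
print(simple_rate([(0, 100)]))  # prints None
```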

Getting Started

Follow the quickstart guide for Node.js applications to learn more.