
Declarative Alerting via IaC

Levitate supports configuring alerts and notifications automatically through a Python-based SDK tool that keeps them in step with infrastructure changes.

Alerting and notification configurations for observability at scale are hard to set up, maintain, and fix by hand, much like provisioning infrastructure at scale. As infrastructure changes, the observability stack has to keep up; otherwise, gaps in observability can let issues, including black swan events, go unnoticed. Last9 has introduced the l9iac tool to solve exactly this problem.

Installation

Start by installing the IaC (Infrastructure as Code) tool, which automates creating entities and configuring alerts. The binary can be obtained by signing up for Levitate and contacting Last9 customer support at cs@last9.io.

./install-iac.sh

It is highly recommended to install IaC inside a virtual environment, which isolates the tool and its dependencies from the rest of the system. Instructions on how to set up a virtual environment can be found here.

TL;DR

cd <your workspace>
python -m venv env # this will create a ./env dir
source ./env/bin/activate
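
Before running the installer, it can be worth confirming that the virtual environment is actually active. The following is a minimal sketch using only the Python standard library; the function name in_virtualenv is illustrative and is not part of l9iac.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    if in_virtualenv():
        print(f"virtual environment active: {sys.prefix}")
    else:
        print("no virtual environment active; run `source ./env/bin/activate` first")
```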

Quick Start

  1. Create a YAML as per your alert rule configuration

    Example: notification_service_am.yaml

    # notification_service_am.yaml
    entities:
      - name: Notification Backend Alert Manager
        type: service_alert_manager
        data_source: prod-cluster
        entity_class: alert-manager
        external_ref: unique-slug-identifier
        indicators:
          - name: availability
            query: count(sum by (job, taskid)(up{job !~ "ome.*"}) > 0) / count(sum by (job, taskid) (up{job=~".*vmagent.*", job !~ "ome.*"})) * 100
          - name: loss_of_signal
            query: 'absent(up{job !~ "ome.*"})'
        alert_rules:
          - name: Availability of notification service should not be less than 99.5%
            description: The error rate (5xx / total requests) is what defines the availability, lower value means more degradation
            indicator: availability
            less_than: 99.5
            severity: breach
            bad_minutes: 3
            total_minutes: 5
            group_timeseries_notifications: false
            annotations:
              team: payments
              description: Error Rate described as number of 5xx/throughput
              runbook: https://notion.com/runbooks/payments/error_rates_fixing_strategies
  2. Prepare the configuration file for running the IaC tool

    The configuration file has the following structure. It is a JSON file.

    {
      "api_config": {
        "read": {
          "refresh_token": "<LAST9_API_READ_REFRESH_TOKEN>",
          "api_base_url": "https://app.last9.io/api/v4",
          "org": "<ORG_SLUG>"
        },
        "write": {
          "refresh_token": "<LAST9_API_WRITE_REFRESH_TOKEN>",
          "api_base_url": "https://app.last9.io/api/v4",
          "org": "<ORG_SLUG>"
        },
        "delete": {
          "refresh_token": "<LAST9_API_DELETE_REFRESH_TOKEN>",
          "api_base_url": "https://app.last9.io/api/v4",
          "org": "<ORG_SLUG>"
        }
      },
      "state_lock_file_path": "state.lock"
    }
    • The refresh_token can be obtained from the API Access page of the Last9 dashboard. You need refresh tokens for all three operations (read, write, and delete), because the l9iac tool performs all three while applying the alert rules.
    • The <ORG_SLUG> is your organization's unique slug in Last9. It can be obtained from the API Access page of the Last9 dashboard.
    • The default api_base_url is https://app.last9.io/api/v4. If you are on an on-premise setup of Last9, contact cs@last9.io to get the api_base_url.
    • The state_lock_file_path is the name of the file where l9iac stores the state lock for the current alerting state (along the same lines as Terraform's state.lock).
  3. Run the following command to do a dry run for the changes

    l9iac -mf notification_service_am.yaml -c config.json plan
  4. Run the following command to apply the changes

    l9iac -mf notification_service_am.yaml -c config.json apply
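
To keep refresh tokens out of version control, the configuration file from step 2 could be generated at run time instead of being committed. The sketch below assumes the three tokens are exposed as environment variables named after the placeholders above; those variable names, and the helpers build_config and write_config, are illustrative and not part of l9iac.

```python
import json

DEFAULT_API_BASE_URL = "https://app.last9.io/api/v4"

# Hypothetical environment variable names, one refresh token per operation.
TOKEN_VARS = {
    "read": "LAST9_API_READ_REFRESH_TOKEN",
    "write": "LAST9_API_WRITE_REFRESH_TOKEN",
    "delete": "LAST9_API_DELETE_REFRESH_TOKEN",
}


def build_config(env, org, api_base_url=DEFAULT_API_BASE_URL,
                 state_lock_file_path="state.lock"):
    """Build the config structure shown in step 2 from a mapping of variables."""
    missing = [var for var in TOKEN_VARS.values() if not env.get(var)]
    if missing:
        raise ValueError(f"missing refresh tokens: {', '.join(missing)}")
    api_config = {
        operation: {
            "refresh_token": env[var],
            "api_base_url": api_base_url,
            "org": org,
        }
        for operation, var in TOKEN_VARS.items()
    }
    return {"api_config": api_config, "state_lock_file_path": state_lock_file_path}


def write_config(path, config):
    """Write the config dict as JSON to the given path."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=2)
```

Calling write_config("config.json", build_config(os.environ, org="<ORG_SLUG>")) would then produce a file in the shape expected by the -c flag.
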
Tip

We can provision a GitOps flow that runs the apply command once changes are merged to the master branch of your GitHub repo. Contact cs@last9.io for more details.
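
For teams that script this themselves, a CI step could run the dry run first and apply only when it succeeds. This is a rough sketch, not the GitOps flow Last9 provisions: it reuses only the flags shown in steps 3 and 4, and it assumes l9iac exits with a non-zero status when plan fails. The helper names are hypothetical.

```python
import subprocess


def l9iac_command(action, model_file, config_file="config.json"):
    """Return the argv list for an l9iac run, using the flags from the steps above."""
    if action not in ("plan", "apply"):
        raise ValueError(f"unsupported action: {action}")
    return ["l9iac", "-mf", model_file, "-c", config_file, action]


def plan_then_apply(model_file, config_file="config.json", run=subprocess.run):
    """Run `plan` first, then `apply` only if `plan` completed successfully."""
    for action in ("plan", "apply"):
        # check=True raises CalledProcessError on a non-zero exit status,
        # which stops the loop before `apply` when `plan` fails.
        run(l9iac_command(action, model_file, config_file), check=True)
```

The run parameter exists so the sequence can be exercised in tests without invoking the real binary.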

Schema

Here is the complete schema for generating the above .yaml file:

Entities

Field | Type | Unique | Required | Description
name | string | false | required | Name of the entity (alert manager)
type | string | false | required | Type of the entity
external_ref | string | true | required | External reference for the entity; a unique slug-format identifier for each alert manager
adhoc_filter | object | false | optional | List of common rule filters for the entity
alert_rules | array | false | optional | List of alert rules for the entity
data_source | string | false | optional | Data source
data_source_id | string | false | optional | The ID of the data source
description | string | false | optional | Description of the entity
entity_class | string | false | optional | Denotes the class of the entity. Supported values: alert-manager
indicators | array | false | optional | List of indicators for the entity
labels | object | false | optional | List of key value pairs of group label names and values
links | array | false | optional | List of links associated with the entity
namespace | string | false | optional | The namespace of the entity
notification_channels | string OR array | false | optional | List of notification channels applicable to the entity
tags | array | false | optional | List of tags for the entity
team | string | false | optional | The team that owns the entity
tier | string | false | optional | Tier of the entity
ui_readonly | boolean | false | optional | Disable any sort of edits to the alert group from the UI
workspace | string | false | optional | Workspace of the entity
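
The required and unique constraints in the entity table can be checked locally before running plan. The sketch below works on entities that have already been parsed into Python dicts (for example, from the YAML file); it is an illustrative pre-check and not how l9iac itself validates input.

```python
REQUIRED_ENTITY_FIELDS = ("name", "type", "external_ref")


def validate_entities(entities):
    """Return a list of human-readable problems found in a list of entity dicts."""
    problems = []
    seen_refs = set()
    for index, entity in enumerate(entities):
        label = entity.get("name") or f"entity #{index}"
        for field in REQUIRED_ENTITY_FIELDS:
            if not entity.get(field):
                problems.append(f"{label}: missing required field '{field}'")
        ref = entity.get("external_ref")
        if ref:
            # external_ref is the only entity field marked unique in the table.
            if ref in seen_refs:
                problems.append(f"{label}: duplicate external_ref '{ref}'")
            seen_refs.add(ref)
    return problems
```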

Common Rule Filters (Adhoc Filters)

Field | Type | Unique | Required | Description
labels | object | false | required | List of key value pairs of label names and values
data_source | string | false | required | Defaults to entity's data source

Alert Rules

Field | Type | Unique | Required | Description
name | string | true | required | Rule name that describes the alert
indicator | string | false | required | Name of the indicator
bad_minutes | integer | false | required | Number of minutes the indicator must be in a bad state before alerting
total_minutes | integer | false | required | Total number of minutes the indicator is sampled over
description | string | true | optional | Description for an alert rule that is included in the alert payload
expression | string | false | optional | Alert rule expression, to be used only for pattern-based alerts
greater_than | number | false | optional | Alert triggers when the indicator value is greater than this
group_timeseries_notifications | boolean | false | optional | Whether multiple impacted time series in an alert are grouped into one notification
is_disabled | boolean | false | optional | Whether the alert is disabled or not
label_filter | map/object | false | optional | Mapping of the variables present in the indicator query and their pattern for the alert rule
less_than | number | false | optional | Alert triggers when the indicator value is less than this
mute | boolean | false | optional | Whether alert notifications are muted
runbook | (see Runbook) | false | optional | Runbook link to be included in the alert payload
severity | string | false | optional | Either threat or breach
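
The interplay of bad_minutes, total_minutes, and the thresholds can be illustrated with a small evaluation function. This sketch assumes one sample per minute and one plausible reading of the table: a rule alerts when at least bad_minutes of the most recent total_minutes samples breach less_than or greater_than. The actual evaluation inside Levitate may differ, and rule_fires is a hypothetical name.

```python
def rule_fires(samples, bad_minutes, total_minutes, less_than=None, greater_than=None):
    """Decide whether a threshold rule would alert on per-minute samples.

    Assumed semantics: alert when at least `bad_minutes` of the most recent
    `total_minutes` samples violate the configured threshold.
    """
    if bad_minutes > total_minutes:
        raise ValueError("bad_minutes cannot exceed total_minutes")
    window = samples[-total_minutes:]

    def is_bad(value):
        if less_than is not None and value < less_than:
            return True
        if greater_than is not None and value > greater_than:
            return True
        return False

    bad_count = sum(1 for value in window if is_bad(value))
    return bad_count >= bad_minutes
```

With the Quick Start values (less_than: 99.5, bad_minutes: 3, total_minutes: 5), the samples [100, 99.9, 99.0, 98.0, 97.0] contain three readings below 99.5 in the five-minute window, so the rule would fire under this reading.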

Runbook

Field | Type | Unique | Required | Description
link | string | false | required | Runbook link to be included in the alert payload

Indicators

Field | Type | Unique | Required | Description
name | string | true (uniqueness enforced at entity level) | required | Name of the indicator
query | string | false | required | PromQL query for the indicator
data_source | string | false | optional | Data Source of the indicator (Levitate)
description | string | false | optional | Description of the indicator
unit | string | false | optional | Unit of the indicator
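
Since indicator names must be unique within an entity and each alert rule names the indicator it watches, a quick consistency check can catch typos before plan is run. The following sketch operates on one parsed entity dict; the function name is illustrative and this is not l9iac's own validation.

```python
from collections import Counter


def check_indicator_references(entity):
    """Report duplicate indicator names and rules that point at unknown indicators."""
    problems = []
    names = [ind.get("name") for ind in entity.get("indicators", []) if ind.get("name")]
    for name, count in sorted(Counter(names).items()):
        if count > 1:
            problems.append(f"duplicate indicator name '{name}'")
    known = set(names)
    for rule in entity.get("alert_rules", []):
        rule_name = rule.get("name")
        indicator = rule.get("indicator")
        if indicator not in known:
            problems.append(
                f"rule '{rule_name}' references unknown indicator '{indicator}'"
            )
    return problems
```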

Links

Field | Type | Unique | Required | Description
name | string | false | required | Display name of the link
url | string | false | required | URL of the link

Notification Channels

Field | Type | Unique | Required | Description
name | string | false | required | Name of the notification channel
type | string | false | required | Type of notification channel. Allowed values: Slack, Pagerduty, OpsGenie
mention | string OR list (string) | false | optional | Only applicable to Slack. The user(s) to tag in the alert message
severity | string | false | optional | Severity of the alerts sent through this channel. Allowed values: threat, breach
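
The allowed values in the notification channel table lend themselves to a simple pre-check. The sketch below assumes the values are matched exactly as spelled in the table (case-sensitive), which is an assumption rather than documented behavior; validate_channel is a hypothetical helper.

```python
ALLOWED_TYPES = {"Slack", "Pagerduty", "OpsGenie"}
ALLOWED_SEVERITIES = {"threat", "breach"}


def validate_channel(channel):
    """Return a list of problems for one notification channel dict."""
    problems = []
    for field in ("name", "type"):
        if not channel.get(field):
            problems.append(f"missing required field '{field}'")
    channel_type = channel.get("type")
    if channel_type and channel_type not in ALLOWED_TYPES:
        problems.append(f"unsupported type '{channel_type}'")
    # Per the table, mention is only applicable to Slack channels.
    if "mention" in channel and channel_type != "Slack":
        problems.append("'mention' only applies to Slack channels")
    severity = channel.get("severity")
    if severity is not None and severity not in ALLOWED_SEVERITIES:
        problems.append(f"unsupported severity '{severity}'")
    return problems
```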

Supported Macros by IaC

  • low_spike(tolerance, metric)
  • high_spike(tolerance, metric)
  • decreasing_changepoint(tolerance, metric)
  • increasing_changepoint(tolerance, metric)
  • increasing_trend(tolerance, metric)
  • decreasing_trend(tolerance, metric)
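
When alert rules are generated programmatically, a helper can guard against misspelled macro names. The sketch below assumes that an expression is written as the macro name followed by its two arguments in parentheses, mirroring the signatures listed above; the exact syntax and the valid range of tolerance are not specified here, so treat the output format as an assumption. The name macro_expression is hypothetical.

```python
SUPPORTED_MACROS = (
    "low_spike",
    "high_spike",
    "decreasing_changepoint",
    "increasing_changepoint",
    "increasing_trend",
    "decreasing_trend",
)


def macro_expression(macro, tolerance, metric):
    """Format a macro call string after checking the macro name is supported."""
    if macro not in SUPPORTED_MACROS:
        raise ValueError(f"unsupported macro: {macro}")
    # Assumed format: name(tolerance, metric), following the list above.
    return f"{macro}({tolerance}, {metric})"
```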