
Declarative Alerting via IaC

Levitate supports configuring alerts and notifications declaratively using a Python-based SDK tool that keeps pace with infrastructure changes.

Alerting and notification configurations for observability at scale are hard to set up, maintain, and fix manually, just like provisioning infrastructure at scale. When infrastructure changes, the observability stack must keep up with it to avoid issues caused by gaps in observability, or black swan events going unnoticed. Last9 has introduced the l9iac tool to solve exactly this problem.

Installation

First, install the IaC (Infrastructure as Code) tool, l9iac. It lets developers automate creating entities and configuring alerts. The binary can be obtained by signing up for Levitate and contacting Last9 customer support at cs@last9.io.

./install-iac.sh

It is highly recommended to install the IaC tool inside a virtual environment, which isolates it from the rest of the system and makes testing and development easier. Instructions on how to set up a virtual environment can be found here.

TL;DR:

cd <your workspace>
python -m venv env # this will create a ./env dir
source ./env/bin/activate

Quick Start

  1. Create a YAML file as per your alert rule configuration.

    Example: notification_service_am.yaml

# notification_service_am.yaml
entities:
  - name: Notification Backend Alert Manager
    type: service_alert_manager
    data_source: prod-cluster
    entity_class: alert-manager
    external_ref: unique-slug-identifier
    indicators:
      - name: availability
        query: >-
          count(sum by (job, taskid)(up{job !~ "ome.*"}) > 0) / count(sum by
          (job, taskid) (up{job=~".*vmagent.*", job !~ "ome.*"})) * 100
      - name: loss_of_signal
        query: 'absent(up{job !~ "ome.*"})'
    alert_rules:
      - name: Availability of notification service should not be less than 99.5%
        description: >-
          The error rate (5xx / total requests) is what defines the
          availability; a lower value means more degradation
        indicator: availability
        less_than: 99.5
        severity: breach
        bad_minutes: 3
        total_minutes: 5
        group_timeseries_notifications: false
        annotations:
          team: payments
          description: Error rate, defined as the number of 5xx responses / throughput
          runbook: https://notion.com/runbooks/payments/error_rates_fixing_strategies
  2. Prepare the configuration file for running the IaC tool.

The configuration file is a JSON file with the following structure.

{
  "api_config": {
    "read": {
      "refresh_token": "<LAST9_API_READ_REFRESH_TOKEN>",
      "api_base_url": "https://app.last9.io/api/v4",
      "org": "<ORG_SLUG>"
    },
    "write": {
      "refresh_token": "<LAST9_API_WRITE_REFRESH_TOKEN>",
      "api_base_url": "https://app.last9.io/api/v4",
      "org": "<ORG_SLUG>"
    },
    "delete": {
      "refresh_token": "<LAST9_API_DELETE_REFRESH_TOKEN>",
      "api_base_url": "https://app.last9.io/api/v4",
      "org": "<ORG_SLUG>"
    }
  },
  "state_lock_file_path": "state.lock"
}
  • The refresh_token can be obtained from the API Access page of the Last9 dashboard. You need refresh tokens for all three operations (read, write, and delete), as the l9iac tool performs all three actions while applying the alert rules.
  • The <ORG_SLUG> is your organization's unique slug in Last9. It can also be found on the API Access page of the Last9 dashboard.
  • The default api_base_url is https://app.last9.io/api/v4. If you are on an on-premise setup of Last9, contact cs@last9.io to get the correct api_base_url.
  • The state_lock_file_path is the name of the file where l9iac stores the state lock of the current alerting state (along the same lines as Terraform's state file).
  3. Run the following command to do a dry run of the changes.
l9iac -mf notification_service_am.yaml -c config.json plan
  4. Run the following command to apply the changes.
l9iac -mf notification_service_am.yaml -c config.json apply
Tip

We can provision a GitOps flow that runs the apply command once changes are merged to the master branch in your GitHub repo. Contact cs@last9.io for more details.
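As an illustration, such a GitOps flow could be wired up with a CI workflow along these lines. This is only a sketch: the workflow file name, secret name, and binary location are assumptions, not part of the official setup.

```yaml
# .github/workflows/l9iac-apply.yml (hypothetical layout)
name: Apply alerting config
on:
  push:
    branches: [master]
jobs:
  apply:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Render config.json from a CI secret so refresh tokens
      # never live in the repository (secret name is illustrative)
      - name: Write config
        run: echo "$L9IAC_CONFIG" > config.json
        env:
          L9IAC_CONFIG: ${{ secrets.L9IAC_CONFIG }}
      - name: Apply alert rules
        run: ./l9iac -mf notification_service_am.yaml -c config.json apply
```

Keeping the apply step in CI means the state lock file and the merged YAML are always applied together, so the dashboard never drifts from the repository.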

Schema

Here is the complete schema for generating the above .yaml file.

Entities

An entity here can be treated as an individual alert manager.

| Field | Type | Unique | Required | Description |
| --- | --- | --- | --- | --- |
| name | string | false | true | Name of the entity (alert manager) |
| external_ref | string | true | true | External reference for the entity; a unique slug-format identifier for each alert manager |
| type | string | false | true | Type of the entity |
| entity_class | string | false | optional | Denotes the class of the entity. Supported values: alert-manager |
| description | string | false | optional | Description of the entity |
| data_source | string | false | optional | Data source |
| data_source_id | string | false | optional | The ID of the data source |
| team | string | false | optional | The team that owns the entity |
| tier | string | false | optional | Tier of the entity |
| workspace | string | false | optional | Workspace of the entity |
| namespace | string | false | optional | The namespace of the entity |
| tags | array | false | optional | List of tags for the entity |
| indicators | array | false | optional | List of indicators for the entity |
| alert_rules | array | false | optional | List of alert rules for the entity |
| notification_channels | string OR array | false | optional | List of notification channels applicable to the entity |
| links | array | false | optional | List of links associated with the entity |
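Putting the entity schema above together, a fuller entity definition might look like the following. The names, tags, and channel are illustrative assumptions, not values from the product.

```yaml
entities:
  - name: Checkout Service Alert Manager   # required
    type: service_alert_manager            # required
    external_ref: checkout-svc-am          # required, unique slug
    entity_class: alert-manager
    description: Alerting for the checkout service
    data_source: prod-cluster
    team: payments                         # illustrative
    namespace: checkout
    tags: [payments, critical]
    notification_channels:
      - slack-payments-alerts              # hypothetical channel name
```

Only name, type, and external_ref are required; everything else refines routing and ownership.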

Indicators

| Field | Type | Unique | Required | Description |
| --- | --- | --- | --- | --- |
| name | string | true (uniqueness enforced at the entity level) | required | Name of the indicator |
| query | string | false | required | The PromQL query for the indicator |
| unit | string | false | optional | Unit of the indicator |
| data_source | string | false | optional | Data source of the indicator (Levitate) |
| description | string | false | optional | Description of the indicator |

Alert Rules

| Field | Type | Unique | Required | Description |
| --- | --- | --- | --- | --- |
| name | string | true | required | Rule name that describes the alert |
| description | string | true | optional | Description for an alert rule; included in the alert payload |
| indicator | string | false | required | Name of the indicator |
| greater_than | number | false | optional | Alert triggers when the indicator value is greater than this |
| less_than | number | false | optional | Alert triggers when the indicator value is less than this |
| bad_minutes | integer | false | required | Number of minutes the indicator must be in a bad state before alerting |
| total_minutes | integer | false | required | Total number of minutes the indicator is sampled over |
| is_disabled | boolean | false | optional | Whether the alert rule is disabled |
| runbook | object | false | optional | Runbook link that will be included as part of the alert payload |
| severity | string | false | optional | Severity of the alert. Allowed values: threat, breach |
| label_filter | map/object | false | optional | A mapping of the variables present in the indicator query and their pattern for this alert rule |
| expression | string | false | optional | The alert rule expression. To be used only for pattern-based alerts |
| group_timeseries_notifications | boolean | false | optional | Whether multiple affected time series in an alert are grouped into one notification |
| mute | boolean | false | optional | Whether alert notifications are muted |
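The label_filter field above can be read as a per-rule narrowing of the indicator query. A minimal sketch, assuming an indicator whose query carries a job label; the rule name and pattern are illustrative:

```yaml
alert_rules:
  - name: Availability low for vmagent jobs only
    indicator: availability
    less_than: 99.5
    severity: threat
    bad_minutes: 3
    total_minutes: 5
    # label_filter maps a label in the indicator query to the
    # pattern this particular rule should match (illustrative)
    label_filter:
      job: ".*vmagent.*"
```

This lets several rules with different thresholds or severities share one indicator instead of duplicating the PromQL.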

Runbook

| Field | Type | Unique | Required | Description |
| --- | --- | --- | --- | --- |
| link | string | false | required | Runbook link that is included as part of the alert payload |

Notification Channels

| Field | Type | Unique | Required | Description |
| --- | --- | --- | --- | --- |
| name | string | false | required | Name of the notification channel |
| type | string | false | required | The type of notification channel. Allowed values: Slack, Pagerduty, OpsGenie |
| severity | string | false | optional | The severity of the alerts sent through this channel. Allowed values: threat, breach |
| mention | string OR list(string) | false | optional | Only applicable to Slack. The user or list of users to tag in the alert message |
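Combining the fields above, a notification channel entry might look like this sketch; the channel name and Slack handle are illustrative assumptions:

```yaml
notification_channels:
  - name: payments-oncall        # illustrative name
    type: Slack
    severity: breach             # route only breach alerts here
    mention:
      - "@payments-oncall"       # illustrative Slack handle
```

Setting severity on the channel filters which alerts it receives, so a noisy threat-level channel can be kept separate from the paging one.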
Links

| Field | Type | Unique | Required | Description |
| --- | --- | --- | --- | --- |
| name | string | false | required | Display name of the link |
| url | string | false | required | The actual URL of the link |