Declarative Alerting via IaC
Levitate supports configuring alerts and notifications automatically using a Python-based SDK tool that takes care of infrastructure changes.
Configurations for alerting and notifications for observability at scale are
hard to start, maintain, and fix manually, just like provisioning infrastructure
at scale. As infrastructure changes, it is important that the observability
stack keeps up with it, to avoid issues caused by a lack of
observability or by black swan events. Last9 has introduced the l9iac
tool to solve exactly this problem.
Installation
Start by installing the IaC (Infrastructure as Code) tool. It lets developers automate creating entities and configuring alerts. The binary can be obtained by signing up for Levitate and contacting Last9 customer support at cs@last9.io.
./install-iac.sh
It is highly recommended that IaC is installed inside a virtual environment, as this provides developers with an isolated space from the rest of the system, allowing them to test and develop their applications more easily. Instructions on how to set up a virtual environment can be found here.
TL;DR:
cd <your workspace>
python -m venv env # this will create a ./env dir
source ./env/bin/activate
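To confirm that the environment is actually active before running the installer, a small check like the one below may help. It is a minimal sketch that uses only the standard library: inside an environment created by `venv`, `sys.prefix` differs from `sys.base_prefix`. The function name `in_virtualenv` is made up for this illustration and is not part of l9iac.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-created environment."""
    # venv points sys.prefix at the environment directory, while
    # sys.base_prefix keeps pointing at the base Python installation.
    return sys.prefix != sys.base_prefix


if not in_virtualenv():
    print("Warning: no virtual environment is active")
```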
Quick Start
- Create a YAML file as per your alert rule configuration
Example: notification_service_am.yaml
# notification_service_am.yaml
entities:
  - name: Notification Backend Alert Manager
    type: service_alert_manager
    data_source: prod-cluster
    entity_class: alert-manager
    external_ref: unique-slug-identifier
    indicators:
      - name: availability
        query: count(sum by (job, taskid)(up{job !~ "ome.*"}) > 0) / count(sum by (job, taskid) (up{job=~".*vmagent.*", job !~ "ome.*"})) * 100
      - name: loss_of_signal
        query: 'absent(up{job !~ "ome.*"})'
    alert_rules:
      - name: Availability of notification service should not be less than 99.5%
        description: The error rate (5xx / total requests) is what defines the availability, lower value means more degradation
        indicator: availability
        less_than: 99.5
        severity: breach
        bad_minutes: 3
        total_minutes: 5
        group_timeseries_notifications: false
        annotations:
          team: payments
          description: Error Rate described as number of 5xx/throughput
          runbook: https://notion.com/runbooks/payments/error_rates_fixing_strategies
- Prepare the configuration file for running the IaC tool
The configuration file has the following structure. It is a JSON file.
{
  "api_config": {
    "read": {
      "refresh_token": "<LAST9_API_READ_REFRESH_TOKEN>",
      "api_base_url": "https://app.last9.io/api/v4",
      "org": "<ORG_SLUG>"
    },
    "write": {
      "refresh_token": "<LAST9_API_WRITE_REFRESH_TOKEN>",
      "api_base_url": "https://app.last9.io/api/v4",
      "org": "<ORG_SLUG>"
    },
    "delete": {
      "refresh_token": "<LAST9_API_DELETE_REFRESH_TOKEN>",
      "api_base_url": "https://app.last9.io/api/v4",
      "org": "<ORG_SLUG>"
    }
  },
  "state_lock_file_path": "state.lock"
}
- The refresh_token can be obtained from the API Access page of the Last9 dashboard. You need refresh_tokens for all three operations (read, write, and delete), because the l9iac tool performs all three while applying the alert rules.
- The <ORG_SLUG> is your organization's unique slug in Last9. It can be obtained from the API Access page of the Last9 dashboard.
- The default api_base_url is https://app.last9.io/api/v4. If you are on an on-premise setup of Last9, contact cs@last9.io to get the api_base_url.
- The state_lock_file_path is the name of the file where l9iac will store the state lock of the current alerting state (along the same lines as Terraform's state lock).
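Since the three refresh tokens are secrets, it may be preferable to generate the configuration file rather than commit it. The following is a minimal sketch, not part of l9iac: the function `build_config` is hypothetical, and reading the tokens from environment variables named after the placeholders above (`LAST9_API_READ_REFRESH_TOKEN` and so on) is an assumption made for illustration.

```python
import json

DEFAULT_API_BASE_URL = "https://app.last9.io/api/v4"


def build_config(env, org, api_base_url=DEFAULT_API_BASE_URL,
                 state_lock_file_path="state.lock"):
    """Build the l9iac configuration structure from a mapping of secrets.

    `env` is any mapping (for example os.environ). The variable names are
    an assumption for this sketch, not something l9iac itself reads.
    """
    api_config = {}
    for operation in ("read", "write", "delete"):
        variable = f"LAST9_API_{operation.upper()}_REFRESH_TOKEN"
        token = env.get(variable)
        if not token:
            raise KeyError(f"missing refresh token: {variable}")
        api_config[operation] = {
            "refresh_token": token,
            "api_base_url": api_base_url,
            "org": org,
        }
    return {
        "api_config": api_config,
        "state_lock_file_path": state_lock_file_path,
    }


def write_config(config, path="config.json"):
    """Serialize the configuration as JSON to `path`."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=2)
```

A typical call would be `write_config(build_config(os.environ, org="<ORG_SLUG>"))` after importing `os`, which fails early if any of the three tokens is missing.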
- Run the following command to do a dry run of the changes
l9iac -mf notification_service_am.yaml -c config.json plan
- Run the following command to apply the changes
l9iac -mf notification_service_am.yaml -c config.json apply
We will provision the GitOps flow that runs the apply command once changes are
merged to the master branch in the GitHub repo. Contact cs@last9.io for more
details.
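For teams that want to script the same sequence themselves, the plan and apply steps can be chained so that apply only runs after a successful dry run. This is a rough sketch under assumptions: the helper names are hypothetical, and it assumes l9iac exits with a non-zero status when a step fails, which the text above does not state. Only the command-line form shown in the Quick Start is used.

```python
import subprocess


def l9iac_command(action, model_file, config_file):
    """Return the argument list for one l9iac invocation."""
    if action not in ("plan", "apply"):
        raise ValueError(f"unsupported action: {action}")
    return ["l9iac", "-mf", model_file, "-c", config_file, action]


def plan_then_apply(model_file, config_file, run=subprocess.run):
    """Run a dry run first and apply only if it succeeds.

    `run` is injectable so the sequence can be exercised without the binary.
    """
    for action in ("plan", "apply"):
        # check=True raises CalledProcessError on a non-zero exit status,
        # so apply is skipped when plan fails (assuming l9iac reports
        # failure through its exit status).
        run(l9iac_command(action, model_file, config_file), check=True)
```

Calling `plan_then_apply("notification_service_am.yaml", "config.json")` from a CI job on the master branch would approximate the flow described above.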
Schema
Here is the complete schema for generating the above .yaml file:
Entities
Field | Type | Unique | Required | Description |
---|---|---|---|---|
name | string | false | required | Name of the entity (alert manager) |
type | string | false | required | Type of the entity |
external_ref | string | true | required | External reference for the entity; a unique slug-format identifier for each alert manager |
adhoc_filter | object | false | optional | List of common rule filters for the entity |
alert_rules | array | false | optional | List of alert rules for the entity |
data_source | string | false | optional | Data source |
data_source_id | string | false | optional | The ID of the data source |
description | string | false | optional | Description of the entity |
entity_class | string | false | optional | Denotes the class of the entity. Supported values: alert-manager |
indicators | array | false | optional | List of indicators for the entity |
labels | object | false | optional | List of key value pairs of group label names and values |
links | array | false | optional | List of links associated with the entity |
namespace | string | false | optional | The namespace of the entity |
notification_channels | string OR array | false | optional | List of notification channels applicable to the entity |
tags | array | false | optional | List of tags for the entity |
team | string | false | optional | The team that owns the entity |
tier | string | false | optional | Tier of the entity |
ui_readonly | boolean | false | optional | Disable any sort of edits to the alert group from the UI |
workspace | string | false | optional | Workspace of the entity |
Common Rule Filters (Adhoc Filters)
Field | Type | Unique | Required | Description |
---|---|---|---|---|
labels | object | false | required | List of key value pairs of label names and values |
data_source | string | false | required | Defaults to entity's data source |
Alert Rules
Field | Type | Unique | Required | Description |
---|---|---|---|---|
name | string | true | required | Rule name that describes the alert |
indicator | string | false | required | Name of the indicator |
bad_minutes | integer | false | required | Number of minutes the indicator must be in a bad state before alerting |
total_minutes | integer | false | required | Total number of minutes the indicator is sampled over |
description | string | true | optional | Description for an alert rule that is included in the alert payload |
expression | string | false | optional | Alert rule expression, to be used only for pattern-based alerts |
greater_than | number | false | optional | Alert triggers when the indicator value is greater than this |
greater_than_eq | number | false | optional | Alert triggers when the indicator value is greater than or equal to this |
less_than | number | false | optional | Alert triggers when the indicator value is less than this |
less_than_eq | number | false | optional | Alert triggers when the indicator value is less than or equal to this |
equal_to | number | false | optional | Alert triggers when the indicator value is equal to this |
not_equal | number | false | optional | Alert triggers when the indicator value is not equal to this |
group_timeseries_notifications | boolean | false | optional | Whether multiple impacted time series in an alert are grouped into one notification |
is_disabled | boolean | false | optional | Whether the alert is disabled or not |
label_filter | map/object | false | optional | Mapping of the variables present in the indicator query and their pattern for the alert rule |
mute | boolean | false | optional | Whether alert notifications are muted |
runbook | object | false | optional | Runbook link to be included in the alert payload |
severity | string | false | optional | Severity of the alert. Allowed values: threat, breach |
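The interplay of the threshold fields with bad_minutes and total_minutes can be illustrated with a small model. This is only one reading of the descriptions above, not Levitate's actual evaluation logic: it assumes one sample per minute and treats a rule as firing when at least bad_minutes of the most recent total_minutes samples violate the threshold. The function names are hypothetical.

```python
import operator

# Threshold fields from the Alert Rules table and the comparison each implies.
COMPARATORS = {
    "greater_than": operator.gt,
    "greater_than_eq": operator.ge,
    "less_than": operator.lt,
    "less_than_eq": operator.le,
    "equal_to": operator.eq,
    "not_equal": operator.ne,
}


def is_bad(value, rule):
    """Return True when `value` violates any threshold set on the rule."""
    return any(
        compare(value, rule[field])
        for field, compare in COMPARATORS.items()
        if field in rule
    )


def should_alert(samples, rule):
    """Assumed semantics: one sample per minute, newest sample last."""
    window = samples[-rule["total_minutes"]:]
    bad_count = sum(1 for value in window if is_bad(value, rule))
    return bad_count >= rule["bad_minutes"]
```

With the Quick Start rule (less_than: 99.5, bad_minutes: 3, total_minutes: 5), a window in which three of the last five availability samples fall below 99.5 would fire under this model, while two would not.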
Runbook
Field | Type | Unique | Required | Description |
---|---|---|---|---|
link | string | false | required | Runbook link to be included in the alert payload |
Indicators
Field | Type | Unique | Required | Description |
---|---|---|---|---|
name | string | true, uniqueness enforced at entity level | required | Name of the indicator |
query | string | false | required | PromQL query for the indicator |
data_source | string | false | optional | Data Source of the indicator (Levitate) |
description | string | false | optional | Description of the indicator |
unit | string | false | optional | Unit of the indicator |
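Before running a plan, the uniqueness and required-field constraints from the Entities and Indicators tables can be checked locally. The sketch below is a hypothetical pre-check on already-parsed entity dictionaries, not l9iac's own validation. The rule that every alert rule's indicator must name an indicator defined on the same entity is an assumption drawn from the example, and pattern-based rules may behave differently.

```python
def validate_entities(entities):
    """Return a list of human-readable problems found in parsed entities."""
    errors = []
    seen_refs = set()
    for entity in entities:
        label = entity.get("name", "<unnamed>")

        # name, type, and external_ref are marked required in the schema.
        for field in ("name", "type", "external_ref"):
            if not entity.get(field):
                errors.append(f"{label}: missing required field '{field}'")

        # external_ref must be unique across all entities.
        ref = entity.get("external_ref")
        if ref in seen_refs:
            errors.append(f"{label}: duplicate external_ref '{ref}'")
        elif ref:
            seen_refs.add(ref)

        # Indicator names must be unique within one entity.
        indicator_names = set()
        for indicator in entity.get("indicators", []):
            name = indicator.get("name")
            if name in indicator_names:
                errors.append(f"{label}: duplicate indicator '{name}'")
            indicator_names.add(name)

        # Assumption: each rule refers to an indicator on the same entity.
        for rule in entity.get("alert_rules", []):
            target = rule.get("indicator")
            if target not in indicator_names:
                errors.append(
                    f"{label}: rule '{rule.get('name')}' references "
                    f"unknown indicator '{target}'"
                )
    return errors
```

An empty list means none of these checks failed; it does not guarantee that l9iac will accept the file.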
Links
Field | Type | Unique | Required | Description |
---|---|---|---|---|
name | string | false | required | Display name of the link |
url | string | false | required | URL of the link |
Notification Channels
Field | Type | Unique | Required | Description |
---|---|---|---|---|
name | string | false | required | Name of the notification channel |
type | string | false | required | Type of notification channel. Allowed values: Slack , Pagerduty , OpsGenie |
mention | string OR list (string) | false | optional | Only applicable to Slack. The user(s) to tag in the alert message |
severity | string | false | optional | Severity of the alerts sent through this channel. Allowed values: threat , breach |
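The allowed values in the table above lend themselves to a quick local check of a notification channel entry. This is a hypothetical helper for illustration only. It assumes the values are matched exactly as spelled in the table (for example `Pagerduty`), which the source does not confirm.

```python
ALLOWED_TYPES = {"Slack", "Pagerduty", "OpsGenie"}
ALLOWED_SEVERITIES = {"threat", "breach"}


def channel_problems(channel):
    """Return a list of problems for one parsed notification channel."""
    problems = []
    for field in ("name", "type"):
        if not channel.get(field):
            problems.append(f"missing required field '{field}'")

    channel_type = channel.get("type")
    if channel_type and channel_type not in ALLOWED_TYPES:
        problems.append(f"unsupported type '{channel_type}'")

    # The table states that mention only applies to Slack channels.
    if "mention" in channel and channel_type != "Slack":
        problems.append("mention is only applicable to Slack channels")

    severity = channel.get("severity")
    if severity is not None and severity not in ALLOWED_SEVERITIES:
        problems.append(f"unsupported severity '{severity}'")
    return problems
```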
Supported Macros by IaC
low_spike (tolerance, metric)
high_spike (tolerance, metric)
decreasing_changepoint (tolerance, metric)
increasing_changepoint (tolerance, metric)
increasing_trend (tolerance, metric)
decreasing_trend (tolerance, metric)
Troubleshooting
Please get in touch with us on Discord or by email if you have any questions.