Declarative Alerting via IaC
Levitate supports configuring alerts and notifications automatically through a Python-based SDK tool that keeps them in step with infrastructure changes.
Configurations for alerting and notifications for observability at scale are
hard to start, maintain, and fix manually, just like provisioning infrastructure
at scale. As infrastructure changes, the observability stack must keep up with
it; otherwise gaps in observability can lead to issues or black swan events.
Last9 has introduced the l9iac tool to solve exactly this problem.
Installation
Install the IaC (Infrastructure as Code) tool first. It lets developers automate creating entities and configuring alerts. The binary can be obtained by signing up for Levitate and contacting Last9 customer support at cs@last9.io.
./install-iac.sh
It is highly recommended that IaC is installed inside a virtual environment, as this provides developers with an isolated space from the rest of the system, allowing them to test and develop their applications more easily. Instructions on how to set up a virtual environment can be found here.
TL;DR
cd <your workspace>
python -m venv env # this will create a ./env dir
source ./env/bin/activate
Quick Start
- Create a YAML file as per your alert rule configuration.
Example: notification_service_am.yaml
# notification_service_am.yaml
entities:
  - name: Notification Backend Alert Manager
    type: service_alert_manager
    data_source: prod-cluster
    entity_class: alert-manager
    external_ref: unique-slug-identifier
    indicators:
      - name: availability
        query: >-
          count(sum by (job, taskid)(up{job !~ "ome.*"}) > 0) / count(sum by
          (job, taskid) (up{job=~".*vmagent.*", job !~ "ome.*"})) * 100
      - name: loss_of_signal
        query: 'absent(up{job !~ "ome.*"})'
    alert_rules:
      - name: Availability of notification service should not be less than 95%
        description: >-
          The error rate (5xx / total requests) is what defines the
          availability, lower value means more degradation
        indicator: availability
        less_than: 99.5
        severity: breach
        bad_minutes: 3
        total_minutes: 5
        group_timeseries_notifications: false
        annotations:
          team: payments
          description: Error Rate described as number of 5xx/throughput
          runbook: https://notion.com/runbooks/payments/error_rates_fixing_strategies
- Prepare the configuration file for running the IaC tool.
The configuration file is a JSON file with the following structure.
{
  "api_config": {
    "read": {
      "refresh_token": "<LAST9_API_READ_REFRESH_TOKEN>",
      "api_base_url": "https://app.last9.io/api/v4",
      "org": "<ORG_SLUG>"
    },
    "write": {
      "refresh_token": "<LAST9_API_WRITE_REFRESH_TOKEN>",
      "api_base_url": "https://app.last9.io/api/v4",
      "org": "<ORG_SLUG>"
    },
    "delete": {
      "refresh_token": "<LAST9_API_DELETE_REFRESH_TOKEN>",
      "api_base_url": "https://app.last9.io/api/v4",
      "org": "<ORG_SLUG>"
    }
  },
  "state_lock_file_path": "state.lock"
}
- The refresh_token can be obtained from the API Access page of the Last9 dashboard. You need refresh_tokens for all three operations (read, write, and delete), because the l9iac tool performs all three while applying the alert rules.
- The <ORG_SLUG> is your organization's unique slug in Last9. It can be obtained from the API Access page of the Last9 dashboard.
- The default api_base_url is https://app.last9.io/api/v4. If you are on an on-premise setup of Last9, contact cs@last9.io to get the api_base_url.
- The state_lock_file_path is the name of the file where l9iac stores the state lock of the current alerting state (along the same lines as Terraform's state lock).
- Run the following command to do a dry run for the changes.
l9iac -mf notification_service_am.yaml -c config.json plan
- Run the following command to apply the changes.
l9iac -mf notification_service_am.yaml -c config.json apply
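The configuration file described in the steps above can also be generated rather than written by hand, which keeps refresh tokens out of version control. The following is a minimal sketch, not part of l9iac itself: the environment variable names (LAST9_ORG_SLUG, LAST9_API_BASE_URL, LAST9_API_*_REFRESH_TOKEN) and the helper functions are assumptions chosen for illustration. Only the JSON keys come from the structure shown above.

```python
import json
import os

DEFAULT_API_BASE_URL = "https://app.last9.io/api/v4"
OPERATIONS = ("read", "write", "delete")  # l9iac needs a token for each


def build_config(env, state_lock_file_path="state.lock"):
    """Build the l9iac config dict from a mapping of environment variables.

    The variable names used here are hypothetical; adapt them to your setup.
    A missing variable raises KeyError so a bad config is never written.
    """
    org = env["LAST9_ORG_SLUG"]
    base_url = env.get("LAST9_API_BASE_URL", DEFAULT_API_BASE_URL)
    api_config = {}
    for op in OPERATIONS:
        api_config[op] = {
            "refresh_token": env[f"LAST9_API_{op.upper()}_REFRESH_TOKEN"],
            "api_base_url": base_url,
            "org": org,
        }
    return {
        "api_config": api_config,
        "state_lock_file_path": state_lock_file_path,
    }


def write_config(path="config.json", env=os.environ):
    """Write the generated config to disk as JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(build_config(env), fh, indent=2)
```

Calling write_config() before running l9iac would produce a config.json with the same shape as the example above, assuming the three token variables and the org slug are exported in the shell.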
We will provision the GitOps flow that runs the apply command once changes are
merged to the master branch in the GitHub repo. Contact cs@last9.io for more
details.
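For local use or scripting, the plan and apply commands above can be chained so that apply only runs after a successful dry run. This is a rough sketch under two assumptions that the documentation does not confirm: that l9iac is on the PATH, and that it exits with a non-zero status when a step fails. The helper names are hypothetical; only the -mf and -c flags and the plan and apply subcommands are taken from the commands shown above.

```python
import subprocess

ALLOWED_ACTIONS = ("plan", "apply")


def l9iac_command(action, model_file, config_file="config.json"):
    """Return the argv list for an l9iac invocation, as shown in Quick Start."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    return ["l9iac", "-mf", model_file, "-c", config_file, action]


def plan_then_apply(model_file, config_file="config.json"):
    """Run a dry run first, then apply.

    check=True raises CalledProcessError on a non-zero exit status, so apply
    is skipped if plan fails (assuming l9iac reports failure that way).
    """
    subprocess.run(l9iac_command("plan", model_file, config_file), check=True)
    subprocess.run(l9iac_command("apply", model_file, config_file), check=True)
```

For example, plan_then_apply("notification_service_am.yaml") would run the two commands from the Quick Start in order.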
Schema
Here is the complete schema for generating the above .yaml file.
Entities
An entity here can be treated as an individual alert manager.
Field | Type | Unique | Required | Description |
---|---|---|---|---|
name | string | false | required | Name of the entity (alert manager) |
external_ref | string | true | required | External reference for the entity; a unique, slug-format identifier for each alert manager |
type | string | false | required | Type of the entity |
entity_class | string | false | optional | Denotes the class of the entity. Supported values: alert-manager |
description | string | false | optional | Description of the entity |
data_source | string | false | optional | Data source |
data_source_id | string | false | optional | The ID of the data source |
team | string | false | optional | The team that owns the entity |
tier | string | false | optional | Tier of the entity |
workspace | string | false | optional | Workspace of the entity |
namespace | string | false | optional | The namespace of the entity |
tags | array | false | optional | List of tags for the entity |
indicators | array | false | optional | List of indicators for the entity |
alert_rules | array | false | optional | List of alert rules for the entity |
notification_channels | string OR array | false | optional | List of notification channels applicable to the entity |
links | array | false | optional | List of links associated with the entity. |
Indicators
Field | Type | Unique | Required | Description |
---|---|---|---|---|
name | string | true (uniqueness enforced at the entity level) | required | Name of the indicator |
query | string | false | required | The PromQL query for the indicator |
unit | string | false | optional | Unit of the indicator |
data_source | string | false | optional | Data source of the indicator (Levitate) |
description | string | false | optional | Description of the indicator |
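Because indicator names must be unique within an entity, it can be useful to check for duplicates before running plan. The sketch below assumes the YAML file has already been loaded into plain Python dictionaries (for instance with a YAML parser of your choice); the function name is hypothetical and this check is not part of l9iac.

```python
from collections import Counter


def duplicate_indicator_names(entity):
    """Return the indicator names that appear more than once in one entity.

    `entity` is a dict shaped like an item of the `entities` list in the
    YAML file. An entity without indicators yields an empty list.
    """
    names = [indicator["name"] for indicator in entity.get("indicators", [])]
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)
```

An empty result means the entity satisfies the uniqueness rule in the table above; any returned name would need to be renamed in one of its occurrences.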
Alert Rules
Field | Type | Unique | Required | Description |
---|---|---|---|---|
name | string | true | required | Rule name that describes the alert |
description | string | true | optional | Description for an alert rule that is included in the alert payload |
indicator | string | false | required | Name of the indicator |
greater_than | number | false | optional | Alert triggers when the indicator value is greater than this |
less_than | number | false | optional | Alert triggers when the indicator value is less than this |
bad_minutes | integer | false | required | Number of minutes the indicator must be in a bad state before alerting |
total_minutes | integer | false | required | Total number of minutes the indicator is sampled over |
is_disabled | boolean | false | optional | Whether the alert rule is disabled |
runbook | object | false | optional | Runbook link that will be included as part of the alert payload |
severity | string | false | optional | Severity of the alert. Allowed values: threat, breach |
label_filter | map/object | false | optional | A mapping of the variables present in the indicator query to their patterns for this alert rule |
expression | string | false | optional | The alert rule expression. To be used only for pattern-based alerts |
group_timeseries_notifications | boolean | false | optional | Whether multiple affected time series in the alert are grouped into one notification |
mute | boolean | false | optional | Whether the alert notifications are muted |
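The alert rule fields above can be sanity-checked before running plan, and the interplay of bad_minutes and total_minutes is easier to see with a small simulation. The sketch below is illustrative only and is not how l9iac or Levitate is implemented. It assumes rules are plain dictionaries, that bad_minutes should not exceed total_minutes, and that an alert fires when at least bad_minutes of the last total_minutes per-minute values breach the threshold. That last point is one reading of the field descriptions, not a confirmed specification, and both function names are hypothetical.

```python
REQUIRED_FIELDS = ("name", "indicator", "bad_minutes", "total_minutes")
SEVERITIES = ("threat", "breach")


def validate_alert_rule(rule, indicator_names):
    """Return a list of problems found in one alert rule dict."""
    errors = [f"missing required field: {f}" for f in REQUIRED_FIELDS if f not in rule]
    if "indicator" in rule and rule["indicator"] not in indicator_names:
        errors.append(f"unknown indicator: {rule['indicator']}")
    if "severity" in rule and rule["severity"] not in SEVERITIES:
        errors.append(f"invalid severity: {rule['severity']}")
    bad, total = rule.get("bad_minutes"), rule.get("total_minutes")
    # Assumption: a rule cannot need more bad minutes than the window holds.
    if isinstance(bad, int) and isinstance(total, int) and bad > total:
        errors.append("bad_minutes exceeds total_minutes")
    return errors


def would_alert(rule, per_minute_values):
    """Assumed semantics: alert if >= bad_minutes of the last total_minutes are bad."""
    window = per_minute_values[-rule["total_minutes"]:]

    def is_bad(value):
        if "less_than" in rule and value < rule["less_than"]:
            return True
        if "greater_than" in rule and value > rule["greater_than"]:
            return True
        return False

    return sum(is_bad(v) for v in window) >= rule["bad_minutes"]
```

With the example rule from the Quick Start (less_than: 99.5, bad_minutes: 3, total_minutes: 5), three readings below 99.5 within the last five minutes would trigger the alert under this reading, while two would not.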
Runbook
Field | Type | Unique | Required | Description |
---|---|---|---|---|
link | string | false | required | runbook link that is included as part of the alert payload |
Notification Channels
Field | Type | Unique | Required | Description |
---|---|---|---|---|
name | string | false | required | Name of the notification channel |
type | string | false | required | Type of the notification channel. Allowed values: Slack, Pagerduty, OpsGenie |
severity | string | false | optional | Severity of the alerts sent through this channel. Allowed values: threat, breach |
mention | string OR list(string) | false | optional | Only applicable to Slack. The user or list of users to tag in the alert message. |
Link
Field | Type | Unique | Required | Description |
---|---|---|---|---|
name | string | false | required | Display name of the link |
url | string | false | required | The actual URL of the link |
Supported Macros by IaC
low_spike(tolerance, metric)
high_spike(tolerance, metric)
decreasing_changepoint(tolerance, metric)
increasing_changepoint(tolerance, metric)
increasing_trend(tolerance, metric)
decreasing_trend(tolerance, metric)