Declarative Alerting via IaC

Last9 supports configuring alerts and notifications automatically using a Python-based SDK tool that keeps pace with infrastructure changes.

Configuring alerting and notifications for observability at scale is hard to start, maintain, and fix manually, just like provisioning infrastructure at scale. As infrastructure changes, the observability stack must keep up with it; otherwise, gaps in observability or black swan events can lead to issues. Last9 introduced the l9iac tool to solve exactly this problem.

Installation

Last9's IaC (Infrastructure as Code) tool is available as a Docker image, providing a consistent and isolated environment for automating entity creation and alert configuration.

  1. Pull the Docker Image

    docker pull last9system/iac:latest

    The image is available on DockerHub.

  2. Prepare Your Working Directory

    Create a directory containing:

    • Your IaC YAML files
    • config.json with your refresh tokens (see file structure)
    • Space for the state lock file
  3. Run the Docker Container

    docker run --name l9iac -d -v <local-path>:<container-path> last9system/iac:<version>

    Example:

    docker run -d -v /home/user/iac-files:/app/rules last9system/iac:2.4.2

    💡 Note: If using Docker Desktop, ensure file sharing is enabled for the volume mount.

  4. Execute IaC Commands

    docker exec -it <container-id> l9iac -mf <model-file-path> -c <config-file-path> <command>

    Example:

    docker exec -it bcdea6660fd4 l9iac -mf /app/rules/alert-rules.yaml -c /app/rules/config.json plan
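
When these steps are scripted (for example in a CI job), the `docker exec` invocation from step 4 can be assembled programmatically. The following is a minimal sketch, not part of the l9iac tool: the helper name `build_l9iac_command` is hypothetical, the container name `l9iac` assumes the `--name l9iac` option from step 3, and only the `-mf` and `-c` flags and the `plan` and `apply` commands shown on this page are used. The `-it` flags are left out because a script has no interactive terminal.

```python
import shlex

# Commands shown in this guide; other commands may exist but are not assumed here.
KNOWN_COMMANDS = ("plan", "apply")


def build_l9iac_command(container, model_file, config_file, command):
    """Return the argv list for running l9iac inside a running container.

    Hypothetical helper: it only assembles arguments, it does not run Docker.
    """
    if command not in KNOWN_COMMANDS:
        raise ValueError(f"unsupported command: {command!r}")
    return [
        "docker", "exec", container,
        "l9iac", "-mf", model_file, "-c", config_file, command,
    ]


argv = build_l9iac_command(
    "l9iac", "/app/rules/alert-rules.yaml", "/app/rules/config.json", "plan"
)
# Print the command in copy-pasteable shell form.
print(shlex.join(argv))
```

The returned list can be passed to `subprocess.run(argv, check=True)`, so that a failed `plan` stops the script before `apply` is attempted.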

Configuration File Structure

The IaC tool requires a config.json file with the following structure:

{
  "api_config": {
    "read": {
      "refresh_token": "<LAST9_API_READ_REFRESH_TOKEN>",
      "api_base_url": "https://app.last9.io/api/v4",
      "org": "<ORG_SLUG>"
    },
    "write": {
      "refresh_token": "<LAST9_API_WRITE_REFRESH_TOKEN>",
      "api_base_url": "https://app.last9.io/api/v4",
      "org": "<ORG_SLUG>"
    },
    "delete": {
      "refresh_token": "<LAST9_API_DELETE_REFRESH_TOKEN>",
      "api_base_url": "https://app.last9.io/api/v4",
      "org": "<ORG_SLUG>"
    }
  },
  "state_lock_file_path": "state.lock"
}

The state lock file (state.lock above) should be in the same directory as the model file and the config file.

Important Notes

  • The refresh_token values can be obtained from the API Access page in the Last9 dashboard
  • The <ORG_SLUG> can be obtained from the app’s URL: app.last9.io/v2/organizations/<ORG_SLUG>
  • For on-premise Last9 setups, contact cs@last9.io to get the correct api_base_url
  • The state_lock_file_path should be accessible from the directory where you run the IaC commands
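
Before running the tool, it can help to check that config.json has all three token sections filled in. The sketch below is illustrative and not part of l9iac: the function name `check_config` is hypothetical, and treating values still wrapped in angle brackets (such as `<ORG_SLUG>`) as unset is an assumption of this example.

```python
import json

OPERATIONS = ("read", "write", "delete")
REQUIRED_KEYS = ("refresh_token", "api_base_url", "org")


def check_config(config):
    """Return a list of problems found in a parsed config.json dictionary."""
    problems = []
    api_config = config.get("api_config", {})
    for op in OPERATIONS:
        section = api_config.get(op)
        if section is None:
            problems.append(f"api_config.{op} is missing")
            continue
        for key in REQUIRED_KEYS:
            value = section.get(key)
            # Assumption: unfilled placeholders like <ORG_SLUG> count as unset.
            is_placeholder = (
                isinstance(value, str)
                and value.startswith("<")
                and value.endswith(">")
            )
            if not isinstance(value, str) or not value or is_placeholder:
                problems.append(f"api_config.{op}.{key} is not set")
    if not config.get("state_lock_file_path"):
        problems.append("state_lock_file_path is not set")
    return problems


# Example input: the read token is still a placeholder, the delete section is absent.
sample = json.loads("""
{
  "api_config": {
    "read":  {"refresh_token": "<LAST9_API_READ_REFRESH_TOKEN>",
              "api_base_url": "https://app.last9.io/api/v4", "org": "acme"},
    "write": {"refresh_token": "example-write-token",
              "api_base_url": "https://app.last9.io/api/v4", "org": "acme"}
  },
  "state_lock_file_path": "state.lock"
}
""")

for problem in check_config(sample):
    print(problem)
```

For a real file, load it with `json.load(open("config.json"))` and treat a non-empty result as a reason to stop before `plan` or `apply`.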

Quick Start

  1. Create a YAML file that describes your alert rule configuration

    Example: notification_service_am.yaml

    # notification_service_am.yaml
    entities:
      - name: Notification Backend Alert Manager
        type: service_alert_manager
        data_source: prod-cluster
        entity_class: alert-manager
        external_ref: unique-slug-identifier
        indicators:
          - name: availability
            query: count(sum by (job, taskid)(up{job !~ "ome.*"}) > 0) / count(sum by (job, taskid) (up{job=~".*vmagent.*", job !~ "ome.*"})) * 100
          - name: loss_of_signal
            query: 'absent(up{job !~ "ome.*"})'
        alert_rules:
          - name: Availability of notification service should not be less than 95%
            description: The error rate (5xx / total requests) is what defines the availability, lower value means more degradation
            indicator: availability
            less_than: 99.5
            severity: breach
            bad_minutes: 3
            total_minutes: 5
            group_timeseries_notifications: false
            annotations:
              team: payments
              description: Error Rate described as number of 5xx/throughput
              runbook: https://notion.com/runbooks/payments/error_rates_fixing_strategies
  2. Prepare the configuration file for running the IaC tool

    The configuration file is a JSON file with the following structure:

    {
      "api_config": {
        "read": {
          "refresh_token": "<LAST9_API_READ_REFRESH_TOKEN>",
          "api_base_url": "https://app.last9.io/api/v4",
          "org": "<ORG_SLUG>"
        },
        "write": {
          "refresh_token": "<LAST9_API_WRITE_REFRESH_TOKEN>",
          "api_base_url": "https://app.last9.io/api/v4",
          "org": "<ORG_SLUG>"
        },
        "delete": {
          "refresh_token": "<LAST9_API_DELETE_REFRESH_TOKEN>",
          "api_base_url": "https://app.last9.io/api/v4",
          "org": "<ORG_SLUG>"
        }
      },
      "state_lock_file_path": "state.lock"
    }

    • The refresh_token values can be obtained from the API Access page of the Last9 dashboard. You need refresh tokens for all three operations (read, write, and delete), because the l9iac tool performs all three while applying the alert rules.
    • The <ORG_SLUG> is your organization's unique slug in Last9. It can be obtained from the API Access page of the Last9 dashboard.
    • The default api_base_url is https://app.last9.io/api/v4. If you are on an on-premise setup of Last9, contact cs@last9.io to get the api_base_url.
    • The state_lock_file_path is the name of the file where l9iac stores the state lock for the current alerting state (along the same lines as Terraform's state lock).
  3. Run the following command to do a dry run of the changes

    l9iac -mf notification_service_am.yaml -c config.json plan
  4. Run the following command to apply the changes

    l9iac -mf notification_service_am.yaml -c config.json apply
Tip

Last9 can provision a GitOps flow that runs the apply command once changes are merged to the master branch of your GitHub repo. Contact cs@last9.io for more details.

Schema

Here is the complete schema for generating the above .yaml file:

Entities

| Field | Type | Unique | Required | Description |
| --- | --- | --- | --- | --- |
| name | string | false | required | Name of the entity (alert manager) |
| type | string | false | required | Type of the entity |
| external_ref | string | true | required | External reference for the entity; a unique, slug-format identifier for each alert manager |
| adhoc_filter | object | false | optional | List of common rule filters for the entity |
| alert_rules | array | false | optional | List of alert rules for the entity |
| data_source | string | false | optional | Data source |
| data_source_id | string | false | optional | The ID of the data source |
| description | string | false | optional | Description of the entity |
| entity_class | string | false | optional | Denotes the class of the entity. Supported values: alert-manager |
| indicators | array | false | optional | List of indicators for the entity |
| labels | object | false | optional | List of key-value pairs of group label names and values |
| links | array | false | optional | List of links associated with the entity |
| namespace | string | false | optional | The namespace of the entity |
| notification_channels | string OR array | false | optional | List of notification channels applicable to the entity |
| tags | array | false | optional | List of tags for the entity |
| team | string | false | optional | The team that owns the entity |
| tier | string | false | optional | Tier of the entity |
| ui_readonly | boolean | false | optional | Disable any sort of edits to the alert group from the UI |
| workspace | string | false | optional | Workspace of the entity |

Common Rule Filters (Adhoc Filters)

| Field | Type | Unique | Required | Description |
| --- | --- | --- | --- | --- |
| labels | object | false | required | List of key-value pairs of label names and values |
| data_source | string | false | required | Defaults to entity's data source |

Alert Rules

| Field | Type | Unique | Required | Description |
| --- | --- | --- | --- | --- |
| name | string | true | required | Rule name that describes the alert |
| indicator | string | false | required | Name of the indicator |
| bad_minutes | integer | false | required | Number of minutes the indicator must be in a bad state before alerting |
| total_minutes | integer | false | required | Total number of minutes the indicator is sampled over |
| description | string | true | optional | Description for an alert rule that is included in the alert payload |
| expression | string | false | optional | Alert rule expression, to be used only for pattern-based alerts |
| greater_than | number | false | optional | Alert triggers when the indicator value is greater than this |
| greater_than_eq | number | false | optional | Alert triggers when the indicator value is greater than or equal to this |
| less_than | number | false | optional | Alert triggers when the indicator value is less than this |
| less_than_eq | number | false | optional | Alert triggers when the indicator value is less than or equal to this |
| equal_to | number | false | optional | Alert triggers when the indicator value is equal to this |
| not_equal | number | false | optional | Alert triggers when the indicator value is not equal to this |
| group_timeseries_notifications | boolean | false | optional | Whether multiple impacted time series in an alert are grouped into one notification |
| is_disabled | boolean | false | optional | Whether the alert is disabled |
| label_filter | map/object | false | optional | Mapping of the variables present in the indicator query and their pattern for the alert rule |
| mute | boolean | false | optional | Whether alert notifications are muted |
| runbook | object (see Runbook) | false | optional | Runbook link to be included in the alert payload |
| severity | string | false | optional | Severity of the alert. Allowed values: threat, breach |
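
The interaction between bad_minutes, total_minutes, and the threshold fields can be illustrated with a small simulation. The sketch below encodes one reading of the field descriptions above: a rule fires when at least bad_minutes of the most recent total_minutes one-minute samples violate the threshold. This is an assumption made for illustration, not Last9's actual evaluation logic, and the function name `would_alert` is hypothetical.

```python
import operator

# Threshold fields from the Alert Rules table mapped to Python comparisons.
COMPARATORS = {
    "greater_than": operator.gt,
    "greater_than_eq": operator.ge,
    "less_than": operator.lt,
    "less_than_eq": operator.le,
    "equal_to": operator.eq,
    "not_equal": operator.ne,
}


def would_alert(rule, samples):
    """Return True if the rule would fire for the given per-minute samples.

    Assumed semantics: fire when at least `bad_minutes` of the last
    `total_minutes` samples violate the rule's threshold.
    """
    fields = [name for name in COMPARATORS if name in rule]
    if len(fields) != 1:
        raise ValueError("this sketch expects exactly one threshold field")
    compare = COMPARATORS[fields[0]]
    threshold = rule[fields[0]]
    window = samples[-rule["total_minutes"]:]
    bad = sum(1 for value in window if compare(value, threshold))
    return bad >= rule["bad_minutes"]


# Same thresholds as the Quick Start example rule.
rule = {"indicator": "availability", "less_than": 99.5,
        "bad_minutes": 3, "total_minutes": 5}

print(would_alert(rule, [99.9, 99.1, 99.9, 99.2, 99.0]))  # three samples below 99.5
print(would_alert(rule, [99.9, 99.9, 99.9, 99.2, 99.0]))  # two samples below 99.5
```

Under this reading, a larger gap between bad_minutes and total_minutes tolerates brief dips, while setting the two equal requires the indicator to be bad for the whole window.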

Runbook

| Field | Type | Unique | Required | Description |
| --- | --- | --- | --- | --- |
| link | string | false | required | Runbook link to be included in the alert payload |

Indicators

| Field | Type | Unique | Required | Description |
| --- | --- | --- | --- | --- |
| name | string | true, uniqueness enforced at entity level | required | Name of the indicator |
| query | string | false | required | PromQL query for the indicator |
| data_source | string | false | optional | Data Source of the indicator (Last9) |
| description | string | false | optional | Description of the indicator |
| unit | string | false | optional | Unit of the indicator |

Links

| Field | Type | Unique | Required | Description |
| --- | --- | --- | --- | --- |
| name | string | false | required | Display name of the link |
| url | string | false | required | URL of the link |
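
Two constraints from the schema lend themselves to a quick local check before running plan: indicator names are unique within an entity, and each alert rule names an indicator. The sketch below is illustrative only. The function name `lint_entity` is hypothetical, it works on an entity that has already been parsed into a Python dictionary, and reporting a rule that references an undefined indicator as a problem is an assumption of this example rather than documented l9iac behavior.

```python
from collections import Counter


def lint_entity(entity):
    """Return a list of consistency problems for one parsed entity."""
    problems = []
    names = [ind["name"] for ind in entity.get("indicators", [])]
    # Indicator names must be unique within the entity.
    for name, count in Counter(names).items():
        if count > 1:
            problems.append(f"duplicate indicator name: {name}")
    # Assumption: every rule should point at an indicator defined on the entity.
    for rule in entity.get("alert_rules", []):
        if rule["indicator"] not in names:
            problems.append(
                f"rule {rule['name']!r} references unknown indicator "
                f"{rule['indicator']!r}"
            )
    return problems


entity = {
    "name": "Notification Backend Alert Manager",
    "indicators": [
        {"name": "availability", "query": "up"},
        {"name": "availability", "query": "absent(up)"},
    ],
    "alert_rules": [
        {"name": "Signal lost", "indicator": "loss_of_signal"},
    ],
}

for problem in lint_entity(entity):
    print(problem)
```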

Notification Channels

| Field | Type | Unique | Required | Description |
| --- | --- | --- | --- | --- |
| name | string | false | required | Name of the notification channel |
| type | string | false | required | Type of notification channel. Allowed values: Slack, Pagerduty, OpsGenie |
| mention | string OR list (string) | false | optional | Only applicable to Slack. The user(s) to tag in the alert message |
| severity | string | false | optional | Severity of the alerts sent through this channel. Allowed values: threat, breach |
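
The allowed values in the table above can be checked locally before a channel definition is submitted. This is a minimal sketch, not part of l9iac: the function name `check_channel` is hypothetical, and matching the type and severity values case-sensitively is an assumption.

```python
ALLOWED_TYPES = {"Slack", "Pagerduty", "OpsGenie"}
ALLOWED_SEVERITIES = {"threat", "breach"}


def check_channel(channel):
    """Return a list of problems for one notification channel dictionary."""
    problems = []
    for field in ("name", "type"):
        if not channel.get(field):
            problems.append(f"{field} is required")
    ctype = channel.get("type")
    # Assumption: values are compared exactly as spelled in the table.
    if ctype and ctype not in ALLOWED_TYPES:
        problems.append(f"unsupported type: {ctype}")
    if "mention" in channel and ctype != "Slack":
        problems.append("mention only applies to Slack channels")
    severity = channel.get("severity")
    if severity is not None and severity not in ALLOWED_SEVERITIES:
        problems.append(f"unsupported severity: {severity}")
    return problems


valid = {"name": "payments-oncall", "type": "Slack",
         "mention": ["@alice"], "severity": "breach"}
invalid = {"name": "pager", "type": "Pagerduty",
           "mention": "@bob", "severity": "critical"}

print(check_channel(valid))
print(check_channel(invalid))
```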

Supported Macros by IaC

  • low_spike (tolerance, metric)
  • high_spike (tolerance, metric)
  • decreasing_changepoint (tolerance, metric)
  • increasing_changepoint (tolerance, metric)
  • increasing_trend (tolerance, metric)
  • decreasing_trend (tolerance, metric)

Troubleshooting

Please get in touch with us on Discord or by email if you have any questions.