COURSE

Prometheus - Rules

DIFFICULTY

Normal

APPROXIMATE TIME

1h00

RELATED MACHINE

Ubuntu Server 20.04 LTS

SSD Volume Type 64-bit x86

Prometheus

3. Rules

So far, we have seen how to use Prometheus to keep track of metrics for different services. But we still have to check that data ourselves, which quickly becomes tedious when there are many services. Sometimes, we simply want to be told when something is wrong (a service is not working properly) or when a specific event occurs (a server has reached 90% of its capacity).

To do so, we can define rules. Rules are of two types in Prometheus:

recording rules : recording rules are used to pre-compute frequently used expressions and store their results in the Time Series database.
alerting rules : alerting rules are used to trigger a behavior when certain events happen.

Rules are written in YAML files that are referenced in the Prometheus Server configuration file (prometheus.yml). Prometheus reads those files at startup and again each time its configuration is reloaded.

Recording rules

A recording rule is basically a PromQL expression that is evaluated at regular intervals. The result of this expression is then stored in the Time Series database. This is especially useful for expressions that take a long time to compute and that are frequently requested by clients of the Prometheus Server, whether its own dashboard or another visualization service.

Rules are organized in groups. The top-level key of a rules file should be groups, and its value is the list of those groups, each with a name and its own list of rules.

Open a rule file:

nano prometheus/recording_rules.yml

A very first rule file would be:

groups:
  - name: first_recording_rule_group
    rules: 
    - record: my_first_recording_metric
      expr: sum by (handler) (prometheus_http_requests_total)

The name of the rule group is first_recording_rule_group. The file contains a single rule for now: it creates the metric my_first_recording_metric, which stores the result of the expression sum by (handler) (prometheus_http_requests_total), that is, the number of HTTP requests received by Prometheus, summed per handler.
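
To make the aggregation concrete, here is a minimal Python sketch of what sum by (handler) does to a set of series. The sample label values and counts are made up for illustration, and the helper name sum_by is hypothetical; Prometheus performs this computation itself each time it evaluates the rule.

```python
from collections import defaultdict

# Hypothetical samples of prometheus_http_requests_total: (labels, value).
samples = [
    ({"handler": "/metrics", "code": "200"}, 120.0),
    ({"handler": "/graph", "code": "200"}, 7.0),
    ({"handler": "/graph", "code": "302"}, 3.0),
]


def sum_by(label, series):
    """Sum the values of all series that share the same value for `label`."""
    totals = defaultdict(float)
    for labels, value in series:
        totals[labels.get(label, "")] += value
    return dict(totals)


result = sum_by("handler", samples)
print(result)  # {'/metrics': 120.0, '/graph': 10.0}
```

The two /graph series collapse into one value because every label other than handler is dropped by the aggregation.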

We can have several rules in this group and several groups in this file:

groups:
  - name: first_recording_rule_group
    rules: 
    - record: my_first_recording_metric
      expr: sum by (handler) (prometheus_http_requests_total)
    - record: my_second_recording_metric
      expr: sum by (job) (prometheus_http_requests_total)
  - name: second_recording_rule_group
    rules:
      - record: my_third_recording_metric
        expr: avg_over_time(prometheus_http_request_duration_seconds_bucket [20m])

Once this is copy/pasted, you can save and exit the file.

Prometheus provides a tool to check if a rules file is well written: promtool.

prometheus/promtool check rules prometheus/recording_rules.yml

This should print out:

Checking prometheus/recording_rules.yml
  SUCCESS: 3 rules found
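
As a rough illustration of the structure a rules file has to follow, the sketch below counts the rules of a file that has already been parsed into a Python dict. It is not how promtool works internally, and it covers much less: promtool also parses every PromQL expression, which this sketch does not attempt. The function name count_rules is hypothetical.

```python
def count_rules(rule_file):
    """Return the number of rules, raising ValueError on a malformed structure."""
    groups = rule_file.get("groups")
    if not isinstance(groups, list):
        raise ValueError("top-level key 'groups' must be a list")
    total = 0
    for group in groups:
        if "name" not in group:
            raise ValueError("every group needs a 'name'")
        for rule in group.get("rules", []):
            # A rule is either a recording rule or an alerting rule, never both.
            if ("record" in rule) == ("alert" in rule):
                raise ValueError("a rule needs exactly one of 'record' or 'alert'")
            if "expr" not in rule:
                raise ValueError("every rule needs an 'expr'")
            total += 1
    return total


# Content of recording_rules.yml, written as the dict a YAML parser would return.
recording_rules = {
    "groups": [
        {"name": "first_recording_rule_group", "rules": [
            {"record": "my_first_recording_metric",
             "expr": "sum by (handler) (prometheus_http_requests_total)"},
            {"record": "my_second_recording_metric",
             "expr": "sum by (job) (prometheus_http_requests_total)"},
        ]},
        {"name": "second_recording_rule_group", "rules": [
            {"record": "my_third_recording_metric",
             "expr": "avg_over_time(prometheus_http_request_duration_seconds_bucket [20m])"},
        ]},
    ]
}

print(count_rules(recording_rules))  # 3
```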

To make the Prometheus Server load this rule file, we have to reference it in the prometheus.yml file, under the rule_files key:

rule_files:
  - "recording_rules.yml"

Stop and restart the Prometheus Server, then open the dashboard. The rules are listed in the Status > Rules panel.

We can also query the new metrics in the Graph panel, like any other metric.
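
The recorded metric can also be read programmatically through the Prometheus HTTP API. The sketch below uses only the Python standard library and assumes that Prometheus listens on http://localhost:9090, its default address; adjust PROMETHEUS_URL if your setup differs. The function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Prometheus listens on its default address.
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(expression, base_url=PROMETHEUS_URL):
    """Build the URL of an instant query against the HTTP API."""
    return base_url + "/api/v1/query?" + urllib.parse.urlencode({"query": expression})


def instant_query(expression, base_url=PROMETHEUS_URL):
    """Run an instant query and return the list of resulting series."""
    url = build_query_url(expression, base_url)
    with urllib.request.urlopen(url, timeout=5) as response:
        payload = json.load(response)
    if payload.get("status") != "success":
        raise RuntimeError("query failed: %s" % payload.get("error"))
    return payload["data"]["result"]


def print_metric(expression):
    """Print the labels and current value of every series returned by the query."""
    for series in instant_query(expression):
        # Each element carries its labels and a [timestamp, value] pair.
        print(series["metric"], series["value"][1])
```

With the server running, calling print_metric("my_first_recording_metric") should print one line per handler.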

Alerting rules

One of the main strengths of Prometheus is its ability to trigger alerts. To define alerts, we need to define alerting rules. The principle is very similar to that of recording rules.

Open an alerting_rules.yml file:

nano prometheus/alerting_rules.yml

We can define our first group of alerting rules:

groups: 
  - name: my_first_alerting_rules_group
    rules:
      - alert: MyFirstAlert
        expr: increase(nb_of_requests_total [1m]) > 20

The alert key is used in place of record, and the expression is usually a comparison: the alert becomes active for every time series that satisfies it. Here, our alert is triggered when our Python app receives more than 20 requests within 1 minute.
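
The logic of this alert expression can be sketched in Python for a single series. This is a simplification: the real increase() function also handles counter resets and extrapolates to the edges of the window, which the sketch ignores. The scrape values and function names are hypothetical.

```python
def simple_increase(samples, now, window):
    """Approximate increase(): last value minus first value inside the window.

    `samples` is a list of (timestamp_seconds, counter_value) pairs sorted by time.
    Unlike Prometheus, this ignores counter resets and does no extrapolation.
    """
    in_window = [value for ts, value in samples if now - window <= ts <= now]
    if len(in_window) < 2:
        return None  # not enough points to compute an increase
    return in_window[-1] - in_window[0]


def alert_is_active(samples, now, window=60, threshold=20):
    """Mimic `increase(nb_of_requests_total[1m]) > 20` for a single series."""
    value = simple_increase(samples, now, window)
    return value is not None and value > threshold


# Hypothetical scrapes of nb_of_requests_total, one every 15 seconds.
quiet = [(0, 100), (15, 102), (30, 105), (45, 106), (60, 108)]
busy = [(0, 100), (15, 110), (30, 125), (45, 140), (60, 150)]

print(alert_is_active(quiet, now=60))  # False: the counter grew by 8
print(alert_is_active(busy, now=60))   # True: the counter grew by 50
```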

You can check the validity of the rules by using the promtool command:

prometheus/promtool check rules prometheus/alerting_rules.yml

We then need to reference the new file in the Prometheus configuration file:

rule_files:
  - "recording_rules.yml"
  - "alerting_rules.yml"

Restart the server and open the dashboard. In the Alerts menu, the alert is first classified as Inactive. To trigger it, you can create the quick_client.py script. The Flask server defined in the app_metrics.py file must be running for the script to work, so restart it if necessary.

Add the code below to the file quick_client.py:

import time
import requests

address = 'http://localhost:5000/'

# send 100 requests, roughly 10 per second
for i in range(100):
    time.sleep(0.1)
    # the tiny timeout lets us move on without waiting for the response
    try:
        requests.get(address, timeout=10e-9)
    except requests.exceptions.Timeout:
        pass

Once the alert is triggered, you can see it under Firing in the Alerts menu. If you wait long enough without sending new requests, the alert goes back to Inactive.
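
The state of the alerts can also be checked from a script through the /api/v1/alerts endpoint of the HTTP API, which lists the alerts that are currently pending or firing (inactive alerts are not returned). The sketch below assumes Prometheus listens on http://localhost:9090; the function names are hypothetical and the sample payload is hand-written in the shape of the API response.

```python
import json
import urllib.request


def alert_states(payload):
    """Map each active alert name to its state ('pending' or 'firing').

    If several series trigger the same alert, only the last state is kept.
    """
    return {
        alert["labels"]["alertname"]: alert["state"]
        for alert in payload["data"]["alerts"]
    }


def fetch_alert_states(base_url="http://localhost:9090"):
    """Fetch the active alerts from a running Prometheus Server."""
    with urllib.request.urlopen(base_url + "/api/v1/alerts", timeout=5) as response:
        return alert_states(json.load(response))


# Hand-written payload in the shape returned by /api/v1/alerts.
sample_payload = {
    "status": "success",
    "data": {
        "alerts": [
            {
                "labels": {"alertname": "MyFirstAlert"},
                "annotations": {},
                "state": "firing",
            }
        ]
    },
}

print(alert_states(sample_payload))  # {'MyFirstAlert': 'firing'}
```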

We can add a second rule and restart the Prometheus Server:

groups: 
  - name: my_first_alerting_rules_group
    rules:
      - alert: MyFirstAlert
        expr: increase(nb_of_requests_total [1m]) > 20
      - alert: MySecondAlert
        expr: increase(nb_of_requests_total [1m]) > 10
        for: 20s

Here we have added the for key. When the condition of the alert becomes true, the alert is first classified as Pending. If the condition still holds at the end of the for period, the alert is fired. We can also add labels and annotations to our alerting rules. Note that alerting rules are also visible in the Status > Rules menu, just like recording rules.

By default, rules are evaluated every minute. This can be changed in the Prometheus configuration file: under the global key, set evaluation_interval to a different value.

These alerts are a good way to spot particular events on a service, but for now they do not trigger any action.
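
The effect of the for key can be sketched as a small state machine that is stepped once per rule evaluation. This is a simplified model for a single series, not the Prometheus implementation; the function name is hypothetical, and the 10 second evaluation interval used in the example is an assumption chosen to make the Pending phase visible.

```python
def simulate_for_clause(condition_results, eval_interval, for_duration):
    """Return the alert state after each rule evaluation.

    `condition_results` holds, per evaluation, whether the alert expression held.
    `eval_interval` and `for_duration` are expressed in seconds.
    """
    states = []
    active_since = None  # time at which the condition started to hold
    for step, condition_holds in enumerate(condition_results):
        now = step * eval_interval
        if not condition_holds:
            # The condition stopped holding: the alert is reset.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        if now - active_since >= for_duration:
            states.append("firing")
        else:
            states.append("pending")
    return states


# for: 20s, evaluated every 10 seconds (assumed interval).
print(simulate_for_clause([False, True, True, True, False], 10, 20))
# ['inactive', 'pending', 'pending', 'firing', 'inactive']
```

With an evaluation interval of 1 minute, the same alert would be Pending at the first evaluation where the condition holds and Firing at the next one, since more than 20 seconds separate the two.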
