Prometheus - Rules


Prometheus
3. Rules
So far, we have seen how to use Prometheus to keep track of metrics for different services. But we still have to inspect that data ourselves, which becomes impractical when there are many services. Often, we simply want to know when something is wrong (a service is not working properly) or when a specific event occurs (a server has reached 90% of its capacity).
To do so, we can define rules. Prometheus has two types of rules:
- recording rules: they pre-compute frequently used expressions and store the results in the time series database.
- alerting rules: they trigger an alert when certain conditions are met.
Rules are written in YAML files that are referenced in the Prometheus Server configuration file (prometheus.yml). Prometheus loads those files at startup and whenever its configuration is reloaded, then evaluates the rules at regular intervals.
Recording rules
A recording rule is basically a PromQL expression that is evaluated at regular intervals. The result of this expression is then stored in the time series database. This is especially useful for expressions that take a long time to compute and that are frequently requested by clients of the Prometheus Server, whether its own dashboard or another visualization service.
Rules are organized in groups. The top-level key of a rules file should be groups, and its value is a list of groups, each with a name and a list of rules.
Open a rule file:
nano prometheus/recording_rules.yml
A very first rule file would be:
groups:
  - name: first_recording_rule_group
    rules:
      - record: my_first_recording_metric
        expr: sum by (handler) (prometheus_http_requests_total)
The name of the rule group is first_recording_rule_group. There is only one rule in the file so far: it creates the metric my_first_recording_metric, which holds the result of the expression sum by (handler) (prometheus_http_requests_total).
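To make the aggregation concrete, here is a minimal Python sketch of what sum by (handler) does conceptually: series that share the same value for the handler label are added together. The sum_by helper and the sample values are hypothetical and only illustrate the idea; they are not part of Prometheus.

```python
from collections import defaultdict


def sum_by(series, label):
    """Group (labels, value) pairs by one label and sum the values per group."""
    totals = defaultdict(float)
    for labels, value in series:
        # A series without the label is grouped under the empty label value.
        totals[labels.get(label, "")] += value
    return dict(totals)


# Hypothetical samples standing in for prometheus_http_requests_total.
samples = [
    ({"handler": "/metrics", "code": "200"}, 10.0),
    ({"handler": "/metrics", "code": "500"}, 2.0),
    ({"handler": "/graph", "code": "200"}, 3.0),
]

totals_per_handler = sum_by(samples, "handler")
print(totals_per_handler)
```

The recording rule stores one such aggregated series per handler under the new metric name, so later queries do not have to repeat the aggregation.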
We can have several rules in this group and several groups in this file:
groups:
  - name: first_recording_rule_group
    rules:
      - record: my_first_recording_metric
        expr: sum by (handler) (prometheus_http_requests_total)
      - record: my_second_recording_metric
        expr: sum by (job) (prometheus_http_requests_total)
  - name: second_recording_rule_group
    rules:
      - record: my_third_recording_metric
        expr: avg_over_time(prometheus_http_request_duration_seconds_bucket [20m])
Once this is copy/pasted, you can save and exit the file.
Prometheus provides a tool to check whether a rules file is well formed: promtool.
prometheus/promtool check rules prometheus/recording_rules.yml
This should print out:
Checking recording_rules.yml
SUCCESS: 3 rules found
To give this rule file to the Prometheus Server, we have to change the prometheus.yml file, under the key rule_files:
rule_files:
  - "recording_rules.yml"
Shut down and restart the Prometheus Server, then go to the dashboard. Open the Status > Rules panel to see the rules that were loaded.
We can also query this metric in the Graph panel.
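The recorded metric can also be read from a script through the Prometheus HTTP API. The sketch below is one possible way to do it with the standard library only; it assumes that the server listens on http://localhost:9090 (the default port), and the helper names build_query_url and instant_query are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the Prometheus Server runs locally on its default port.
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(expr, base_url=PROMETHEUS_URL):
    """Return the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def instant_query(expr, base_url=PROMETHEUS_URL):
    """Run an instant query and return the list of result series."""
    with urllib.request.urlopen(build_query_url(expr, base_url), timeout=5) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]
```

With the server running and the rule file loaded, calling instant_query("my_first_recording_metric") should return one entry per handler, each with a metric dictionary of labels and a value pair made of a timestamp and the sample value.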
Alerting rules
One of the main benefits of Prometheus is its ability to trigger alerts. To define alerts, we need to define alerting rules. The principle is very similar to that of recording rules.
Open an alerting_rules.yml file:
nano prometheus/alerting_rules.yml
We can define our first group of alerting rules:
groups:
  - name: my_first_alerting_rules_group
    rules:
      - alert: MyFirstAlert
        expr: increase(nb_of_requests_total [1m]) > 20
The key alert is used in place of record, and the expression is usually a comparison: the alert is active for every time series that the expression returns. Here our alert is triggered when the number of requests to our Python app increases by more than 20 within 1 minute.
You can check the validity of the rules by using the promtool command:
prometheus/promtool check rules prometheus/alerting_rules.yml
We need to change the Prometheus configuration file:
rule_files:
  - "recording_rules.yml"
  - "alerting_rules.yml"
Restart the server and open the dashboard. In the Alerts menu, the alert is first classified as Inactive.
You can create the quick_client.py script to trigger this alert. You will need to restart the Flask server defined in the app_metrics.py file before running the following code.
Add the code below to the file quick_client.py:
import time
import requests
address = 'http://localhost:5000/'
for i in range(100):
    # pause briefly between two requests
    time.sleep(0.1)
    # the tiny timeout lets us send the request without waiting for the response
    try:
        requests.get(address, timeout=10e-9)
    except requests.exceptions.Timeout:
        # expected: Timeout covers both connect and read timeouts
        pass
You can see the alert under Firing in the Alerts menu when it is triggered. If you wait long enough, the alert will be classified as Inactive again.
We can add a second rule and restart the Prometheus Server:
groups:
  - name: my_first_alerting_rules_group
    rules:
      - alert: MyFirstAlert
        expr: increase(nb_of_requests_total [1m]) > 20
      - alert: MySecondAlert
        expr: increase(nb_of_requests_total [1m]) > 10
        for: 20s
Here we have added the for key. This means that when the alert condition first becomes true, the alert is classified as Pending. If the condition is still true after the for period, the alert is fired.
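The Inactive, Pending and Firing states can be illustrated with a small simulation. This is a simplified sketch, not the actual Prometheus implementation: it follows a single alert, assumes evaluations happen at a fixed interval, and the alert_states function is a hypothetical helper.

```python
def alert_states(condition_results, eval_interval_s, for_s):
    """Return the alert state after each rule evaluation.

    condition_results: one boolean per evaluation (True when expr returns a result).
    eval_interval_s: seconds between two evaluations.
    for_s: duration of the for clause in seconds (0 when the key is absent).
    """
    states = []
    active_since = None  # time of the first evaluation where the condition held
    for i, is_true in enumerate(condition_results):
        now = i * eval_interval_s
        if not is_true:
            # The condition no longer holds: the alert goes back to inactive.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        # The alert fires once the condition has held for the whole for period.
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


# Evaluations every 10 seconds with for: 20s, as in MySecondAlert.
timeline = alert_states([False, True, True, True, False], 10, 20)
print(timeline)
```

Without a for key (for_s set to 0), the same function reports firing at the first evaluation where the condition is true, which matches the behavior of MyFirstAlert.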
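We can also add labels (extra key/value pairs attached to the alert, such as a severity) and annotations (descriptive text, such as a summary) to our alerting rules. Note that the alerting rules are also visible in the Status > Rules menu, just like the recording rules.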
By default, rules are evaluated every minute. We can change that in the Prometheus configuration file: under the global key, we can specify a different value with evaluation_interval.
These alerts are a good way to spot particular moments in the life of a service, but for now they do not trigger any action.
