Troubleshooting Duplicate Alerts in Prometheus' Alertmanager
Duplicate alerts kept getting triggered. I wasted hours trying to fix them, but learnt something along the way.
At my workplace, we have Prometheus' Alertmanager set up with receivers for JIRA automation, AWS Incident Manager, Slack and other webhooks, with alerts triggered by PromQL expressions.
I was assigned a task to expose a method that triggers a Slack message to a particular channel. Fairly straightforward. Alertmanager already has an API for this; I just had to hit it and write a matcher to ensure the triggered alert goes to the right Slack channel.
An alert gets triggered with this request:
curl --request POST \
  --url https://alertmanager-instance.company.com/api/v1/alerts \
  --header 'Content-Type: application/json' \
  --data '[
    {
      "annotations": {
        "summary": "Prod down, company in huge loss."
      },
      "labels": {
        "alertname": "A very serious alert - SEV1",
        "team": "intern-oncall"
      }
    }
  ]'
The matcher is on the team label and routes the alert to slack-receiver:
- match:
    team: intern-oncall
  receiver: slack-receiver
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10m
  repeat_interval: 4h
  continue: true

- name: 'slack-receiver'
  slack_configs:
    - channel: '#oncall-alerts'
      text: " {{ .CommonAnnotations.summary }}"
      api_url: "https://hooks.slack.com/services/...."
      send_resolved: true
Code pushed. Deployed to prod. (On track for a 5-star rating in my year-end appraisal.)
A few days later, when the team that had asked for this started triggering alerts, they reported that they were getting Slack messages twice for a single alert.
On checking, I saw that the second message arrived exactly 10 minutes after the first.
Ah, yes, I had set group_interval to 10 minutes.
From the Alertmanager docs:
# How long to wait before sending a notification about new alerts that
# are added to a group of alerts for which an initial notification has
# already been sent. (Usually ~5m or more.) If omitted, child routes
# inherit the group_interval of the parent route.
[ group_interval: <duration> | default = 5m ]
So group_interval shouldn't be the issue: it only controls how long Alertmanager waits before notifying about new alerts that join a group after the initial notification has gone out. And per the logs, only one alert was being triggered anyway.
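For context, this is the situation group_interval is actually meant for: a second alert landing in an already-notified group. A sketch, reusing the same endpoint as above, with an illustrative extra label:

curl --request POST \
  --url https://alertmanager-instance.company.com/api/v1/alerts \
  --header 'Content-Type: application/json' \
  --data '[
    {
      "annotations": {
        "summary": "A second, related failure."
      },
      "labels": {
        "alertname": "A very serious alert - SEV1",
        "team": "intern-oncall",
        "instance": "prod-02"
      }
    }
  ]'

Because this shares the alertname of the first alert, it joins the existing group, and its notification would be held for up to the 10m group_interval after the group's first notification, which is not what we were seeing.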
Also, to rule out multiple alerts landing in the same group, I appended a UUID to the alertname label (the label we group_by), so that every alert would form its own group. We were still seeing duplicate alerts exactly one group_interval later.
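Concretely, the labels in each request looked roughly like this (a sketch; the UUID suffix is illustrative):

"labels": {
  "alertname": "A very serious alert - SEV1 - 9b1d2c4e",
  "team": "intern-oncall"
}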
The create-alert API has another request body param, endsAt, which indicates the alert's end time; after this time, the alert is considered resolved.
I set endsAt to 1 minute in the future, so that the alert fires after the initial group_wait of 10s and ends well before the group_interval of 10m elapses. So the duplicate shouldn't be sent, right? But no, alerts were still repeated after the 10m group_interval. :angry:
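The request at this point looked roughly like this (a sketch; the timestamps are illustrative, with endsAt one minute after startsAt):

curl --request POST \
  --url https://alertmanager-instance.company.com/api/v1/alerts \
  --header 'Content-Type: application/json' \
  --data '[
    {
      "annotations": {
        "summary": "Prod down, company in huge loss."
      },
      "labels": {
        "alertname": "A very serious alert - SEV1",
        "team": "intern-oncall"
      },
      "startsAt": "2024-01-01T10:00:00Z",
      "endsAt": "2024-01-01T10:01:00Z"
    }
  ]'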
By now I was starting to question my sanity. A StackOverflow comment suggested setting a high value for group_interval. I put in 12h, but the alerts still got repeated, now 12h apart.
I was beginning to think this approach would not work, and I had a fallback in mind: scrape Alertmanager and call a Slack webhook myself to send the alert.
I decided to go through the configs one last time.
- match:
    team: intern-oncall
  receiver: slack-receiver
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10m
  repeat_interval: 4h
  continue: true

- name: 'slack-receiver'
  slack_configs:
    - channel: '#oncall-alerts'
      text: " {{ .CommonAnnotations.summary }}"
      api_url: "https://hooks.slack.com/services/...."
      send_resolved: true
send_resolved??
# Whether to notify about resolved alerts.
[ send_resolved: <boolean> | default = false ]
I had this set to true. Who is triggering resolved alerts? Could this be the duplicate alert?
I set it to false, and I no longer got duplicate alerts!
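The fix was a one-line change in the receiver (shown here as a sketch, with everything else unchanged):

- name: 'slack-receiver'
  slack_configs:
    - channel: '#oncall-alerts'
      text: " {{ .CommonAnnotations.summary }}"
      api_url: "https://hooks.slack.com/services/...."
      send_resolved: false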
During all this time, I had not actually been getting duplicate alerts. The second message was the resolved notification: Alertmanager sends a notification for the same alert once it resolves, which in our case happened as soon as endsAt passed. But the resolved alert is an update to the same alert group, and with a group_interval of 10m, that notification only went out 10m after the first one.
I wasted quite some time and mental space on this, but it turned Alertmanager, once a black box for me, into a white box.
This is the ideal flow of an alert, driven by the values of these configuration parameters: startsAt, endsAt, group_by, group_wait, group_interval, repeat_interval and send_resolved. One or more states in the flow can be skipped. For example, setting endsAt to a time before group_wait has elapsed can prevent the alert from ever being fired; it goes straight from unprocessed -> pending -> resolved.
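Putting it together, here is the route and receiver from above, annotated with how each parameter shapes a single alert's lifecycle (a sketch based on the configuration in this post):

# Route: decides grouping and notification timing.
- match:
    team: intern-oncall          # alerts carrying this label enter this route
  receiver: slack-receiver
  group_by: ['alertname']        # alerts with the same alertname share one group
  group_wait: 10s                # initial notification goes out 10s after the group is created
  group_interval: 10m            # updates to the group (new or resolved alerts) are batched and sent at most every 10m
  repeat_interval: 4h            # a still-firing group is re-notified every 4h
  continue: true                 # keep evaluating sibling routes after this one matches

# Receiver: decides what gets delivered to Slack.
- name: 'slack-receiver'
  slack_configs:
    - channel: '#oncall-alerts'
      text: " {{ .CommonAnnotations.summary }}"
      api_url: "https://hooks.slack.com/services/...."
      send_resolved: false       # true would also send a "resolved" notification, delayed by group_interval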
I'll have to be more careful next time when copy-pasting boilerplate configurations.