VMware Aria Operations Alert Hygiene
Alerts, so many Alerts! I just logged into my Aria Operations lab, went to the Troubleshoot - Alerts tab and there are thousands of Alerts.
Alerts can be super powerful, but they can also be noisy and generate spam if you've configured notifications for them, so it's important to maintain proper Alert hygiene. What does that look like?
Well, first you'll want to make sure your Alerts are being disposed of properly, which you can confirm via Administration - Global Settings - Data Retention.
The first setting, Symptoms/Alerts, sets the number of days to keep canceled (Inactive) Alerts and Symptoms. The default is 30 days; I've adjusted mine to 10 days to clean out old Inactive Alerts and Symptoms. Documentation can be found here.
The second setting, External Event Based Active Symptoms, sets the number of days to keep Active Symptoms based on External Events. For example, if you're running the Cisco UCS Management Pack, hardware faults raised by Cisco UCS Manager generate Aria Operations Events, which in turn trigger Symptoms. There is no clearing mechanism for these externally generated events other than clearing them manually or letting this retention setting clear them. I've adjusted this to 10 days as well, which is the minimum.
Adjusting just these two settings will help maintain better Alert hygiene, but there are still times you'll want to bulk cancel/delete Alerts, and there are a couple of ways to do that. You can always do it manually via the UI. For example, to cancel/delete all ESXi Host System Alerts, go to Troubleshoot - Alerts - Group by Object Type - select Host System - ACTIONS - Cancel Alert.
Once canceled, you'll notice all Alerts are Inactive, as shown by the Inactive light bulb in the Status column.
You can then delete those same alerts by selecting the Host System grouping - ACTIONS - Delete Canceled Alert.
Once complete, all Alerts against ESXi Hosts have been removed. You can do the same with other groupings. For example, to delete all Alerts from last week, Group by Time, select Last Week, then go through the same cancel-then-delete exercise.
There's another way to do this: programmatically via the API. You can delete all canceled (Inactive) Alerts with DELETE /api/alerts/bulk.
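A minimal Python sketch of that call could look something like the following. The host name, the /suite-api base path, the token placeholder, the vRealizeOpsToken Authorization scheme, and the optional query body are all assumptions on my part, so check them against the API documentation for your version. The request is only built here, not sent.

```python
import json
import urllib.request

# Hypothetical values: replace with your own Aria Operations host and token.
BASE_URL = "https://aria-ops.example.com/suite-api"  # base path is an assumption
TOKEN = "REPLACE_WITH_TOKEN"


def build_bulk_delete_request(base_url=BASE_URL, token=TOKEN, query=None):
    """Build (but do not send) a DELETE /api/alerts/bulk request."""
    headers = {
        "Accept": "application/json",
        # Authorization scheme is an assumption; check your version's API docs.
        "Authorization": f"vRealizeOpsToken {token}",
    }
    data = None
    if query is not None:
        # Optional filter body; its shape is an assumption, not from this post.
        data = json.dumps(query).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        f"{base_url}/api/alerts/bulk", data=data, headers=headers, method="DELETE"
    )


request = build_bulk_delete_request()
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to double check the method and URL before anything gets deleted.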
Once run, all Inactive Alerts will be deleted. Below you will see a before and after of my Inactive Alerts.
Now, if you first need to cancel Alerts (to make them Inactive), use POST /api/alerts/query to get each Alert's UUID. In my case, I'm capturing all Active Critical, Immediate, and Warning Alerts.
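As a rough sketch, the request body for that query might be built like this in Python. The field names activeOnly and alertCriticality, and the upper-case criticality values, are my assumptions about the alert query schema rather than something shown in this post, so verify them against the API documentation.

```python
import json


def build_active_alert_query(criticalities=("CRITICAL", "IMMEDIATE", "WARNING")):
    """Return a body for POST /api/alerts/query that matches Active Alerts."""
    return {
        # Both field names are assumptions about the alert query schema.
        "activeOnly": True,
        "alertCriticality": list(criticalities),
    }


# Print the JSON body that would be sent with the POST.
print(json.dumps(build_active_alert_query(), indent=2))
```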
Once executed, you'll get back a list of Alerts, each of which includes an alertId field. Pass that alertId in the uuids field, along with the cancel action, to the POST /api/alerts endpoint to cancel the Alert.
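To illustrate, here's a small Python sketch that pulls each alertId out of a query response and builds one cancel body per Alert. The top-level "alerts" key, the sample UUIDs, and passing the cancel action as an action=cancel query parameter are assumptions for illustration; only the alertId and uuids field names come from the steps above.

```python
import json


def cancel_bodies(query_response):
    """Return one POST /api/alerts cancel body per Alert in a query response."""
    alerts = query_response.get("alerts", [])  # "alerts" key is an assumption
    return [{"uuids": [alert["alertId"]]} for alert in alerts if "alertId" in alert]


# Made-up sample response, for illustration only.
sample_response = {
    "alerts": [
        {"alertId": "11111111-1111-1111-1111-111111111111"},
        {"alertId": "22222222-2222-2222-2222-222222222222"},
    ]
}

for body in cancel_bodies(sample_response):
    # Each body would be POSTed to /api/alerts with the cancel action,
    # for example as an action=cancel query parameter (an assumption).
    print(json.dumps(body))
```

I've kept it to one Alert per call here, since that's all the steps above describe.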
You can now delete the Inactive Alert directly with DELETE /api/alerts/bulk, like we did previously. Ideally, we'd also have the ability to bulk cancel Alerts; I've opened a feature request for it. For more information about the Aria Operations API, go here. Enjoy!