No matter how careful you are. No matter how many times you’ve checked your parameters and state conditions. No matter how many times you tested it out of production.

It happens.

Some misconfigured rule – or even an event that happens much differently in a production environment than in test – begins firing off alerts. Maybe you don’t notice it right away – perhaps you haven’t set up notifications for this particular rule.

But then, one day you start getting emails from your RMS with scary event IDs like 2115 (Data source not receiving a response), 25017 (Backlogged event processing) and 29202 (Inconsistent database state).

So you decide to investigate, and open up the Operations Console.

Only… wow, it’s running a lot slower than it normally does. Insanely slow.
You click on Monitoring > Active Alerts – and then wait. And wait some more, as our once friendly green progress bar seems to start taunting you. So you lock the desktop and go chat up that new girl they hired. Wow, she’s pretty amazing, right? Funny and smart as a whip, too.

Feeling happy and content after working your suave IT skills on her, you literally float back to your desk and unlock your desktop. Wasn’t there something bothering you before? Oh well, must not have been all that important. You peek up from your cube and catch a glimpse of her, then move those eyes down and see your still-open Operations Console. The evil green bar is still chugging away. But then you also see why…

A rogue rule quickly causes the alert count to soar to over 140,000


And your nemesis, the green progress bar, still keeps going. That number is rising faster than your blood pressure right now.

Must be a bug, eh? Ok, well, we’ll just check it via SQL to be sure – so you open SQL Studio and run

Select Count(*) from Alert

And then you’re informed, without any of the gentleness of a WWII nurse as depicted in the movies, that you have a lot of open alerts.

Over 250,000 alerts according to SQL


Wow! You better fix this!
And you better do it on the RMS, because it’s taking forever from your desktop.
You already have a general idea of which rule did it – that active alerts panel should be filled with it. So your first stop is to get back to the Authoring pane and either disable that rule or set up some proper alert suppression. Then we just have to deal with cleanup.

You turn, as always, to our friend PowerShell to help us out. Surely the easiest and most obvious solution to this problem is to run

$alerts = Get-Alert | Where-Object {$_.Name -match "MyRule"}
ForEach($alert in $alerts) {
    $alert.ResolutionState = 255 # Close the alert
    $alert.Update("")
}

Then just wait for the nightly alert grooming to happen, or nudge it along with a SQL exec p_AlertGrooming.

Only, when you try to do it, you get an OutOfMemory exception.

Out of memory!


Now what to do?! The console is cripplingly slow – if you had to close the alerts that way your company would have gone bankrupt during the Dot Com Re-Burst of 2799! And when you try with PowerShell, you run out of memory!

That’s where I was, until I talked to an unnamed friend from MS who really helped me out. (If you want to be named, just let me know. Better to err on the side of caution and all that.) That, combined with hindsight, allows me to help you out as well!

How To Clean Up an Alert Storm

  1. Try the console. We’re going to assume it’s running slower than <Insert joke about large celebrities in the 1980s doing something they’re known for>, so we’ll move on.
  2. Try the Command Shell. $alerts = Get-Alert |? {($_.ResolutionState -eq 0) -and ($_.Name -match "Rule name if you know the naughty one")}  – Running out of memory still?
  3. Try the same command, only instead of piping it to Where-Object, use the built-in -criteria filter.
    $alerts = Get-Alert -criteria 'WHERE ResolutionState = 0 AND Name LIKE ''%Rule Name%'''
  4. Still OOM? Try running both of those commands on the RMS, or another management server. Pick the one with the most memory, and hope for the best.
  5. Still receiving Out of Memory exceptions? Let’s stop using the OS to manage our memory. Open RegEdit and navigate to HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Config Service – change the value of Should Manage Memory from False to True. Stop and restart the config service.
  6. Now try your command again. It will take some time, but it will complete. Now let’s try running the whole script to fix this:
    $alerts = Get-Alert | Where-Object {($_.ResolutionState -eq 0) -and ($_.Name -match "MyRulename")}
    ForEach($alert in $alerts) {
        $alert.ResolutionState = 255 # Close the alert
        $alert.Update("")
    }

    (Alternatively, you can use the Resolve-Alert cmdlet, but from testing it’s not quite fast enough to keep up with the next step.)
  7. Now when you ran step 6, it probably gave you a lot of errors when attempting to update the alert. That’s because there’s a small window of freshness to your alert object, and if you don’t update it within that window it becomes stale and unable to be used. To fix that, change the ForEach to look like this:
    ForEach($alert in $alerts) {
        $freshAlert = Get-Alert $alert.id
        $freshAlert.ResolutionState = 255
        $freshAlert.Update("")
    }
    That will grab a fresh version of that alert and update it.
  8. But what if you have thousands upon thousands of alerts? The above solutions could conceivably take days to run. Don’t worry, there’s a way around that, too.
    Before I show you, please note that this METHOD IS NOT SUPPORTED BY MICROSOFT and use of this method could possibly BLACKLIST YOUR OpsMgr INSTALL. It is the answer given out occasionally though, much to the dismay of the product group, so use that information how you’d like.
  9. Connect to your Operations Manager database and run the following update. As written it closes every open alert, whatever rule raised it, but you could narrow it down with an additional AND RuleName = 'My Rule Name'
    UPDATE Alert
    SET ResolutionState = 255
    WHERE ResolutionState = 0 AND TimeResolved IS NULL
  10. When that’s completed, you’ll need to update the TimeResolved via:
    UPDATE Alert
    SET TimeResolved = '20-06-20 00:00:00.000'
    WHERE ResolutionState = 255 AND TimeResolved IS NULL

    Make TimeResolved some day in the past, older than your alert grooming window, so grooming will remove them.
  11. Either wait overnight until the grooming jobs kick off or run
    Exec p_AlertGrooming
  12. You’re done. Now don’t do it again!
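
Steps 3 and 7 can be folded into one helper. What follows is only a sketch, assuming you are in the OpsMgr Command Shell on PowerShell 2.0 or later (needed for try/catch). Close-RogueAlert is a name I made up for illustration, not a built-in cmdlet; it uses nothing beyond the Get-Alert calls and the Update() method already shown above.

```powershell
# Sketch only: assumes the OpsMgr Command Shell, where Get-Alert and the alert
# object's Update() method (both used in the steps above) are available.
# Close-RogueAlert is a hypothetical helper name, not part of OpsMgr.
function Close-RogueAlert {
    param(
        [Parameter(Mandatory = $true)]
        [string]$Criteria   # the same criteria string you pass in step 3
    )

    $closed = 0
    $failed = 0

    # Filter with -Criteria (step 3) instead of piping everything to Where-Object.
    $alerts = @(Get-Alert -Criteria $Criteria)

    foreach ($alert in $alerts) {
        try {
            # Re-fetch by ID so we never update a stale alert object (step 7).
            $freshAlert = Get-Alert $alert.Id
            $freshAlert.ResolutionState = 255   # 255 = Closed
            $freshAlert.Update("")
            $closed++
        }
        catch {
            # Count the failure and keep going rather than aborting the whole run.
            $failed++
        }
    }

    # Return a small summary so you can see how far the run got.
    New-Object PSObject -Property @{ Closed = $closed; Failed = $failed }
}
```

Call it as Close-RogueAlert -Criteria $criteria, where $criteria holds the string from step 3, and it hands back an object with Closed and Failed counts. Re-fetching costs an extra round trip per alert, so on a truly enormous backlog this will still be slow, which is exactly why steps 8 onward exist.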

© 2012 Pavleck.NET