In MOM 2005, virtually everything was a rule. A rule looked for an event in the Event Viewer, a line in a log file, a return code from a script, etc., and fired off an alert (or took some other action). It was essentially ‘dumb’, because it had no idea whether an event it had raised was ever fixed. It just fired an alert every time it saw one.
Enter OpsMgr 2007. It introduced us to the old concept of the ‘monitor’. The monitor is a multi-state check. It watches for multiple items: something will set a particular item into a failed or degraded state, and there is a corresponding event that marks it as healthy again. This is wonderful, as it helps minimize the number of open alerts sitting in your system at any given time. Fewer open alerts means more relevant information to look at.
When it comes to core Windows monitors, it works beautifully, 100% of the time. If you cross a memory threshold, an event is created and an alert goes out (if you’ve set it up to alert). When memory usage drops back below the threshold, the monitor marks that object as Healthy again and, if you’ve allowed it to, auto-closes the alert.
Where this doesn’t work beautifully and 100% of the time is when you rely on third-party agents and management packs. I’ll use the HP Management Packs as an example, because that’s what I’ve been facing recently.
OpsMgr knows about hardware events on an HP machine because the HP agents themselves place an event in the Event Viewer and/or send an SNMP trap about it. This works flawlessly to create an event in SCOM about an unhealthy object. What doesn’t work perfectly is the corresponding event that marks that system as healthy again.
The reason for this seems to depend on the exact configuration of a server, the version of the HP agents, and the actual event itself. If there is an event, such as a power supply failing, the log is populated and SCOM creates an alert saying “Power Supply #1 degraded.” When that power supply is replaced, the alert won’t necessarily auto-resolve, because instead of logging “Power Supply #1 Healthy”, the HP agents might log “Power Supply (Serial number: FD30401104-P) Inserted into Bay #0”. The monitor isn’t looking for that, so it isn’t aware that this is the corresponding ‘good’ event, and the alert stays open.
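To make the mismatch concrete, here is a minimal sketch of a two-state check. It is not how the HP management pack is actually written; the two patterns are assumptions made up for illustration, and only the event texts come from the example above.

```powershell
# Hypothetical patterns, invented for this illustration only.
$unhealthyPattern = 'Power Supply #\d+ degraded'
$healthyPattern   = 'Power Supply #\d+ Healthy'

# The two messages from the example: the failure, then the replacement.
$events = @(
    'Power Supply #1 degraded.'
    'Power Supply (Serial number: FD30401104-P) Inserted into Bay #0'
)

$state = 'Healthy'
foreach ($text in $events) {
    if     ($text -match $unhealthyPattern) { $state = 'Degraded' }
    elseif ($text -match $healthyPattern)   { $state = 'Healthy' }
    # Any other message is ignored, so the state stays as it was.
}

$state
```

The "Inserted into Bay" message matches neither pattern, so the check is still sitting in its degraded state after the hardware has been swapped.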
So theoretically you could replace a failing piece of hardware, such as a power supply, and the alert wouldn’t auto-resolve. If that same PSU then died in the future, no new alert would be raised, literally leaving you ‘powerless’ to know what is going on.
Now, in a normal deployment of OpsMgr this isn’t too large of a concern. There are always eyes on the console or emails being sent. Someone will see it, fix it, then ensure the alert is closed.
The current situation I’m in, however, doesn’t work this way. SCOM is being used consoleless to monitor a group of monitoring tools. Essentially it’s here to keep ‘them’ honest, and to ensure there’s another level of defense to protect us and let us know when a failure has occurred.
Because of this, those slight discrepancies in the HP agents and the HP management pack aren’t acceptable. But OpsMgr really doesn’t have a way of being run without anyone paying attention to it – or does it?
It actually does. What I’ve set up at this site is a PowerShell script which runs every 4 hours and resolves all the open HP alerts. The HP agents themselves run a self-check every hour or so and log that “Power Supply #1” is still failed. Because we’ve already cleared that alert, SCOM will pick it up again and re-fire the event, the alert, and all that jazz. In essence, we’ve created a ‘nag’ feature in SCOM.
This is beneficial in our case, because the current setup of OpsMgr where I’m at is mainly there to watch the other monitoring tools. This ‘nag’ lets us know that the problem was either not taken care of, or was not alerted on – thus ‘keeping them honest’.
How we do all this is very simple – the OpsMgr Command Shell has almost everything we need.
We’ll use Get-Alert to bring back a list of all open HP events, and Resolve-Alert to close them, adding a comment that we automated this.
To find the HP alerts, we need to match against the MonitoringObjectFullName property inside the alert. Through trial and error, I noticed that every single HP object began with “HewlettPackard”. So we’ll match against that, picking all alerts that don’t have a resolution state of 255 (Closed).
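Here is a small sketch of that filter that you can try without a management group. The sample objects and their full names are hypothetical stand-ins for what Get-Alert returns, and [pscustomobject] assumes PowerShell 3.0 or later, which is newer than what OpsMgr 2007 shipped with.

```powershell
# Hypothetical sample alerts; only the two properties the filter uses are modeled.
$sampleAlerts = @(
    [pscustomobject]@{ MonitoringObjectFullName = 'HewlettPackard.Example.PowerSupply:srv01'; ResolutionState = 0 }
    [pscustomobject]@{ MonitoringObjectFullName = 'HewlettPackard.Example.Fan:srv02';         ResolutionState = 255 }
    [pscustomobject]@{ MonitoringObjectFullName = 'Microsoft.Windows.Computer:srv03';         ResolutionState = 0 }
)

$objectName = 'HewlettPackard'

# -match is a regular-expression match, so anchor with ^ to require that the
# name *begins* with the prefix. @() keeps the result an array even for one hit.
$openHPAlerts = @($sampleAlerts | Where-Object {
    ($_.MonitoringObjectFullName -match "^$objectName") -and ($_.ResolutionState -ne 255)
})

$openHPAlerts.Count
```

Only the first sample alert survives the filter: the second is already closed and the third isn’t an HP object. Without the ^ anchor, -match would also accept names that merely contain “HewlettPackard” somewhere in the middle.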
From there, we cycle through the alert array, passing each one to Resolve-Alert, along with a -comment – in my case I used “Closed by Powershell – see (link) for more details” with a link to the internal Wiki.
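The loop itself can be sketched like this. Resolve-SampleAlert is a hypothetical stand-in I’ve made up so the example runs anywhere; in the real script its place is taken by the OpsMgr Resolve-Alert cmdlet. The sample alerts and comment text are likewise illustrative, and [pscustomobject] again assumes PowerShell 3.0 or later.

```powershell
# Hypothetical stand-in for Resolve-Alert: marks the object closed (255)
# and records the comment, without touching a management group.
function Resolve-SampleAlert {
    param($Alert, [string]$Comment)
    $Alert.ResolutionState = 255
    $Alert.Comment = $Comment
}

# Hypothetical open alerts.
$openHPAlerts = @(
    [pscustomobject]@{ Name = 'Power Supply Degraded'; ResolutionState = 0; Comment = '' }
    [pscustomobject]@{ Name = 'Fan Failed';            ResolutionState = 0; Comment = '' }
)
$comment = 'Closed by PowerShell'

foreach ($alert in $openHPAlerts) {
    # Real script: Resolve-Alert -Comment $comment -Alert $alert
    Resolve-SampleAlert -Alert $alert -Comment $comment
}

# Count whatever is still open after the loop.
$stillOpen = @($openHPAlerts | Where-Object { $_.ResolutionState -ne 255 })
$stillOpen.Count
```

Swapping the stand-in for the real cmdlet is a one-line change, which also makes this a handy way to dry-run the script before letting it loose on production alerts.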
And that’s really all there is to it. Mind you, I’ve done a lot more in the script, as you’ll see below. It measures how long it took to bring up the alerts, counts how many there were per severity, records the repeat count, etc., then creates a PropertyBag and submits all the information to OpsMgr for reporting. It also logs a summary to the Event Viewer.
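The per-severity counting can be sketched with a hashtable instead of separate counter variables. This is an alternative to the switch statement used in the script, not what the script does. The sample alerts are hypothetical, and the ‘Unknown’ bucket is my own addition for any severity value that isn’t expected.

```powershell
# Hypothetical sample alerts; only Severity is modeled.
$sampleAlerts = @(
    [pscustomobject]@{ Severity = 'Warning' }
    [pscustomobject]@{ Severity = 'Error' }
    [pscustomobject]@{ Severity = 'Warning' }
    [pscustomobject]@{ Severity = 'SomethingElse' }
)

# One bucket per expected severity, plus a catch-all.
$counts = @{ Information = 0; Warning = 0; Error = 0; Unknown = 0 }

foreach ($alert in $sampleAlerts) {
    $key = [string]$alert.Severity
    if ($counts.ContainsKey($key)) { $counts[$key]++ } else { $counts['Unknown']++ }
}

$counts
```

Each entry could then be handed to the property bag with one AddValue call per key, and anything that lands in ‘Unknown’ is a hint that the severity names being matched need another look.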
Download SCOM-Resolve-HardwareAlerts.ps1
This script is best set up to run every 4 hours or so. It’s set up as a generic ‘timed script’ inside of SCOM. If you’d like more info on setting up SCOM to work properly with PowerShell, see Brian Wren’s post here.
Here’s the script:
# ==============================================================================================
#
# Microsoft PowerShell Source File — Created with SAPIEN Technologies PrimalScript 4.1
#
# NAME: SCOM-Resolve-HardwareAlerts.ps1
#
# AUTHOR: Jeremy D. Pavleck , JPavleck@GMail.com
# DATE : 6/11/2008
#
# COMMENT: When run, will gather all open HP alerts and mark them as resolved, setting a user
# defined comment as well. It will then log to the event viewer it has done so.
#
# NOTES: The "Object Name" we use to determine what rules we want to resolve comes from the
# MonitoringObjectFullName field of Get-Alert.
# Also, you'll need to either set this command to start in your SCOM2007 dir (by default
# C:\Program Files\System Center 2007) or edit Microsoft.EnterpriseManagement.OperationsManager.ClientShell.Startup.ps1
# in said directory and change the dot source reference from current directory to the complete path.
#
# When calling this from an OpsMgr scheduled command, use
# powershell -PSConsoleFile "C:\Program Files\System Center Operations Manager 2007\Microsoft.EnterpriseManagement.OperationsManager.ClientShell.Console.psc1" -command "& {C:\Script\Path.ps1}"
#
# ==============================================================================================
# Ensure that the OpsMgr snap-in is there
Get-PSSnapin -name Microsoft.EnterpriseManagement.OperationsManager.Client -ErrorAction SilentlyContinue
If (!$?) {
throw "OpsMgr Console not loaded - please run with -PSConsoleFile 'X:\Path\To\Microsoft.EnterpriseManagement.OperationsManager.ClientShell.Console.psc1'"
} else {
# CHANGE THIS to match the path in your system. Or don't.
. "C:\Program Files\System Center Operations Manager 2007\Microsoft.EnterpriseManagement.OperationsManager.ClientShell.Startup.ps1" # Load OpsMgr stuff.
}
# Create some counters
$iinfo = 0
$iwarn = 0
$ierr = 0
$icrit = 0
$iunk = 0
### Configuration Section ###
$objectName = "HewlettPackard" # All of the HP objects start with this
$comment = "Automatically Resolved via PowerShell" # Added to alert
# Create the SCOM Script API object, so we can shove this info into the database
$momapi = New-Object -comObject "MOM.ScriptAPI"
# Grab all alerts that match MonitoringObjectFullName and are not Closed
# Time the whole thing for no reason
$findAlertsTime = Measure-Command {
# Wrap in @() so a single matching alert still yields an array with a usable .Count
$openHPAlerts = @(get-alert | Where-Object {
($_.MonitoringObjectFullName -match $objectName) -and ($_.ResolutionState -ne 255)
})
}
# Let’s grab some stats about what we grabbed first, before we resolve them.
$openCount = $openHPAlerts.Count
$totalFindTime = ([datetime]($findAlertsTime.ticks)).ToString("HH:mm.ss")
# Create a property bag to hold values to send to SCOM
$pbag = $momapi.CreatePropertyBag()
$pbag.AddValue("Total_Open", $openCount)
$pbag.AddValue("Total_FindTime", $totalFindTime)
# Resolving them couldn’t be simpler
# Lets count the severities we’re clearing, though.
foreach($alert in $openHPAlerts)
{
switch ($alert.Severity) {
"Information" {$iinfo++}
"Warning" {$iwarn++}
"Error" {$ierr++}
"Critical" {$icrit++}
}
$progress += "Server: " + $alert.NetBiosComputerName + " - Rule '" + $alert.Name + "' - Repeat Count: " + $alert.RepeatCount + " `n";
# $pbag.AddValue("AutoResolveFor", $alert.NetBiosComputerName)
resolve-alert -comment $comment -Alert $alert
}
$pbag.AddValue("Info_HP_Alerts", $iinfo)
$pbag.AddValue("Warn_HP_Alerts", $iwarn)
$pbag.AddValue("Err_HP_Alerts", $ierr)
$pbag.AddValue("Crit_HP_Alerts", $icrit)
# Submit property bag to SCOM
$momapi.Return($pbag)
# Log eventviewer event to let us know what we did
# Severities: 1 = Error, 2 = Warning, 4 = Informational – "Script Name", "Event ID", "Severity", "Description"
$momapi.LogScriptEvent("SCOM-Resolve-HardwareAlerts.ps1", 926, 4, "Successfully resolved " + $openCount + " alerts. `nReport:`n" + $progress)