UCS Blades Power Off Unexpectedly

I was called into an interesting issue over the past week. I was told that a chassis' worth of UCS blades had powered off for no apparent reason, bringing down part of production. Initial troubleshooting turned up no real culprits: UCSM was clean of errors except for an IOM POST error. A show-tech was collected and a Sev1 case was opened with Cisco TAC. The on-call technician powered the servers back on by selecting them all in UCSM, right-clicking, and selecting Reset. The blades powered on and came back online without issue.

So what caused the blades to power off unexpectedly?

TAC combed the logs and asked if we either:

  1. Hit the power button on the front of the blades
  2. Selected multiple blades in UCSM, right-clicked, and selected Reset

The answer to number 2 was yes. That is when we were presented with bug ID CSCty26754, which states:

Symptom:

Blades power off unexpectedly, and stay in an off state.

Conditions:

A shallow discovery has happened which puts the blade into its desired power state. Some examples of actions that can trigger a shallow discovery are:

  • Loss of any link between the FI and the IOM
  • Reset of an IOM
  • Killing a process with debug plugin
  • Re-acknowledge of a chassis

Blades that have been powered on with the following methods will be left in this inconsistent power state:

  • Pressing the physical power button on the front
  • Clicking the reset button on the server in the equipment tab
  • Right-clicking the server in the list of servers on the equipment tab and selecting reset.

Workaround:

When a blade is powered off, only use the Power On button on the General tab to turn on the blade.

If the service-profile has a desired power state of Off, but the blade is actually On: Click the Set Desired Power State button that will appear on the General tab of the service-profile and change the desired power state to On. The Set Desired Power State button will disappear when the desired and actual power states match. Under the Status Details drop-down, the Desired Power State will be changed to Up.

It turns out UCS has the notion of an actual power state and a desired power state (more on this in a future blog post). Certain operations, such as the multi-select reset we used, can leave the two states out of sync. A mismatch is not in itself a problem, but a regression in UCS firmware versions 2.0 through 2.0(2l) means that a shallow discovery puts each blade into its desired power state, so a blade that is running with a desired state of Off gets powered off. In our case, the IOM POST error triggered the shallow discovery that powered off the blades.
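To make the interaction concrete, here is a small Python sketch of the behavior described in the bug text. It is only a toy model: the `Blade` class and the three functions are hypothetical names of mine, not UCSM objects or APIs, and I am assuming that the Power On button on the General tab sets the desired state along with the actual one.

```python
from dataclasses import dataclass


@dataclass
class Blade:
    """Hypothetical model of a blade; not a UCSM object."""
    name: str
    actual_power: str   # "on" or "off": what the hardware is doing
    desired_power: str  # "on" or "off": what the service profile asks for


def reset_from_server_list(blade: Blade) -> None:
    # Per the bug text, a reset (or the physical power button) turns the
    # blade on but leaves the desired power state untouched.
    blade.actual_power = "on"


def power_on_from_general_tab(blade: Blade) -> None:
    # Assumption: the documented workaround updates both states together.
    blade.desired_power = "on"
    blade.actual_power = "on"


def shallow_discovery(blade: Blade) -> None:
    # On affected firmware, a shallow discovery puts the blade into its
    # desired power state, regardless of what it is currently doing.
    blade.actual_power = blade.desired_power


risky = Blade("blade-1", actual_power="off", desired_power="off")
reset_from_server_list(risky)      # on, but desired is still off
shallow_discovery(risky)           # forced back to off

safe = Blade("blade-2", actual_power="off", desired_power="off")
power_on_from_general_tab(safe)    # on, and desired is on
shallow_discovery(safe)            # stays on

print(risky.name, risky.actual_power)
print(safe.name, safe.actual_power)
```

The blade that was reset ends up off after the shallow discovery, while the blade powered on through the General tab stays on, which matches what we saw in production.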

So how can you check whether your systems are vulnerable to this issue? First, check your UCS version and confirm whether you are running a version between 2.0 and 2.0(2l). Second, check the desired power state of each blade in your environment. To manually check the desired power state of each blade you can:

  1. Log into the UCSM GUI and, under either the Equipment or Servers tab, select each blade / service profile as appropriate. If you see a ‘Set Desired Power State’ option under the Actions section, the power states are not in sync. If you do not see that option, the power states are in sync. (I have requested a bug/enhancement be created for this as well: I believe that if the power states are in sync, the option should still exist under Actions, either grayed out or with the ability to change the desired power state.)
    [Screenshot: ucs-power-state-in-sync] Example of power states being in sync (no action required)
    [Screenshot: ucs-power-state-mismatch] Example of power states in mismatch (correction required)

  2. Log into the UCSM CLI. I am not sure of the exact commands here, but it must be possible.
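The first check, whether the firmware falls in the affected range, is easy to get wrong by eye once you have several domains. Below is a rough Python sketch of that comparison. It assumes version strings of the form `2.0(2l)` (major, minor, then a maintenance number and patch letter in parentheses); the function names are my own and nothing here queries UCSM.

```python
import re

# Assumed format: major.minor(maintenance + patch letter), e.g. "2.0(2l)".
_VERSION = re.compile(r"^(\d+)\.(\d+)(?:\((\d+)([a-z])\))?$")


def parse_version(text: str) -> tuple:
    """Turn a version string into a tuple that sorts in release order."""
    match = _VERSION.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    major, minor, maint, patch = match.groups()
    # A bare "2.0" sorts before any 2.0(x) build.
    return (int(major), int(minor), int(maint or 0), patch or "")


def is_affected(text: str) -> bool:
    """True if the version lies between 2.0 and 2.0(2l), inclusive."""
    return parse_version("2.0") <= parse_version(text) <= parse_version("2.0(2l)")


for version in ["1.4(3q)", "2.0(1s)", "2.0(2l)", "2.0(2m)", "2.0(3a)"]:
    print(version, "affected" if is_affected(version) else "not affected")
```

Treat the range boundaries as coming from this post rather than from the bug's official fixed-in list, and confirm against the bug details before relying on the result.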

As you can see, manually checking each blade is time-consuming and, depending on the size of your infrastructure, may not be feasible. In addition, while running a code version between 2.0 and 2.0(2l) you should check frequently to ensure the power states do not get out of sync. As such, I worked with an awesome Cisco programmer by the name of Eric Williams and we created a PowerShell script, shown below, capable of detecting as well as fixing a power state mismatch. I was able to get this script tied into our monitoring to ensure the power states stay in sync while an upgrade change request is processed and approved.

NOTE: This is not an officially supported cmdlet. Your mileage with this script may vary. Do not run this script in any production environment without first fully understanding it and, of course, testing it. Eric and I are not responsible for any problems you experience.
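Separately from the PowerShell script, the detection logic itself is simple enough to sketch. The Python below is only an illustration of the comparison a monitoring check would make. It does not connect to UCSM, and the inventory records, their field names, and the sample values are all hypothetical; in practice the data would come from whatever tool you use to query your UCS domain.

```python
def find_power_mismatches(blades):
    """Return the blades whose desired and actual power states differ."""
    return [b for b in blades if b["desired_power"] != b["actual_power"]]


def blades_at_risk(blades):
    """Names of blades that are on but whose desired state is off.

    These are the ones a shallow discovery would power off.
    """
    return [
        b["dn"]
        for b in find_power_mismatches(blades)
        if b["actual_power"] == "on" and b["desired_power"] == "off"
    ]


# Hypothetical inventory; the identifiers are illustrative only.
inventory = [
    {"dn": "sys/chassis-1/blade-1", "actual_power": "on", "desired_power": "on"},
    {"dn": "sys/chassis-1/blade-2", "actual_power": "on", "desired_power": "off"},
    {"dn": "sys/chassis-1/blade-3", "actual_power": "off", "desired_power": "off"},
    {"dn": "sys/chassis-1/blade-4", "actual_power": "on", "desired_power": "off"},
]

at_risk = blades_at_risk(inventory)
for dn in at_risk:
    print("MISMATCH:", dn)
```

A monitoring check built on this idea would alert whenever the at-risk list is non-empty, which is the condition we wanted to catch while waiting on the upgrade.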

More information on how to properly change blade power states can be found here.

© 2012 – 2014, Steve Flanders. All rights reserved.
