Month: June 2012

UCS Blades Power Off Unexpectedly

I was called in on an interesting issue this past week. I was told that a chassis' worth of UCS blades had powered off for no apparent reason, bringing down part of production. Initial troubleshooting turned up no real culprits. UCSM was clean of errors except for an IOM POST error. A show-tech was collected and a Sev 1 case was opened with Cisco TAC. The on-call technician attempted to power on the servers by selecting them all in UCSM, right-clicking them, and selecting Reset. The blades powered on and came back online without issue.
So what caused the blades to power off unexpectedly?

PDL + New HA Settings in vSphere 5.0 U1

I was recently reading the VMware vSphere Metro Storage Cluster Case Study, published in May 2012 and available here. One section on page 18 caught my attention:

Two advanced settings have been introduced in VMware vSphere 5.0 Update 1 to enable vSphere HA to respond to a PDL condition. The first setting, disk.terminateVMOnPDLDefault, is configured on a host level in /etc/vmware/settings and should be set to True by default. This is a per-host setting, and the host requires a reboot for it to take effect. This setting ensures that a virtual machine is killed when the datastore on which it resides enters a PDL state. The virtual machine is killed as soon as it initiates disk I/O on a datastore that is in a PDL condition and all of the virtual machine files reside on this datastore. If virtual machine files do not all reside on the same datastore and a PDL condition exists on one of the datastores, the virtual machine will not be killed. VMware recommends placing all files for a given virtual machine on a single datastore, ensuring that PDL conditions can be mitigated by vSphere HA. VMware also recommends setting disk.terminateVMonPDLDefault to True. A virtual machine is killed only when issuing I/O to the datastore. Otherwise, it remains active. A virtual machine that is running memory-intensive workloads without issuing I/O to the datastore might remain active in such situations.
The second setting is a vSphere HA advanced setting called das.maskCleanShutdownEnabled. It was introduced in VMware vSphere 5.0 Update 1 and is not enabled by default. It must be set to True on vSphere HA cluster(s). This setting enables vSphere HA to trigger a restart response for a virtual machine that has been killed automatically due to a PDL condition. This enables vSphere HA to differentiate between a virtual machine that was killed due to the PDL state and a virtual machine that has been powered off by an administrator.
VMware recommends setting das.maskCleanShutdownEnabled to True to limit downtime for virtual machines residing on datastores in a PDL condition. When das.maskCleanShutdownEnabled is not set to True and a PDL condition exists while disk.terminateVMonPDLDefault is set to True, virtual machine restart will not occur after virtual machines have been killed. This is because vSphere HA will determine that these virtual machines have been powered off or shut down manually by the administrator.
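
To keep the combinations straight, here is a rough sketch of the behavior the excerpt describes, written as a small Python decision function. It is only a model of the quoted text, not VMware code or any VMware API: the `VmState` class, the `pdl_outcome` function, and the outcome strings are all names made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class VmState:
    """Hypothetical description of one VM during a PDL event."""
    all_files_on_pdl_datastore: bool  # every VM file lives on the datastore in PDL
    issues_disk_io: bool              # the VM actually sends I/O to that datastore


def pdl_outcome(vm: VmState, terminate_vm_on_pdl: bool,
                mask_clean_shutdown: bool) -> str:
    """Model the outcome described in the case study excerpt.

    terminate_vm_on_pdl stands in for disk.terminateVMOnPDLDefault (per host),
    mask_clean_shutdown stands in for das.maskCleanShutdownEnabled (per cluster).
    """
    # The VM is killed only if the host setting is on, all of its files
    # sit on the affected datastore, and it issues I/O to that datastore.
    killed = (terminate_vm_on_pdl
              and vm.all_files_on_pdl_datastore
              and vm.issues_disk_io)
    if not killed:
        return "running"
    # Without the HA setting, the kill looks like an administrator power-off,
    # so vSphere HA does not restart the VM.
    if mask_clean_shutdown:
        return "killed, restarted by HA"
    return "killed, not restarted"


if __name__ == "__main__":
    vm = VmState(all_files_on_pdl_datastore=True, issues_disk_io=True)
    for host_setting in (False, True):
        for ha_setting in (False, True):
            print(host_setting, ha_setting, pdl_outcome(vm, host_setting, ha_setting))
```

Read this way, the host setting decides whether a VM is killed at all, and the HA setting only decides what happens after the kill.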

A couple things stood out to me:

VMworld Call for Papers – My Top 20

Phew…just went through the 1000+ sessions that made it into public voting for VMworld 2012 and cast my votes. Overall, there are a lot of great sessions and great presenters this year, so I am sure the voting will be tight. I figured I would share what I would call my top 20 votes for this year. For all of my votes, I made my decision based on several factors:

  1. Does the topic interest me?
  2. Is the abstract well written and detailed?
  3. Is the speaker/company known?

A couple important notes:

  • These are all my personal choices and not endorsements of any kind
  • My top 20 sessions are ordered based on session number and not my ranking preference