I was recently reading the VMware vSphere Metro Storage Cluster Case Study published May 2012 available here. One section that caught my attention stated (page 18):
Two advanced settings have been introduced in VMware vSphere 5.0 Update 1 to enable vSphere HA to respond to a PDL condition. The first setting, disk.terminateVMOnPDLDefault, is configured on a host level in /etc/vmware/settings and should be set to True by default. This is a per-host setting, and the host requires a reboot for it to take effect. This setting ensures that a virtual machine is killed when the datastore on which it resides enters a PDL state. The virtual machine is killed as soon as it initiates disk I/O on a datastore that is in a PDL condition and all of the virtual machine files reside on this datastore. If virtual machine files do not all reside on the same datastore and a PDL condition exists on one of the datastores, the virtual machine will not be killed. VMware recommends placing all files for a given virtual machine on a single datastore, ensuring that PDL conditions can be mitigated by vSphere HA. VMware also recommends setting disk.terminateVMonPDLDefault to True. A virtual machine is killed only when issuing I/O to the datastore. Otherwise, it remains active. A virtual machine that is running memory-intensive workloads without issuing I/O to the datastore might remain active in such situations.
The second setting is a vSphere HA advanced setting called das.maskCleanShutdownEnabled. It was introduced in VMware vSphere 5.0 Update 1 and is not enabled by default. It must be set to True on vSphere HA cluster(s). This setting enables vSphere HA to trigger a restart response for a virtual machine that has been killed automatically due to a PDL condition. This enables vSphere HA to differentiate between a virtual machine that was killed due to the PDL state and a virtual machine that has been powered off by an administrator.
VMware recommends setting das.maskCleanShutdownEnabled to True to limit downtime for virtual machines residing on datastores in a PDL condition. When das.maskCleanShutdownEnabled is not set to True and a PDL condition exists while disk.terminateVMonPDLDefault is set to True, virtual machine restart will not occur after virtual machines have been killed. This is because vSphere HA will determine that these virtual machines have been powered off or shut down manually by the administrator.
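The interaction between the two settings in the quoted passage can be summarized as a small decision function. This is only a sketch of the behavior as the case study describes it, not VMware code; the class name, field names, and outcome strings are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PdlScenario:
    """One VM on a host where a datastore has entered a PDL state (hypothetical model)."""
    terminate_vm_on_pdl: bool         # disk.terminateVMOnPDLDefault on the host
    mask_clean_shutdown: bool         # das.maskCleanShutdownEnabled on the HA cluster
    all_files_on_pdl_datastore: bool  # every VM file resides on the affected datastore
    vm_issues_disk_io: bool           # the VM actually sends I/O to that datastore


def pdl_outcome(s: PdlScenario) -> str:
    """Return the outcome the quoted passage describes for this VM."""
    # The VM is killed only if the host setting is on, all of its files are on
    # the PDL datastore, and it issues I/O to that datastore.
    killed = (
        s.terminate_vm_on_pdl
        and s.all_files_on_pdl_datastore
        and s.vm_issues_disk_io
    )
    if not killed:
        return "vm keeps running"
    # Without das.maskCleanShutdownEnabled, HA treats the kill as a manual power-off.
    if s.mask_clean_shutdown:
        return "vm killed, HA restarts it"
    return "vm killed, HA does not restart it"
```

For example, `pdl_outcome(PdlScenario(True, False, True, True))` lands in the last branch: the VM is killed but stays down, which is the combination the case study warns about.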
A couple things stood out to me:
- disk.terminateVMOnPDLDefault is a *host* level setting. The document states, “(A PDL) condition indicates that a device (LUN) has become unavailable and is likely to be permanently unavailable.” If that is the case, then more than likely all hosts that have the LUN attached (that is, the whole cluster) will be impacted. Even if this is not the case, why would you want only a subset of hosts in a cluster to have this setting enabled while another subset has it disabled? I believe this should be a cluster level setting.
- das.maskCleanShutdownEnabled *is not* enabled by default. The document states, “this setting enables vSphere HA to trigger a restart response for a virtual machine that has been killed automatically due to a PDL condition.” If this is the setting’s function, why is it not enabled by default? Even if disk.terminateVMOnPDLDefault is disabled, having das.maskCleanShutdownEnabled enabled should have no impact, right? I believe this should be enabled by default.
- disk.terminateVMOnPDLDefault *is* enabled by default, but das.maskCleanShutdownEnabled *is not* enabled by default. The document states, “(The das.maskCleanShutdownEnabled) setting enables vSphere HA to trigger a restart response for a virtual machine that has been killed automatically due to a PDL condition.” One would assume that the reason disk.terminateVMOnPDLDefault is enabled is so that VMware HA can restart the VM on another host if possible (otherwise the VM would remain hung and would need to be recovered manually). However, because das.maskCleanShutdownEnabled is disabled by default, the VM will be killed in a PDL situation due to the disk.terminateVMOnPDLDefault setting, but HA will not power it back on. This would be like configuring VMs to not automatically power on by default when HA is enabled. If disk.terminateVMOnPDLDefault is enabled by default then I believe das.maskCleanShutdownEnabled should be as well.
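Because the first setting is per host, nothing stops the hosts in a cluster from drifting apart. A minimal sketch of a consistency check follows, assuming you have already collected each host's value by some other means; the function names and host names are hypothetical.

```python
from collections import defaultdict


def group_hosts_by_setting(host_values: dict[str, bool]) -> dict[bool, list[str]]:
    """Group host names by their disk.terminateVMOnPDLDefault value."""
    groups: dict[bool, list[str]] = defaultdict(list)
    for host, enabled in sorted(host_values.items()):
        groups[enabled].append(host)
    return dict(groups)


def is_consistent(host_values: dict[str, bool]) -> bool:
    """True when every host in the cluster carries the same value."""
    return len(set(host_values.values())) <= 1


# Hypothetical cluster where one host was missed during configuration.
cluster = {"esx01": True, "esx02": True, "esx03": False}
if not is_consistent(cluster):
    for value, hosts in group_hosts_by_setting(cluster).items():
        print(f"disk.terminateVMOnPDLDefault={value}: {', '.join(hosts)}")
```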
As I continued to read the case study, I came across the following (page 28):
VMware recommends configuring advanced options disk.terminateVMOnPDLDefault and dasmaskCleanShutdownEnabled to True. If they are not configured (they are by default set to False), vSphere HA will not take any action and the virtual machines affected by a PDL might not be restarted. This is described in depth in the VMware vSphere 5.0 Update 1 Permanent Device Loss Enhancements section of this paper.
Oh no, a mismatch in the documentation! Earlier it was stated that disk.terminateVMOnPDLDefault was enabled by default, but now the document states it is disabled by default. (Side note: dasmaskCleanShutdownEnabled is a typo and should be: das.maskCleanShutdownEnabled.) This raised the question: is disk.terminateVMOnPDLDefault enabled or disabled by default? Searching the Internet for disk.terminateVMOnPDLDefault turned up an article by one of the co-authors: http://www.yellow-bricks.com/2012/03/16/permanent-device-loss-pdl-enhancements-in-vsphere-5-0-update-1-for-stretched-clusters/. In the comments section someone had also questioned the wording, and Duncan responded that both settings are disabled by default. Looking at a pair of servers I had running ESXi 5.0 U1, I confirmed that disk.terminateVMOnPDLDefault is *disabled* by default. I have reached out to Duncan requesting that the document be updated to avoid confusion.
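To check a host's effective value, you can read the contents of /etc/vmware/settings and fall back to False when the key is absent. The sketch below works on the file's text rather than on a live host. It assumes one `key = value` entry per line, optional double quotes around the value, `#` comments, and an exact-case key match; none of these format details are confirmed by the case study.

```python
def read_bool_setting(settings_text: str, key: str, default: bool = False) -> bool:
    """Return the boolean value of `key` in settings-file text, or `default` if absent.

    Assumed format (not confirmed by the case study): one `key = value` per line.
    """
    for raw_line in settings_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and (assumed) comment lines
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            return value.strip().strip('"').lower() == "true"
    return default


KEY = "disk.terminateVMOnPDLDefault"

# An empty settings file yields the default, which is False.
print(read_bool_setting("", KEY))
# A file that sets the option explicitly.
print(read_bool_setting("disk.terminateVMOnPDLDefault = True\n", KEY))
```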
In summary, I believe:
- disk.terminateVMOnPDLDefault should be a cluster level setting and not a host level setting.
- das.maskCleanShutdownEnabled should be enabled by default independent of whether disk.terminateVMOnPDLDefault is enabled or disabled.
UPDATE: I reached out to Duncan and below is what he said.
- disk.terminateVMOnPDLDefault is a VMkernel parameter and not an HA parameter, which is why it is defined at the host level. I asked him if there was a use case for setting this option without using HA and he could not think of one. With that said, he mentioned that the kernel does not do anything with HA advanced options, and thus the setting is not, and cannot be, implemented there. I guess my confusion stems from thinking of disk.terminateVMOnPDLDefault like host isolation response. Host isolation response is determined on a host-by-host basis and yet is defined at the cluster level. It also does not really make sense outside the context of HA. Being big on consistency, I would hope the disk.terminateVMOnPDLDefault parameter somehow moves under HA in the future.
- das.maskCleanShutdownEnabled was not enabled by default because it was introduced in an update pack (i.e. U1). So as not to unexpectedly impact functionality users rely on, changes of any sort are not taken lightly. Whether it will be enabled by default in a future release appears to be unknown at this time, but I would hope that it is.
© 2012, Steve Flanders. All rights reserved.
Thanks again for posting this! Wish I came across it sooner. The documentation mismatch in the case study confused me, had me thinking the first PDL setting was enabled by default. Does the order of implementation matter (CLI change on hosts for 1st setting, then HA Advanced setting change)? Rolling reboot of hosts required after updating the settings file? Never messed with the settings file, I’m assuming this can be done while host has live VMs. Looks like the settings file is currently empty. Would be nice if there was a little KB for this, unless I’m missing it.
Hey Stacy – the order of implementation should not matter as the settings only take effect during a PDL. In fact you can enable either one individually without the other if you desire, but be sure you understand the consequences! 🙂 No reboot should be needed for the settings to take effect as they are only checked during a PDL. Yes, you can edit the settings file on a live host running VMs and by default the file is empty. I agree that a KB would be helpful – I will check if one has been requested. I believe a request has been made to allow the settings to be made from a common place (e.g. GUI) in a future release. I hope this helps!