Phantom VM

For those of you who do not know, I am a VMware fanatic. From time to time, I will be posting blog entries on discoveries I have made, problems I have resolved, and general knowledge I would like to share. Last week, an interesting problem was brought to my attention.
I got a call from a colleague who was in the process of rebuilding an environment after a RAID array crashed due to multiple failed drives. This was a testing environment, so no monitoring was in place and no backups were kept. At the time, my colleague was redeploying software firewalls and ran into an issue where the firewalls refused to cluster together. After investigating the logs, it appeared that a duplicate IP address was causing the problem. A network engineer traced the MAC address through the switch fabric and found the duplicate IP coming from a VM port group NIC on one of the ESX servers. All VMs on that host were checked, but none of them had the IP address in question. So where was the phantom VM?

I suggested putting the host into maintenance mode so that all VMs would be migrated to other hosts in the cluster. I was told the operation completed successfully; however, the network engineer claimed the duplicate IP was still coming from the same ESX host. How could this be? Unfortunately, I was busy with several critical path issues and could not take a closer look. Since this was not a production environment, I suggested just rebooting the ESX host (the simplest, but most intrusive and least definitive, way to resolve the issue). While this did resolve the problem, it did not preserve any data, did not lead to a root cause, and may not have been an option in a production environment.
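As an aside, maintenance mode is normally entered from vCenter so that DRS can evacuate the running VMs, but it can also be triggered from the service console. The sketch below assumes the classic ESX ‘vmware-vim-cmd’ tool (‘vim-cmd’ on ESXi); exact availability and syntax vary by version:

# vmware-vim-cmd hostsvc/maintenance_mode_enter
# vmware-vim-cmd hostsvc/maintenance_mode_exit
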
Given the appropriate time, I would have run ‘esxtop’, ‘vm-support -x’, and ‘vmware-cmd -l’ to see what VMs (if any) were running or at least registered on the ESX host. Assuming VMs actually were on the host, I would have restarted the management services, as this might have made the VMs visible on the host:

# service mgmt-vmware restart
# service vmware-vpxa restart
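
For reference, the diagnostic commands mentioned above are run from the service console: ‘esxtop’ is interactive, ‘vm-support -x’ lists the world IDs of the running VMs, and ‘vmware-cmd -l’ lists the configuration files of the registered VMs:

# esxtop
# vm-support -x
# vmware-cmd -l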

Next, I would likely have tried to re-register the VM(s) on the ESX host, though I assume this would have failed because either the datastore path no longer existed or, if it did, it no longer contained the VM data due to the bad drives:

# vmware-cmd -s register </path/to/vm/vm.vmx>
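
Before attempting the registration, a quick look at the datastore would have confirmed whether the path and VM files still existed (the names below are placeholders, not the actual datastore or VM in question):

# ls /vmfs/volumes/
# ls -l /vmfs/volumes/<datastore>/<vm_name>/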

Assuming the VM files were no longer available, trying to stop the VM may not have been successful, as the VM would have only been running in memory. In either case, there are multiple ways to stop or even kill unresponsive VMs. More information is available in the following KB article: http://kb.vmware.com/kb/100434. For future reference, I created a quick one-liner that kills the VMs running on a host without requiring a reboot. WARNING: this will kill every VM running on the ESX host and as such should only be run when the host is in maintenance mode. I would advise against running this in a production environment unless you know what you are doing.

# for p in `ps x -o "%p %a" | grep vmkload | grep vmfs | awk '{print $1}'`; do kill -9 $p; done
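
As a sanity check, the same pipeline can be run without the kill first to see exactly which vmkload_app processes, and therefore which VMs, would be terminated:

# ps x -o "%p %a" | grep vmkload | grep vmfs

If a VM's configuration file is still accessible, a gentler per-VM option is ‘vmware-cmd </path/to/vm/vm.vmx> stop hard’; the one-liner above is only for the case where nothing else works.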

I have read plenty about phantom VMs in the communities and unresponsive VMs in the KB articles. These unique cases always intrigue me and while I do not hope for a similar situation in a production environment, I look forward to troubleshooting such problems in the future.

© 2010, Steve Flanders. All rights reserved.
