ESX + NFS Datastores

Over the last week, I have been in the process of applying the latest patches to one of the VI3 environments I manage. While looking for potential problems, I noticed that a single ESX server had lost access to all of its NFS datastores. All of the other hosts in the cluster, which connected to the same NFS datastores, appeared to be connected properly. I restarted the management services on the node, hoping to fix the issue and continue with the upgrade. Unfortunately, restarting the management services had no effect (remember, while restarting the management services should be one of the first troubleshooting steps and does solve a lot of VMware issues, it is not the only step). I verified that the host was configured properly and that no configuration changes had recently taken place. I also had the networking team verify that the switch ports were configured properly.
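
On ESX 3.x, a sequence along these lines covers the host-side checks described above (the NFS server address below is a placeholder):

    # Restart the host agent and the vCenter agent
    service mgmt-vmware restart
    service vmware-vpxa restart

    # List the NFS datastores the host believes it has mounted
    esxcfg-nas -l

    # Verify the VMkernel interface can reach the NFS server (placeholder address)
    vmkping 192.168.10.50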

All checks came back normal, so what was going on?

Continue reading

Have you restarted your management services today? (Cont.)

In my last blog entry, I spoke about the importance of restarting the management services when troubleshooting VMware ESX issues. One thing I have noticed is that if you SSH to an ESX host and restart the management services, you cannot cleanly exit the SSH session. To illustrate this point, SSH to a non-production ESX host and run the following commands:
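
    # A sequence along these lines (exact commands assumed):
    # restart the management services from within the SSH session
    service mgmt-vmware restart
    service vmware-vpxa restart

    # Now try to close the session -- the terminal hangs here instead of returning
    exit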

You will notice that the management services restart successfully, but your terminal hangs when you try to exit. What causes this, and how can you fix it?

Continue reading

Have you restarted your management services today?

There are two VMware ESX commands that every VMware ESX administrator should know and master:

  • service mgmt-vmware restart
  • service vmware-vpxa restart

You may notice that for almost every VMware problem I blog about, the first troubleshooting step is almost always restarting the management services. The reason for this is simple: it is the quickest and easiest way to fix the majority of ESX problems you will run into. I would compare it to restarting Windows in order to fix a Windows OS problem.
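
After running them, a quick sanity check is to confirm that both agents actually came back up; a minimal sketch, assuming the ESX 3.x service console:

    # The host agent runs as vmware-hostd; the vCenter agent runs as vpxa
    ps -ef | grep -i vmware-hostd | grep -v grep
    ps -ef | grep -i vpxa | grep -v grep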

So what do these two services actually do?

Continue reading

Unable to add an ESX host to vCenter

While this issue has been discussed at length both in the communities and in knowledge base articles (e.g., http://kb.vmware.com/kb/1003409), I cannot find a single KB article that lists every step I would perform to fix the issue and the order in which I would perform them.

There are two different kinds of ESX to vCenter connectivity problems that I would like to discuss:

  1. Initially adding an ESX host
  2. Reconnecting a disconnected ESX host

In my experience, the first problem is almost always caused by a DNS or network connectivity issue. To solve this, first try to add the ESX host by IP address instead of FQDN. If this works, ensure that the ESX host and the vCenter Server are configured with the appropriate DNS servers, that they can resolve other hosts through them (e.g., with nslookup), and, most importantly, that they can resolve each other. Once DNS is working properly, verify connectivity between the ESX host and the vCenter Server with ping and SSH.
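
A rough sketch of those checks, with placeholder hostnames, assuming the ESX service console on one side and a Windows-based vCenter Server on the other:

    # On the ESX host: check the configured DNS servers and resolve the vCenter Server
    cat /etc/resolv.conf
    nslookup vcenter.example.com

    # On the ESX host: confirm basic reachability to the vCenter Server
    ping -c 3 vcenter.example.com

    # On the vCenter Server (Windows command prompt): resolve and reach the ESX host
    nslookup esx01.example.com
    ping esx01.example.com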

In the case of a disconnected ESX host, the second problem listed above, simply restarting the management services fixes the issue most of the time:
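
    # On the disconnected host, restart the host agent and then the vCenter agent
    service mgmt-vmware restart
    service vmware-vpxa restart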

The important thing to note is that after restarting the management services, you may need to wait several minutes to confirm whether or not the issue is resolved (i.e., the host reconnects to vCenter). If this fails, in some rare cases closing the VI Client session and establishing a new connection resolves the issue. If the issue is still not resolved, disconnect the ESX host from vCenter and then manually remove the vmware-vpxa rpm from the host:
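
    # Stop the vCenter agent, then query for the exact package name and remove it
    # (the package name varies by version, so query for it rather than hard-coding it)
    service vmware-vpxa stop
    rpm -qa | grep -i vpxa
    rpm -e <package-name-returned-by-the-query>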

Finally, reconnect the ESX host. As a last-ditch effort, if the host is still listed as disconnected or remains “In Progress” during the “Add Host” function, restart the vCenter Server service.
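
That last step is performed on the vCenter (VirtualCenter) Server itself; assuming a Windows-based server where the service short name is vpxd (it shows up in the Services console as "VMware VirtualCenter Server"), something like:

    rem Restart the VirtualCenter Server service from a command prompt on the vCenter machine
    net stop vpxd
    net start vpxd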

While the KB articles suggest a variety of other options, the above steps have always resolved the issue for me.

Phantom VM

For those of you who do not know, I am a VMware fanatic. From time to time, I will be posting blog entries on discoveries I have made, problems I have resolved, and general knowledge I would like to share. Last week, an interesting problem was brought to my attention.

I got a call from a colleague who was in the process of rebuilding an environment after a RAID array crashed due to multiple failed drives. This was a testing environment, so no monitoring was in place and no backups were kept. At the time, my colleague was redeploying software firewalls and ran into an issue where the firewalls refused to cluster together. After investigating the logs, it appeared that a duplicate IP address was causing the problem. A network engineer traced the MAC address through the switch fabric and found the duplicate IP coming from a VM port group NIC on one of the ESX servers. All of the VMs on that host were checked, but none of them had the IP address in question. So where was the phantom VM?
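
One way to dig further is to search the VM configuration files on the host's datastores for the MAC address in question; a minimal sketch from the service console (the MAC below is a placeholder):

    # Search every .vmx file on the host's datastores for the offending MAC address
    grep -i "00:50:56:ab:cd:ef" /vmfs/volumes/*/*/*.vmx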

Continue reading