Over the last week, I have been applying the latest patches to one of the VI3 environments I manage. While looking for potential problems, I noticed that a single ESX server had lost access to all of its NFS datastores. All other ESX hosts in the cluster, which connect to the same NFS datastores, appeared to be connected properly. I restarted the management services on the node, hoping to fix the issue and continue with the upgrade. Unfortunately, this had no effect. (Remember: restarting the management services should be one of the first steps and does solve a lot of VMware issues, but it is not the only step.) I verified that the host was configured properly and that no configuration changes had recently taken place. I also had the networking team verify that the switch ports were configured properly.
All checks came back normal, so what was going on?
While many people may argue that the VMware logs should be the first thing checked with any VMware problem, I usually do some simple checks first. At this point, I was sufficiently happy with the overall configuration and decided to check the logs, where I noticed the following errors:
# tail -n 2 /var/log/vmkwarning
Apr 29 19:35:51 esx04 vmkernel: 0:00:14:39.644 cpu3:1070)WARNING: NFS: 982: Connect failed for client 0x8611a18 sock 134350840: I/O error
Apr 29 19:35:51 esx04 vmkernel: 0:00:14:39.644 cpu3:1070)WARNING: NFS: 898: RPC error 12 (RPC failed) trying to get port for Mount Program (100005) Version (3) Protocol (TCP) on Server (xxx.xxx.xxx.xxx)
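On a busy host, it can help to filter the log for just the NFS warnings rather than reading the last couple of lines. The following is a minimal sketch, assuming the same /var/log/vmkwarning path used above; the function names nfs_warnings and nfs_warning_count are hypothetical helpers, not VMware tools.

```shell
# Print only the NFS-related VMkernel warnings from a log file.
# Defaults to the path used in this post; pass another path as $1.
nfs_warnings() {
    grep 'WARNING: NFS:' "${1:-/var/log/vmkwarning}"
}

# Print how many NFS warnings the log contains (0 if there are none).
nfs_warning_count() {
    grep -c 'WARNING: NFS:' "${1:-/var/log/vmkwarning}"
}
```

Running nfs_warning_count before and after a change gives a quick indication of whether new NFS warnings are still being logged.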
In this particular cluster, the Service Console, VMotion, and VMkernel shared the same vSwitch, which consisted of two physical NICs. Since I was able to access the ESX host I knew the Service Console was working properly. Still thinking this could be a networking issue, I decided to run the vmkping command from the ESX host, which performs a ping operation using the VMkernel interface:
[[email protected] vmware]# vmkping <IP_OF_NAS>
PING <IP_OF_NAS> (<IP_OF_NAS>): 56 data bytes
sendto() failed (Input/output error)
I could not understand how a ping operation could produce an I/O error. I next tried a vmkping from the problem ESX host to the VMkernel IP address of another, working ESX host and received a response. Based on this, I decided to ping from the NAS device to the problem host's VMkernel IP address, but did not receive a response. I followed this up by performing the same tests on a working ESX host:
[[email protected] vmware]# vmkping <IP_OF_PROBLEM_ESX_VMKERNEL>
PING <IP_OF_PROBLEM_ESX_VMKERNEL> (<IP_OF_PROBLEM_ESX_VMKERNEL>): 56 data bytes
64 bytes from <IP_OF_PROBLEM_ESX>: icmp_seq=0 ttl=64 time=0.207 ms
[[email protected] vmware]# vmkping <IP_OF_NAS>
PING <IP_OF_NAS> (<IP_OF_NAS>): 56 data bytes
64 bytes from <IP_OF_NAS>: icmp_seq=0 ttl=64 time=0.207 ms
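Because the same vmkping checks get repeated from several hosts against several targets, a small loop can save some typing. This is only a sketch: the function name check_vmkernel_targets and the PING_CMD variable are hypothetical, the -c 1 option assumes vmkping accepts a packet count the way ping does, and success is detected by looking for the "bytes from" reply text shown in the output above.

```shell
# Command used for the probe; override PING_CMD to try the loop off-host.
PING_CMD="${PING_CMD:-vmkping}"

# Ping each target once over the VMkernel interface and print OK or FAIL.
# A target counts as reachable only if a "bytes from" reply line appears.
# Returns non-zero if any target failed.
check_vmkernel_targets() {
    failed=0
    for target in "$@"; do
        if $PING_CMD -c 1 "$target" 2>&1 | grep -q 'bytes from'; then
            echo "OK   $target"
        else
            echo "FAIL $target"
            failed=1
        fi
    done
    return $failed
}
```

For example, check_vmkernel_targets followed by the NAS IP and the VMkernel IPs of the other hosts, run once from the problem host and once from a working host, reproduces the comparison above.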
Following the tests, I had gathered the following information: (1) the ESX server that was having problems can see other, working ESX hosts, but not the NAS device; (2) the NAS device can see all ESX hosts except the ESX host experiencing I/O errors; (3) a working ESX server can see both the NAS device and the ESX host experiencing I/O errors.
Now what do you do? Since the VMkernel was in a vSwitch with two physical NICs, I decided to force the VMkernel to use one NIC and then tried the vmkping test again. I then forced the VMkernel to use the other physical NIC and ran vmkping one more time. Interestingly, on one physical NIC I received the I/O error and on the other I received no response. Could this be a networking issue?
As a last-ditch effort, I decided to take note of the VMkernel MAC address and then delete and re-create the VMkernel in the same vSwitch. As it turns out, this created a new VMkernel with the same MAC address as the original VMkernel, and after retesting I was able to confirm that the problem still existed. Next, I changed the IP address of the existing VMkernel so that it was in a different network block than the original IP address (I picked a random IP) and also changed its VLAN ID to something I knew did not exist. Then, I created a new VMkernel with the original IP information. This gave me a VMkernel with the original IP information, but a different MAC address. Finally, I deleted the original VMkernel as it was no longer being used. Upon making this change, vmkping worked as expected and the NFS datastores reconnected to the ESX host.
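The swap described above does not depend on any particular tool. For anyone who prefers the service console, the sketch below prints (but does not run) a possible command sequence. It rests on several assumptions: the function name plan_vmknic_swap and every argument value are hypothetical, the new VMkernel lives in a port group with a different name than the old one, and the esxcfg-vmknic and esxcfg-vswitch options are written as I recall them from ESX 3.x, so check each command's --help output before running anything.

```shell
# Print (not run) service console commands for swapping in a new VMkernel
# port that keeps the original IP settings. All values are examples.
plan_vmknic_swap() {
    vswitch="$1"    # vSwitch holding the VMkernel port, e.g. vSwitch0
    old_pg="$2"     # existing VMkernel port group name
    new_pg="$3"     # new port group name (must differ from old_pg)
    ip="$4"         # original VMkernel IP address
    mask="$5"       # original netmask
    vlan="$6"       # original VLAN ID
    park_ip="$7"    # throwaway IP in an unused network block
    park_vlan="$8"  # VLAN ID known not to exist

    # 1. Park the old port on an unused IP and VLAN to free the original IP.
    #    Assumption: -i/-n without -a changes an existing VMkernel NIC.
    echo "esxcfg-vmknic -i $park_ip -n $mask \"$old_pg\""
    echo "esxcfg-vswitch -v $park_vlan -p \"$old_pg\" $vswitch"
    # 2. Create a new port group and VMkernel port with the original settings.
    echo "esxcfg-vswitch -A \"$new_pg\" $vswitch"
    echo "esxcfg-vswitch -v $vlan -p \"$new_pg\" $vswitch"
    echo "esxcfg-vmknic -a -i $ip -n $mask \"$new_pg\""
    # 3. Remove the old VMkernel port, then list the ports to check the MAC.
    echo "esxcfg-vmknic -d \"$old_pg\""
    echo "esxcfg-vmknic -l"
}
```

Printing the plan first makes it easy to review each line and run it by hand, repeating the vmkping test once the new port is in place.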
I am extremely puzzled as to why this happened and how changing the MAC address had any impact. To rule out a MAC address conflict, I had someone from the networking team check for the old MAC address on the network; as expected, it was gone. While I am hoping this is a permanent fix, I am extremely interested in the underlying issue. If it ever happens in vSphere, I will open a support case in order to understand the issue better.
© 2010, Steve Flanders. All rights reserved.