I ran into an interesting problem the other day. I was brought into an environment that had a pair of ESXi 5.1 hosts connected to an iSCSI datastore. One host could see and access the datastore without issue, while the other showed no datastore attached. Per the administrator, both hosts had been mounted to the datastore, and the claim was that the environment had not been touched or changed in any way.
What was going on and how can you fix it?
I started by trying to access the host in a variety of ways, including SSH, RVC, and PowerCLI. Every attempt resulted in a hung session (i.e. there appeared to be no way to access the system). After a long period of time, roughly 20 minutes, the rescan operation finally completed, which allowed access back into the system. I immediately went to the log files to see what was going on. Looking in /var/log/vmkwarning.log, I found messages like the following:
2013-02-18T19:31:41.943Z cpu0:8854)WARNING: iscsi_vmk: iscsivmk_TaskMgmtIssue: vmhba37:CH:0 T:1 L:1 : Task mgmt "Abort Task" with itt=0xc1e (refITT=0xc1d) timed out.
2013-02-18T19:31:47.774Z cpu0:8854)WARNING: iscsi_vmk: iscsivmk_StopConnection: vmhba37:CH:0 T:1 CN:0: iSCSI connection is being marked "OFFLINE" (Event:4)
2013-02-18T19:31:47.774Z cpu0:8854)WARNING: iscsi_vmk: iscsivmk_StopConnection: Sess [ISID: 00023d000001 TARGET: iqn.1992-04.com.emc:cx.apm00123801401.b5 TPGT: 2 TSIH: 0]
2013-02-18T19:31:47.774Z cpu0:8854)WARNING: iscsi_vmk: iscsivmk_StopConnection: Conn [CID: 0 L: 10.16.199.5:52850 R: 10.16.199.2:3260]
2013-02-18T19:31:47.774Z cpu0:8854)WARNING: iscsi_vmk: iscsivmk_TaskMgmtIssue: vmhba37:CH:0 T:1 L:1 : Task mgmt "Abort Task" with itt=0xc20 (refITT=0xc1d) timed out.
2013-02-18T19:31:47.774Z cpu20:8431)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237:NMP device "naa.6006016001603000ffa95b8bf474e211" state in doubt; requested fast path state update...
2013-02-18T19:31:48.174Z cpu18:8210)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237:NMP device "naa.6006016001603000ffa95b8bf474e211" state in doubt; requested fast path state update...
2013-02-18T19:31:49.174Z cpu18:8210)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237:NMP device "naa.6006016001603000ffa95b8bf474e211" state in doubt; requested fast path state update...
2013-02-18T19:31:50.174Z cpu18:8210)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237:NMP device "naa.6006016001603000ffa95b8bf474e211" state in doubt; requested fast path state update...
2013-02-18T19:31:50.799Z cpu0:8854)WARNING: iscsi_vmk: iscsivmk_StartConnection: vmhba37:CH:0 T:1 CN:0: iSCSI connection is being marked "ONLINE"
2013-02-18T19:31:50.799Z cpu0:8854)WARNING: iscsi_vmk: iscsivmk_StartConnection: Sess [ISID: 00023d000001 TARGET: iqn.1992-04.com.emc:cx.apm00123801401.b5 TPGT: 2 TSIH: 0]
2013-02-18T19:31:50.799Z cpu0:8854)WARNING: iscsi_vmk: iscsivmk_StartConnection: Conn [CID: 0 L: 10.16.199.5:60510 R: 10.16.199.2:3260]
2013-02-18T19:32:01.383Z cpu0:8854)WARNING: iscsi_vmk: iscsivmk_StopConnection: vmhba37:CH:0 T:1 CN:0: iSCSI connection is being marked "OFFLINE" (Event:4)
2013-02-18T19:32:01.383Z cpu0:8854)WARNING: iscsi_vmk: iscsivmk_StopConnection: Sess [ISID: 00023d000001 TARGET: iqn.1992-04.com.emc:cx.apm00123801401.b5 TPGT: 2 TSIH: 0]
2013-02-18T19:32:01.383Z cpu0:8854)WARNING: iscsi_vmk: iscsivmk_StopConnection: Conn [CID: 0 L: 10.16.199.5:60510 R: 10.16.199.2:3260]
Well, these messages clearly did not sound good! It appeared that paths were having issues and that the iSCSI target was going offline. I confirmed on the storage array that the paths were successfully logged in and checked the general health of the array. When that checked out, I began to suspect the network. As you know, iSCSI uses TCP/IP to transfer data. To verify connectivity between the host and the array, I ran a vmkping:
~ # vmkping 10.16.199.2
PING 10.16.199.2 (10.16.199.2): 56 data bytes
64 bytes from 10.16.199.2: icmp_seq=0 ttl=128 time=0.327 ms
64 bytes from 10.16.199.2: icmp_seq=1 ttl=128 time=0.368 ms
64 bytes from 10.16.199.2: icmp_seq=2 ttl=128 time=0.345 ms
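As an aside, the path health I had checked on the array can also be confirmed from the host side. This was not part of my original session, but commands along these lines will show whether the iSCSI sessions are established and whether the paths to the device named in the log messages are active or dead:

# List the iSCSI sessions established by the software initiator
~ # esxcli iscsi session list
# Show the state of every path to the device called out in vmkwarning.log
~ # esxcli storage core path list -d naa.6006016001603000ffa95b8bf474e211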
The basic vmkping worked as expected. Next, I looked into the network configuration on the host. Knowing that vmk1 was the vmkernel interface being used for storage traffic, I found the following:
~ # esxcli network ip interface list | grep -A 12 vmk1
vmk1
   Name: vmk1
   MAC Address: 00:50:56:xx:xx:xx
   Enabled: true
   Portset: DvsPortset-0
   Portgroup: N/A
   VDS Name: vds01
   VDS UUID: xx xx xx xx xx xx xx xx-xx xx xx xx xx xx xx xx
   VDS Port: 254
   VDS Connection: 1088536350
   MTU: 9000
   TSO MSS: 65535
   Port ID: 50331696
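As another aside, if it is not obvious which vmkernel interface carries the iSCSI traffic, the port binding for the adapter named in the log messages (vmhba37 here) can be listed; something along these lines will show the bound vmknic:

# Show the vmkernel port(s) bound to the software iSCSI adapter from the logs
~ # esxcli iscsi networkportal list --adapter=vmhba37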
From this output I was now armed with two important pieces of information:
- vmk1 was on a distributed switch
- vmk1 was configured with a MTU of 9000
This meant that jumbo frames were intended to be used in the environment. Jumbo frames require end-to-end MTU configuration in order to work properly. Next, I took a look at the VDS configuration:
~ # esxcli network vswitch dvs vmware list | head -n 12
vds01
   Name: vds01
   VDS ID: xx xx xx xx xx xx xx xx-xx xx xx xx xx xx xx xx
   Class: etherswitch
   Num Ports: 512
   Used Ports: 29
   Configured Ports: 512
   MTU: 1500
   CDP Status: listen
   Beacon Timeout: -1
   Uplinks: vmnic0, vmnic1
   VMware Branded: true
Well, look at that: the VDS MTU is set to 1500 while vmk1 is configured for 9000. After confirming the proper MTU on the upstream switches and the storage array, I believed I had isolated the issue to the VDS. To confirm it was an MTU issue, I ran the vmkping command again, this time with two additional flags, -d to set the don't-fragment bit and -s to set the ICMP payload size (http://kb.vmware.com/kb/1003681):
~ # vmkping -d -s 8784 10.16.199.2
Well, look there: it does not appear that jumbo frames are being passed properly, as we never get a response. Actually, it is a little odd, as I would have expected to see the following:
~ # vmkping -d -s 8784 10.16.199.2
PING 10.16.199.2 (10.16.199.2): 8784 data bytes
sendto() failed (Message too long)
sendto() failed (Message too long)
sendto() failed (Message too long)
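A useful counter-test, not something I needed here but handy when isolating MTU problems, is to repeat the ping with a payload that fits inside a standard 1500-byte frame. If this succeeds while the jumbo-sized test fails, the problem is almost certainly MTU related:

# 1472-byte payload + 8-byte ICMP header + 20-byte IP header = 1500 bytes
~ # vmkping -d -s 1472 10.16.199.2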
In any case, let's fix the MTU on the VDS and try again.
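The MTU is a property of the distributed switch itself, so the change is made on the VDS in vCenter (under the switch's properties) rather than on the individual host. Once it has been raised to 9000, the change can be confirmed from the host with the same listing used earlier, for example:

# Confirm the distributed switch now reports the jumbo MTU
~ # esxcli network vswitch dvs vmware list | grep MTU

With the switch reporting an MTU of 9000, the jumbo frame test can be repeated: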
~ # vmkping -d -s 8784 10.16.199.2
PING 10.16.199.2 (10.16.199.2): 8784 data bytes
8792 bytes from 10.16.199.2: icmp_seq=0 ttl=128 time=0.473 ms
8792 bytes from 10.16.199.2: icmp_seq=1 ttl=128 time=0.310 ms
8792 bytes from 10.16.199.2: icmp_seq=2 ttl=128 time=0.311 ms
Voila! Now you can perform your typical two Rescan All… operations and you should be back in business.
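If you would rather stay in the shell than use the client, roughly the same effect can be had by rescanning the storage adapters and then rescanning for VMFS volumes:

# Rescan all storage adapters for new devices and paths
~ # esxcli storage core adapter rescan --all
# Rescan for VMFS volumes on the discovered devices
~ # vmkfstools -V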