Hung Server/VMs post ESXi 5.0 upgrade

I have a home lab running vSphere on some PowerEdge T110 servers. My environment was running 4.1, but I recently (6 months ago!) decided to upgrade to 5.0. After the upgrade, VMs on a single server started becoming inaccessible. When I tried to log into Tech Support Mode on that ESXi server, the server hung as soon as I typed in my password and never returned. A reboot resolved the issue, but it kept recurring. My other servers were working without issue, so I looked at hardware first. The server experiencing the issue reported no hardware issues and the ESXi logs looked relatively clean.
So what was going on?

NOTE: This is a post I had written several months back, but never had the chance to finish. The issue outlined below is one of the many reasons it has taken me so long to get my home lab back online!
With limited time to troubleshoot, I used the reboot workaround whenever the issue came up. As it occurred more than once a week, I knew I needed to find the time to fix it. A couple of weeks into the issue VMware released 5.0U1. I applied the update hoping it would address the issue, but unfortunately it did not. Next, I decided to take a look at the firmware versions on the server. I upgraded the BIOS, ESM, and RAID controller to the latest versions (individually and manually – more on this in a future post about Dell firmware). Within a few days the issue was seen again… Next, I searched for updated VIBs, as I had not used the Dell ESXi ISO. I was able to find a Dell-OMSA and an LSIProvider VIB, so I applied them to the server and performed the necessary reboots (note: 5.0U1 did provide a necessary updated driver that would otherwise need to be applied). Unfortunately, the issue was still not resolved… Then I found the Dell ESXi ISO. The README for the ISO listed the included drivers. A quick listing of the current versions on the server showed:

~ # esxcli software vib list | grep sas
scsi-megaraid-sas     5.34-1vmw.500.1.11.623860           VMware  VMwareCertified   2012-04-27
scsi-mpt2sas          06.00.00.00-6vmw.500.1.11.623860    VMware  VMwareCertified   2012-04-27
scsi-mptsas           4.23.01.00-5vmw.500.0.0.469512      VMware  VMwareCertified   2012-01-07
~ # esxcli software vib list | grep -i lsi
LSIProvider           500.04.V0.24-261033                 LSI     VMwareAccepted    2012-05-22
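Comparing a listing like the one above against the driver versions in a vendor README is easier once the rows are split into fields. The following is only a rough sketch, assuming the output has been copied off the host as plain text and that each row has the five whitespace-separated columns shown above; the names `Vib`, `parse_vib_list`, and `SAMPLE` are made up for illustration and are not part of any VMware tooling.

```python
from typing import List, NamedTuple


class Vib(NamedTuple):
    """One row of `esxcli software vib list` output (hypothetical helper type)."""
    name: str
    version: str
    vendor: str
    acceptance: str
    install_date: str


def parse_vib_list(text: str) -> List[Vib]:
    """Split pasted listing text into rows, assuming five columns per VIB."""
    vibs = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, shell prompt lines, and header text: none of
        # those split into exactly five fields.
        if len(fields) != 5:
            continue
        # Skip a dashed rule such as the one under the column header.
        if set(fields[0]) == {"-"}:
            continue
        vibs.append(Vib(*fields))
    return vibs


# Sample text copied from the listing in this post.
SAMPLE = """\
~ # esxcli software vib list | grep sas
scsi-megaraid-sas     5.34-1vmw.500.1.11.623860           VMware  VMwareCertified   2012-04-27
scsi-mpt2sas          06.00.00.00-6vmw.500.1.11.623860    VMware  VMwareCertified   2012-04-27
scsi-mptsas           4.23.01.00-5vmw.500.0.0.469512      VMware  VMwareCertified   2012-01-07
~ # esxcli software vib list | grep -i lsi
LSIProvider           500.04.V0.24-261033                 LSI     VMwareAccepted    2012-05-22
"""

# Print name and version only, ready to check by eye against a README.
for vib in parse_vib_list(SAMPLE):
    print(f"{vib.name:20} {vib.version}")
```

The versions still have to be compared by eye against whatever the vendor documents, since the version strings do not follow a single scheme that could be sorted reliably.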

It turns out the VMware mpt2sas driver was older than the one the Dell ISO included. I updated this driver and the issue was finally resolved, or so I thought… Several weeks later the issue came up yet again. At this point I was convinced it was a hardware issue, specifically a hard drive, but I was unable to prove it. It turns out the Dell server comes with a diagnostic utility that can be selected during boot. This utility provides a deep scan option, which thoroughly tests the hard drives as well as other components. After running for over a day (I have 4TB of storage in the server) it finally found issues with one of the hard drives! I have just replaced the drive (on 1/7) and completed the upgrade to vSphere 5.1. The server is back online and I have my fingers crossed that the issue is finally resolved. I do find it interesting that the issue did not appear until I upgraded from 4.1 to 5.0, but that might just be a coincidence.
My primary reason for writing this post is to highlight the steps necessary to troubleshoot server/OS issues. The following are key takeaways:

Always start by checking log messages – A majority of issues can be found and fixed based on log messages. Unfortunately, in my case the logs did not turn up anything pointing to the underlying issue.
Look at what changed – Whenever an issue is uncovered, it is important to understand what changed in the environment and whether the change may have contributed to the issue. It is always a good idea to have a rollback plan and execute it if new issues arise.
When dealing with server/OS issues, especially with hypervisors, check firmware versions – While I believe that firmware versions are less likely to cause issues today than a couple of years ago, it is always a good idea to check the release notes on newer firmware versions and see if they are applicable.
In terms of VMware, also check for updated vendor VIBs – Many vendors publish vendor-specific hypervisor releases with custom drivers. Also be sure to check the compatibility guides published by the OS provider and server vendor when upgrading the OS.
Check for hardware problems – When all else fails, perform hardware diagnostic checks. While the OS is typically able to detect hardware problems, and many servers can detect them as well, hardware issues can still go undetected. If you have tried everything else and are still experiencing issues, it is always a good idea to run a tool that thoroughly tests your hardware.

For those interested, here are some links I found while troubleshooting the issue that I thought were helpful:

http://blog.rebelit.net/?p=283
http://tinkertry.com/lsi92658iesxi5/
http://support.dell.com/support/edocs/software/eslvmwre/VS_5/index.htm


Steve Flanders
