For those of you who have never had the pleasure of patching ESX prior to VMware Update Manager (VUM), be thankful. Before VUM, patching ESX hosts was repetitive and extremely error-prone because it was completely manual. Today the process is fully automated, but the automation comes with its own share of issues. In this entry, I will focus on a couple of VUM issues I have experienced and the troubleshooting steps I took to resolve them.
Entering Maintenance Mode
The first step in remediating an ESX host is to put the host into maintenance mode. On occasion, I have seen ‘Operation Timed Out’ errors when VUM performs this step, and manually running the ‘Enter Maintenance Mode’ task produces the same error. Oddly enough, manually migrating each powered-on VM to another ESX host succeeds, and once the migrations are complete the ‘Enter Maintenance Mode’ task completes successfully.
VMware Update Manager had a failure
In some cases, whether I manually put the ESX host in maintenance mode before running the remediate task or let the remediate task handle the maintenance mode request, the ‘Remediate Entity’ task fails with the error ‘VMware Update Manager had a failure’.
Looking at the events of the ESX host, I have seen entries such as:
- Failed to scan <host> for updates
- Remediation failed for <host>: SingleHostRemediate: Unknown Host Error:
- Failed to install update <patch>, <patch>, <patch>…
To solve the problem, I tried the following troubleshooting steps:
- Checking /var/log/vmware/esxupdate.log – everything looks good
- Checking the overall health of the ESX host via /var/log/vmkwarning, /var/log/vmkernel, and /var/log/vmksummary – everything looks good
- Rebooting the ESX host – does not resolve the issue
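The log checks in the list above can be scripted so they are repeatable across hosts. What follows is only a rough sketch: the file paths are the ones named above, but the function name scan_logs and the search pattern (error, fail, timed out) are my own assumptions about what is worth flagging, not something VUM or esxupdate defines.

```shell
#!/bin/sh
# Sketch: print the most recent suspicious lines from each log file given.
# The pattern below is an assumption; adjust it to what you are hunting for.
scan_logs() {
    for f in "$@"; do
        if [ ! -r "$f" ]; then
            echo "== $f (not readable, skipped)"
            continue
        fi
        echo "== $f"
        # -n prefixes each match with its line number in the log
        grep -i -n -E 'error|fail|timed out' "$f" | tail -n 20
    done
}

# Example call on the ESX service console (paths taken from the list above):
# scan_logs /var/log/vmware/esxupdate.log /var/log/vmkwarning \
#           /var/log/vmkernel /var/log/vmksummary
```

A file that prints only its "==" header has no matching lines, which corresponds to the "everything looks good" result noted above.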
In both scenarios (manually putting the ESX host in maintenance mode before running the remediate task, or letting the remediate task handle the maintenance mode request), the task fails while the ESX host is in maintenance mode. As a test, I tried to manually take the host out of maintenance mode and received an ‘Operation Timed Out’ error, so I decided to restart the management services on the host. After doing so, the host successfully exited maintenance mode. At this point, I attempted the remediation task again and received a new error message:
Patch installation failed. ERROR: Another esxupdate installation (PID=3071) is running. Please wait for that installation to finish first. PID:3071
Based on this error message, it appeared that the installation was still underway on the host but had probably timed out from vCenter Server's point of view. If the process does not complete after some time, say a couple of hours or in the worst case a day, it is likely hung and should be manually killed using kill -9 <PID>. After killing the process, try to remediate the host again and the task should complete successfully.
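The two manual interventions described above (restarting the management services and killing a hung esxupdate) can be sketched as a few shell helpers for the ESX service console. Treat this as a sketch under assumptions: the init script names mgmt-vmware and vmware-vpxa are what I would expect on a classic ESX host and should be verified on yours, and the function names and the RUN dry-run variable are hypothetical helpers of my own.

```shell
#!/bin/sh
# Sketch only. RUN is a hypothetical dry-run hook: set RUN=echo to print
# the commands instead of executing them.

# Restart the host management agents.
# Assumption: classic ESX init script names; verify them on your host.
restart_mgmt_services() {
    ${RUN:-} service mgmt-vmware restart
    ${RUN:-} service vmware-vpxa restart
}

# Pull the PID out of an "Another esxupdate installation (PID=...)" message.
extract_pid() {
    printf '%s\n' "$1" | sed -n 's/.*(PID=\([0-9][0-9]*\)).*/\1/p'
}

# Succeed if a process with the given PID exists (signal 0 sends nothing).
is_running() {
    kill -0 "$1" 2>/dev/null
}

# Kill the esxupdate process named in the error message, if it still exists.
kill_hung_update() {
    pid=$(extract_pid "$1")
    if [ -z "$pid" ]; then
        echo "no PID found in message" >&2
        return 1
    fi
    if is_running "$pid"; then
        ${RUN:-} kill -9 "$pid"
    else
        echo "PID $pid is no longer running"
    fi
}
```

For example, RUN=echo kill_hung_update "$msg" prints the kill command it would run without sending the signal, which is a reasonable first step given that kill -9 gives the process no chance to clean up.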
One final thing to note about these issues is that I have seen them mostly when remediating an ESX host over a WAN, though on occasion I have experienced the same problems over a LAN. Hopefully this information is helpful!
© 2010, Steve Flanders. All rights reserved.