During a recent change ticket in a non-production environment, I was called into an All Paths Down (APD) situation on some ESXi hosts. For those who do not know, an APD occurs when an ESXi host loses all paths to its shared storage. The ESXi hosts impacted in my case were hosting several virtual vCenter Server (vCS) instances, which a group of developers used for SOAP calls to provision, modify, and delete VMs. Once all the vCS instances had been recovered and vSphere client sessions confirmed they were operational, the instances were turned back over to development. The developers immediately began complaining that SOAP commands to the vCS instances were failing.
What was going on?
After digging into the issue, I confirmed that the vCS and ESXi logs generally looked clean, operations via the vSphere client were working without issue, and commands run from PowerCLI were also successful. Digging deeper into the developer logs showed some SSL errors between the vCenter Server and the ESXi hosts. This is something I had seen in the past, and the resolution is typically to remove the ESXi hosts from the vCS and then re-add them. Performing these operations did clear up the SSL errors; however, it did not fix the SOAP issue.
Watching as the developer ran the code that triggered the issue, I could see VMs being powered on and modified, but then quickly being deleted, after which operations failed. The exact error seen was:
INFO -- : Could not upload file: https:///folder//env.iso?dcPath=&dsName=, status code: 500
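For context, the failing request targets vCenter's HTTP datastore file access path (`/folder/...` with `dcPath` and `dsName` query parameters). The log line above has the server, folder, datacenter, and datastore values blanked out, so the sketch below uses purely hypothetical values to show the general shape of such a URL. The function name and all example values are assumptions for illustration, not taken from the developers' code.

```python
from urllib.parse import quote, urlencode


def datastore_file_url(server: str, file_path: str, datacenter: str, datastore: str) -> str:
    """Build a datastore file URL of the form
    https://<server>/folder/<path>?dcPath=<datacenter>&dsName=<datastore>.

    All arguments are caller-supplied; nothing here talks to a server.
    """
    # Percent-encode query values (spaces become %20 rather than '+').
    query = urlencode({"dcPath": datacenter, "dsName": datastore}, quote_via=quote)
    # Keep '/' separators in the path, encode everything else that needs it.
    path = quote(file_path.lstrip("/"))
    return f"https://{server}/folder/{path}?{query}"


# Hypothetical values, only to show the resulting shape:
example = datastore_file_url("vcs.example.com", "vm01/env.iso", "DC1", "datastore1")
```

A status code of 500 on an upload to a URL like this is a server-side failure, which is consistent with the problem sitting on the vCS or host agent side rather than in the client code.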
Running out of options, I thought this might be a vCS DB or VPXA issue. Typically with DB issues the vCS service will not start, but this is not always the case. To address the issue, I decided to remove the ESXi hosts from the cluster again, but this time run the AAM and VPXA uninstallers on the hosts (the hosts were running 4.1). These operations worked without issue, but when I tried to add the hosts back into the vCS I received the following error:
Cannot install the vCenter agent service. Cannot upload agent
A quick KB article search turned up a couple of things:
- KB#1031905 – I confirmed port 902 was open using telnet on the vCS.
- KB#1026917 – Slightly different error message, but the same hostd log messages were seen; however, I confirmed the root filesystem was not full.
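The port 902 check above was done with telnet. A minimal Python sketch of an equivalent TCP connectivity test follows, for cases where telnet is not installed on the vCS. The function name and the example host name are hypothetical, and like telnet it only tests TCP reachability.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example call (hypothetical host name; substitute the ESXi host being added):
# tcp_port_open("esxi01.example.com", 902)
```

A `True` result says only that something accepted the connection on that port; it does not prove the agent behind it is healthy.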
At a colleague's suggestion, I added the ESXi hosts to a known good vCS instance and, wouldn’t you know, the operation was successful. With VPXA installed, I was able to add the host back to the vCS having issues. With the ESXi host back on the correct vCS, I noticed that HA failed to configure, and manually reconfiguring it resulted in the same error as when VPXA was not installed. In addition, if I added the host to a good vCS and then removed it from the good vCS, I was unable to add the host to the bad vCS because VPXA was cleanly removed when the host was removed from the good vCS.
After much trial and error, I came up with the following workaround to fix the SOAP errors:
- Open the vCS with the vSphere client
- Move VMs off local storage if applicable
- Disable HA on cluster
- Put first host in maintenance mode
- Remove host from vCS
- Enable SSH to ESXi host in UCS
- SSH to ESXi and run:
  ~ # cd /opt/vmware/uninstallers/
  /opt/vmware/uninstallers # ./VMware-vpxa-uninstall.sh
  /opt/vmware/uninstallers # userdel vpxuser
  /opt/vmware/uninstallers # ./VMware-aam-ha-uninstall.sh
### KEY STEPS – READ CAREFULLY ###
- Add ESXi host to a known good vCS # DO NOT REMOVE ONCE ADDED
- Add ESXi host back to vCS with issue # YES HOST IS STILL CONNECTED TO GOOD VCS
- Remove ESXi host from good vCS # HOST IS NOW DISCONNECTED SO SAFE TO REMOVE
### END KEY STEPS ###
- Exit maintenance mode
- Repeat steps 4-11 (from entering maintenance mode through exiting maintenance mode) on the second host
- Stop vCS service
- Stop DB service
- Reboot DB VM if separate VM – wait for login prompt
- Reboot vCS VM – wait for login prompt
- Open vSphere client to vCS and confirm connectivity
- Enable HA on cluster and confirm success # DO NOT FORGET THIS STEP
- Try SOAP commands again and they should now work
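As a rough illustration of why the order of the key steps matters, here is a toy Python model of the behavior described in this post. It is only a sketch of the observations above (a vCS that cannot push the agent, and an agent that is uninstalled when a still-connected host is removed). The class and attribute names are invented for illustration and are not vSphere API objects.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(eq=False)
class Host:
    """Toy stand-in for an ESXi host (hypothetical, not a vSphere object)."""
    vpxa_installed: bool = False
    connected_to: Optional[str] = None  # name of the vCS currently managing it


class VCenter:
    """Toy stand-in for a vCS instance."""

    def __init__(self, name: str, can_install_agent: bool) -> None:
        self.name = name
        self.can_install_agent = can_install_agent
        self.inventory: List[Host] = []

    def add_host(self, host: Host) -> None:
        if not host.vpxa_installed:
            if not self.can_install_agent:
                # Mirrors the error seen on the problem vCS.
                raise RuntimeError("Cannot install the vCenter agent service. Cannot upload agent")
            host.vpxa_installed = True
        # The newest vCS takes over; any previous vCS now sees the host as disconnected.
        host.connected_to = self.name
        self.inventory.append(host)

    def remove_host(self, host: Host) -> None:
        self.inventory.remove(host)
        # Assumption: the agent is only cleaned up if this vCS is still connected.
        if host.connected_to == self.name:
            host.vpxa_installed = False
            host.connected_to = None


def apply_key_steps(host: Host, good: VCenter, bad: VCenter) -> None:
    good.add_host(host)     # good vCS installs VPXA
    bad.add_host(host)      # bad vCS reuses the installed agent
    good.remove_host(host)  # host is disconnected from good vCS, so VPXA stays
```

In this model, calling `good.remove_host(host)` before `bad.add_host(host)` leaves the host without VPXA, and the bad vCS then fails with the agent error, which matches what was observed.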
I am not sure why this worked, but again I believe it to be a DB or VPXA issue. Over the next couple of weeks, attempts will be made to reproduce the issue so that an RCA can be completed. Hope this helps someone!
© 2012, Steve Flanders. All rights reserved.