Have you restarted your management services today? (Cont.)

In my last blog entry, I spoke about the importance of restarting management services when troubleshooting VMware ESX issues. One thing I have noticed is that if you SSH to an ESX host and restart the management services, you cannot cleanly exit the SSH session. To illustrate this point, SSH to a non-production ESX host and run the following commands:

[root@esx01] # service mgmt-vmware restart
Stopping VMware ESX Server Management services:
VMware ESX Server Host Agent Watchdog                  [  OK  ]
VMware ESX Server Host Agent                           [  OK  ]
Starting VMware ESX Server Management services:
VMware ESX Server Host Agent (background)              [  OK  ]
Availability report startup (background)               [  OK  ]
[root@esx01] # service vmware-vpxa restart
Stopping vmware-vpxa:                                  [  OK  ]
Starting vmware-vpxa:                                  [  OK  ]
[root@esx01] # exit
logout

You will notice that the management services restart successfully, but your terminal hangs when you try to exit. What causes this, and how can you fix it?

This problem is caused by restarting the mgmt-vmware service, because restarting it spawns background child processes. To see this, SSH to a non-production ESX host and run the following commands:

[root@esx01] # ps
  PID TTY          TIME CMD
25197 pts/1    00:00:00 bash
25234 pts/1    00:00:00 ps
[root@esx01] # service mgmt-vmware restart
Stopping VMware ESX Server Management services:
VMware ESX Server Host Agent Watchdog                  [  OK  ]
VMware ESX Server Host Agent                           [  OK  ]
Starting VMware ESX Server Management services:
VMware ESX Server Host Agent (background)              [  OK  ]
Availability report startup (background)               [  OK  ]
[root@esx01] # ps
  PID TTY          TIME CMD
25197 pts/1    00:00:00 bash
25320 pts/1    00:00:00 vmware-watchdog
25323 pts/1    00:00:00 logger
25500 pts/1    00:00:00 ps
[root@esx01] #

You will notice that after restarting the mgmt-vmware service, two new background processes appear: vmware-watchdog and logger. You cannot log out of the SSH session because, by default, SSH waits for child processes that are still attached to the session to finish before it exits. SSH behaves this way to avoid a known race condition that could otherwise result in data loss. The problem is explained in depth at: http://www.snailbook.com/faq/background-jobs.auto.html.
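
The waiting behavior described above can be reproduced without an ESX host. The following is only a local analogy, not the ESX case itself: shell command substitution, like the SSH server, keeps reading until every process holding the write end of its pipe has closed it. The function names are made up for this sketch, and it assumes a `date` that supports `+%s`.

```shell
#!/bin/sh
# Local analogue of the SSH hang. Command substitution keeps reading until
# every process holding the write end of its pipe has closed it.

# The background job inherits stdout (the pipe), so the reader waits for it.
elapsed_inherited() {
    start=$(date +%s)
    out=$(sleep 2 & echo "shell done")
    end=$(date +%s)
    echo $((end - start))
}

# Same job with its streams pointed at /dev/null: nothing keeps the pipe
# open after the subshell exits, so the reader returns right away.
elapsed_detached() {
    start=$(date +%s)
    out=$(sleep 2 </dev/null >/dev/null 2>&1 & echo "shell done")
    end=$(date +%s)
    echo $((end - start))
}

echo "inherited: $(elapsed_inherited)s, detached: $(elapsed_detached)s"
```

On a typical Linux shell the first figure should be about 2 and the second 0 or 1. The detached variant redirects all three standard streams, which is the general pattern the SSH FAQ describes.
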
To confirm this is actually causing the problem, repeat the previous test, but this time kill (i.e. kill -9) the two spawned processes prior to exiting:

[root@esx01] # ps
  PID TTY          TIME CMD
25197 pts/1    00:00:00 bash
25234 pts/1    00:00:00 ps
[root@esx01] # service mgmt-vmware restart
Stopping VMware ESX Server Management services:
VMware ESX Server Host Agent Watchdog                  [  OK  ]
VMware ESX Server Host Agent                           [  OK  ]
Starting VMware ESX Server Management services:
VMware ESX Server Host Agent (background)              [  OK  ]
Availability report startup (background)               [  OK  ]
[root@esx01] # ps
  PID TTY          TIME CMD
25197 pts/1    00:00:00 bash
25320 pts/1    00:00:00 vmware-watchdog
25323 pts/1    00:00:00 logger
25500 pts/1    00:00:00 ps
[root@esx01] # kill -9 25320
[root@esx01] # kill -9 25323
[root@esx01] # ps
  PID TTY          TIME CMD
25197 pts/1    00:00:00 bash
25500 pts/1    00:00:00 ps
[root@esx01] # exit
[user@test] $
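
Rather than reading the ps listing by eye before exiting, a small filter could flag anything left on the terminal. This is a minimal sketch: `leftover_procs` is a hypothetical helper name, and it assumes the default four-column ps layout shown above and a bash login shell.

```shell
#!/bin/sh
# List processes from default `ps` output that are not the login shell,
# ps itself, or the awk used by this filter. Assumes the four-column
# "PID TTY TIME CMD" layout; leftover_procs is a hypothetical helper name.
leftover_procs() {
    awk 'NR > 1 && $4 != "bash" && $4 != "ps" && $4 != "awk" { print $1, $4 }'
}

# Example using the listing captured after the restart.
leftover_procs <<'EOF'
  PID TTY          TIME CMD
25197 pts/1    00:00:00 bash
25320 pts/1    00:00:00 vmware-watchdog
25323 pts/1    00:00:00 logger
25500 pts/1    00:00:00 ps
EOF
```

On a live session the same filter would be fed from ps directly, as in `ps | leftover_procs`. Empty output suggests that exit should return cleanly.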

While the problem highlighted above may not be a concern for everyone, it is for me. I script everything, and being unable to exit an SSH session cleanly after restarting management services makes automating those restarts very difficult. Luckily, I have a solution.

To solve this problem, I looked into the /etc/init.d/mgmt-vmware script to see what was spawning the child processes. A simple search for the term "Starting", which appears in the output when restarting management services, pointed me in the right direction. From there, I noticed that the processes were started by the "vmware_start_hostd" function. Searching for this function in the script, I found that it called the following command: vmware_bg_exec "$hostdName" "$watchdog" -s hostd -u 60 -q 5 -c $hostdSupport "$hostd -u". Based on the SSH link above as well as the OpenSSH FAQ, I appended "</dev/null" to the end of the command so that it read: vmware_bg_exec "$hostdName" "$watchdog" -s hostd -u 60 -q 5 -c $hostdSupport "$hostd -u" </dev/null. With this modification made and saved to the script, I restarted the management services, verified that the background child processes no longer existed, and confirmed that the exit command exited cleanly:

[root@esx01] # ps
  PID TTY          TIME CMD
25197 pts/1    00:00:00 bash
25234 pts/1    00:00:00 ps
[root@esx01] # service mgmt-vmware restart
Stopping VMware ESX Server Management services:
VMware ESX Server Host Agent Watchdog                  [  OK  ]
VMware ESX Server Host Agent                           [  OK  ]
Starting VMware ESX Server Management services:
VMware ESX Server Host Agent (background)              [  OK  ]
Availability report startup (background)               [  OK  ]
[root@esx01] # ps
  PID TTY          TIME CMD
25197 pts/1    00:00:00 bash
25500 pts/1    00:00:00 ps
[root@esx01] # exit
[user@test] $
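
Since I script everything, the one-line edit to /etc/init.d/mgmt-vmware described above could be scripted as well. What follows is a minimal sketch, not a tested ESX procedure: `add_devnull_redirect` is a hypothetical helper name, and it assumes the vmware_bg_exec line appears in the script exactly as quoted earlier. It prints the modified script to standard output and never changes the input file.

```shell
#!/bin/sh
# Print a copy of an init script with " </dev/null" appended to the
# vmware_bg_exec line that starts hostd, unless it is already there.
# add_devnull_redirect is a hypothetical helper name; the input file
# given as $1 is only read, never modified.
add_devnull_redirect() {
    awk '
        /vmware_bg_exec "\$hostdName"/ && !/<\/dev\/null[ \t]*$/ {
            sub(/[ \t]*$/, " </dev/null")
        }
        { print }
    ' "$1"
}
```

A cautious way to use it is to run it against a backup copy of the script, compare the result with diff, and only then install the new version by hand. Running it on an already modified script leaves that line unchanged.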

I have seen this issue reported from time to time on the VMware communities with no good solution, but many responses point people to the SSH sites I referenced above. What I find interesting is that not all people appear to experience this problem. I can only hope that VMware provides a permanent fix to this issue or publishes a KB article for those experiencing it.

© 2010, Steve Flanders. All rights reserved.
