While I have seen people discuss this error message and its solution, I figured it would be a good idea to discuss it in terms of specific configurations, such as Cisco hardware and VMware virtualization. I feel this is important both for understanding the implications of the error message and for showing how much BIOS configurations matter.
First, the issue: Cisco UCS B230-M2 blades (dual 10-core = 20 ‘processors’) running ESXi were throwing processor halted log messages. While this in itself may or may not be an issue, even under light load from VMware clone operations the ESXi hosts were disconnecting from vCenter Server (vCS) and becoming unresponsive for several minutes. Further digging uncovered that whenever an ESXi host disconnected from vCS, the logs showed all processors on the host halting at exactly the same time.
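The "all processors halted at the same time" check can be sketched in a few lines of Python. This is only an illustration: it assumes the halt messages have already been pulled out of the logs as (timestamp, processor) pairs, since the exact log format is not reproduced here, and the function name `simultaneous_halts` and the sample timestamps are made up for the example.

```python
from collections import defaultdict


def simultaneous_halts(events, cpu_count):
    """Return the timestamps at which every processor on the host halted.

    events    -- iterable of (timestamp, processor_id) pairs, assumed to be
                 extracted from the host logs beforehand (hypothetical input)
    cpu_count -- number of logical processors on the host, e.g. 20 for a
                 dual 10-core B230-M2
    """
    cpus_by_time = defaultdict(set)
    for timestamp, cpu in events:
        cpus_by_time[timestamp].add(cpu)
    # Keep only the moments where the set of halted processors is complete.
    return sorted(t for t, cpus in cpus_by_time.items() if len(cpus) == cpu_count)


if __name__ == "__main__":
    # Made-up sample: all 20 processors halt at one moment, two at another.
    sample = [("10:00:01", cpu) for cpu in range(20)]
    sample += [("10:00:05", 0), ("10:00:05", 3)]
    print(simultaneous_halts(sample, cpu_count=20))
```

Timestamps returned by a check like this are the ones worth lining up against the vCS disconnect times.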
I had another environment with Cisco UCS B200-M2 blades (dual 6-core = 12 ‘processors’) running ESXi, and it experienced neither the processor halted log messages nor the disconnected hosts. Based on this, I decided to look into the things that I could control, mainly UCS service profiles, and began comparing the B200-M2 BIOS settings to the B230-M2 BIOS settings. I quickly discovered that the B230-M2 BIOS settings had options for C state and C1E state that defaulted to on, while the B200-M2 did not have these options at all. For those interested, C state options are used for energy savings. For more information check out: http://techreport.com/articles.x/7998/2. In addition, I knew that VMware had mentioned that C state options could cause problems (source: http://www.vmware.com/pdf/Perf_Best_Practices_vSphere4.1.pdf). As such, I decided to disable these options and reboot the blades so the changes could take effect. Upon retesting, the log messages and the disconnects stopped!
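The side-by-side BIOS comparison described above boils down to a dictionary diff. The sketch below assumes the settings of each blade model have been written down as plain Python dictionaries. The option names and values are illustrative placeholders, not an export from UCS Manager, and `diff_bios_settings` is a hypothetical helper.

```python
MISSING = "<not available>"


def diff_bios_settings(first, second):
    """Return {option: (value_in_first, value_in_second)} for every option
    that differs between the two profiles or exists in only one of them."""
    diffs = {}
    for option in sorted(set(first) | set(second)):
        a = first.get(option, MISSING)
        b = second.get(option, MISSING)
        if a != b:
            diffs[option] = (a, b)
    return diffs


if __name__ == "__main__":
    # Placeholder settings; the names are assumptions made for the example.
    b200_m2 = {"Serial Port A": "enabled"}
    b230_m2 = {"Serial Port A": "enabled", "C State": "enabled", "C1E": "enabled"}
    for option, (old, new) in diff_bios_settings(b200_m2, b230_m2).items():
        print(f"{option}: B200-M2={old}, B230-M2={new}")
```

Options that exist on only one model show up with the placeholder value, which is exactly the kind of difference that pointed to the C state and C1E settings here.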
Now, I was lucky in a couple of ways:
- I only had a small number of B230-M2 blades and they were not in production, so making and rolling out changes to the BIOS was not an issue. This, however, raised some important questions for me, including:
  - Why was there no mention of this from Cisco? I later learned that this is an Intel-enforced default.
  - Why are there no application-specific BIOS defaults (e.g. for VMware)? At the very least, a best practice guide would be extremely beneficial. Another example is that serial ports are enabled by default in the BIOS, while VMware’s best practice is to disable them if they are not needed, which in most cases they are not on blades.
- I happen to be running a UCS version >2.0. I later tested this on UCS 1.4 and learned that while the same error messages and disconnects are experienced, there is no option in the BIOS service profile to configure the C and C1E states. This means that in order to disable them you must do it manually via a console connection to each host (i.e. there is no automated way)!
As you can see, BIOS configurations are extremely important. A while back I began work on a BIOS Optimization Guide, which I called vBOG, and I will update it to reflect this important discovery.
NOTE: Your mileage may vary. Be sure to test any changes to your environment. I am not responsible for any issues 🙂
UPDATE: Be sure to check out CSCtz16082 in Cisco’s Bug Toolkit, as there is an open issue that results in the C1E setting not being disabled unless you use the default or custom default BIOS options.
© 2012, Steve Flanders. All rights reserved.