Over the last two weeks I have been hit by the same UCS bug twice, though by different means, and I would like to educate others about it. The issue initially came up after running a ‘show tech’ command on a UCS Fabric Interconnect (FI). Shortly after the process started, my session to the FI dropped. Since I have experienced random disconnects from an FI in the past, I tried to reconnect. To my surprise, the FI was unresponsive. Not knowing what was going on, I tried the second FI, and it was not responding either. A ping check confirmed my fear: both FIs were down.
For those who have never experienced a dual fabric reboot on an active/production environment before, the ten minutes that follow will be the longest of your life (even if you do have access to the console port – locally or remotely). After about ten minutes the FIs started to respond again. As if a dual fabric reboot was not enough, the problem did not end there. About 5-10 minutes after the FIs came back online they went down again! This cycle continued until manual intervention stopped it.
So what was the problem; what was the impact; how can you fix it; and how can you prevent it?
The character limit for vsh commands in certain NXOS versions is 256. A long ‘trunk allowed vlan’ list can exceed this limit, in which case the line is not treated as a valid command. If port profiles are in use, the character limit issue results in a system crash when ‘show’ commands are run. As it turns out, this issue is not UCS specific; it is more generally an NXOS problem.
NOTE: If you use consecutive VLANs this may not be an issue for you. For example compare the following:
- switchport trunk allowed vlan 2, 4, 6, 8, 10, 12, 14, 16, 18, 20 <- common configuration
- switchport trunk allowed vlan 2-11 <- same number of VLANs, but 30 fewer characters as written, thanks to the consecutive VLAN range
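To make the comparison concrete, here is a minimal Python sketch that collapses a VLAN list into range notation and prints the resulting command length. The function name `compress_vlans` is my own, and the comma-without-spaces formatting is an assumption about how the list is rendered; this is only meant to illustrate why consecutive VLANs are so much cheaper in characters.

```python
def compress_vlans(vlan_ids):
    """Collapse VLAN IDs into sorted range notation, e.g. [2, 3, 4, 7] -> '2-4,7'."""
    ids = sorted(set(vlan_ids))
    parts = []
    i = 0
    while i < len(ids):
        start = end = ids[i]
        # Extend the current run while the next ID is consecutive.
        while i + 1 < len(ids) and ids[i + 1] == end + 1:
            i += 1
            end = ids[i]
        parts.append(str(start) if start == end else f"{start}-{end}")
        i += 1
    return ",".join(parts)


PREFIX = "switchport trunk allowed vlan "

even_vlans = list(range(2, 21, 2))      # 2, 4, ..., 20 (non-consecutive)
consecutive_vlans = list(range(2, 12))  # 2, 3, ..., 11 (consecutive)

for vlans in (even_vlans, consecutive_vlans):
    command = PREFIX + compress_vlans(vlans)
    # Print the total command length next to the command itself.
    print(len(command), command)
```

Because the input is a plain Python list, this can be run against a VLAN plan taken from documentation rather than from the device itself.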
Unfortunately, there is no easy way to check if you are impacted by this bug. You cannot issue a ‘show version’ command to check whether the version of code you are running is impacted, because the command itself may trigger the bug. For the same reason, you cannot issue a ‘show run’ command to check the number of VLANs you have trunked to a switchport.
WARNING: Port profiles can also cause fabric reboots during firmware upgrades (CSCtu14851).
The impact really depends on your configuration. In a worst-case scenario, if you are running a boot-from-SAN environment, it is highly likely that your entire environment is down. If you are running a virtual environment that uses non-local (FC, FCoE, iSCSI, NAS) storage, it is highly likely that at least some portion of your VMs will be unavailable after the issue is resolved.
|NEXUS||FOUND IN||FIXED IN|
To fix the issue, upgrade to a version that addresses the character limit if one exists; otherwise, do not use port profiles. If you are unsure whether you are vulnerable to the issue, do not run any ‘show’ commands prior to deleting port profiles (and do not run ‘show’ commands to look for existing port profiles).
In the case of UCS, if you have port profiles, delete them. Doing so will prevent the issue in the first place. In addition, deleting port profiles is required to stop the continuous reboots should you experience the bug. You can check for port profiles by looking under the VM tab in UCSM. The problem with this workaround in the case of UCS is that there are bugs that result in port profiles being created without user knowledge. For example, whenever a vNIC template is created, the ‘vm’ type is selected by default. If the ‘vm’ type is not deselected, a port profile is created (CSCtx95937). Even if you delete existing port profiles, they are recreated if you modify the VLANs connected to a vNIC template (CSCty44110). The latter bug is one I personally uncovered this past week. After it was confirmed by Cisco TAC, a new bug was created for it. This means that prior to running any ‘show’ commands, one should verify that port profiles for vNIC templates consisting of high numbers of non-consecutive VLANs are deleted.
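One way to decide which vNIC templates deserve attention first, without touching the device, is to estimate the command length offline. The Python sketch below is only an illustration under several assumptions: the template names and VLAN IDs are hypothetical and would be filled in by hand from design documents, the 256 figure is the limit mentioned earlier, and I assume the limit applies to the whole command line with every VLAN listed individually. Since consecutive VLANs would normally collapse into ranges, this is a conservative upper bound.

```python
VSH_CHAR_LIMIT = 256  # limit cited in this post; applies only to certain NXOS versions
PREFIX = "switchport trunk allowed vlan "


def allowed_vlan_command(vlan_ids):
    """Build the command with every VLAN listed individually (no ranges)."""
    return PREFIX + ",".join(str(v) for v in sorted(set(vlan_ids)))


def templates_at_risk(templates, limit=VSH_CHAR_LIMIT):
    """Return names of templates whose worst-case command is longer than `limit`."""
    return [name for name, vlans in templates.items()
            if len(allowed_vlan_command(vlans)) > limit]


# Hypothetical inventory, filled in by hand (not collected with 'show' commands).
templates = {
    "vnic-a": range(100, 110),     # 10 VLANs
    "vnic-b": range(100, 400, 2),  # 150 non-consecutive VLANs
}
print(templates_at_risk(templates))
```

Any template the sketch flags is a candidate for having its port profile checked under the VM tab and deleted before anyone runs a ‘show’ command.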
Now I understand that the actual cause of the bug is the combination of a character limit and the use of port profiles. However, it is the trigger, a ‘show’ command, that I feel is most relevant to highlight. Unfortunately, this is not the first time I have seen, heard of, or personally experienced a ‘show’ command causing a fabric outage (see: https://sflanders.net/network/show-vlan and expect another post soon). This is a big concern for me, working in an operations team, because ‘show’ commands are unprivileged. Commands that can alter a device or otherwise cause undesired results require an additional level of access and/or authentication. ‘Show’ commands are read-only and as such should not fall into either category. This means that any user who can authenticate with the device is able to run ‘show’ commands. (For the system administrators reading this post, this is equivalent to running the ‘df’ command on a Linux system.)
For those in operations, I cannot stress enough the importance of a staging/pre-production environment that reflects your production environment as accurately as possible to prevent issues like the aforementioned bug. Not only are hardware and configuration important, but so are any tools that you may be leveraging for access, change, and configuration management, as well as monitoring. For example, I have run RANCID, a configuration management tool, in many environments. The tool leverages ‘show’ commands to collect information about the devices it monitors. If I did not deploy this tool in my staging/pre-production environment, I could hit the bug mentioned in this post in production without ever seeing it in staging/pre-production.
© 2012, Steve Flanders. All rights reserved.