Cisco MDS CPU utilization of 100 percent seen in certain situations

For those of you who read the release notes for the MDS 9000 NX-OS 5.0(7) update, you may have noticed that the title of this post is one of the caveats that was resolved. I figured I would elaborate a bit more on this issue, as I was directly involved in discovering it and I am directly involved in ensuring it gets fully resolved.

From the release notes (http://www.cisco.com/en/US/products/ps5989/prod_release_notes_list.html):

CSCts05471

Symptom: CPU utilization of 100 percent is seen in these situations:

When a switch reboots, CLI commands execute very slowly. The zone process shows almost 100 percent utilization for 5 to 10 minutes after the reboot.

When you enter the shut and no shut commands on the Clariion ports, the zone process increases to 100 percent. After 5 to 10 minutes, the 100 percent utilization clears.

Conditions: This symptom might be seen when an EMC Clariion target continuously sends certain FCNS queries that cause the FCNS and the zone process to use the CPU for 15 minutes.

Workaround: This issue is resolved.

Backstory

In early 2012, I had been involved in an architecture project within my team for several months. This project involved new storage arrays and the need for high availability as well as backups. After several meetings, it was decided to leverage EMC VNXs with EMC RecoverPoint to meet these requirements.

The new storage devices were tied back to Cisco network equipment and Cisco UCS servers. From a compute perspective, the environment was around 200 servers. These servers tied back to a pair of Cisco MDS switches for block storage. Once the zoning was put into place for all the equipment, it was quickly discovered that several host initiators were not logging into the EMC storage arrays. After verifying the zones were correct, several other workarounds were tried, including bouncing the servers as well as the ports on the VNX. With every reboot, more and more initiators were discovered.

Because this development was concerning, a reboot of the MDS switch was performed once all host initiators were registered. Once the switch was fully back online, it was quickly discovered that several host initiators that had previously been logged in no longer were. This resulted in support cases being opened and, eventually, this MDS fix being released.

How bad is it?

According to the MDS release notes, the high CPU utilization clears in 5 to 10 minutes. In addition, during the time of high CPU utilization CLI commands execute very slowly. The questions this would raise for me are:

What causes the high CPU utilization in the first place?

To answer this question, you would need to refer to the EMC Primus article (emc274449):

Environment: Product: Cisco MDS-9148 Switch
Environment: Product: VNX for Block
Problem: Not all host initiators log back in to a Cisco switch correctly after switch reboot.
Problem:
In very large SAN configurations (such as 150 host initiators per storage processor [SP] port and 64 RecoverPoint initiators), not all of the host initiators log back in to the VNX after a Cisco MDS-9148 switch is rebooted.
Change:
Cisco MDS-9148 switch was rebooted.
Root Cause:
In very large SAN configurations, (such as 150 host initiators per SP port and 64 RecoverPoint initiators), not all of the host initiators log back in to the VNX after a Cisco MDS-9148 switch is rebooted. VNX fabric discovery queries slow down the switch CPU to the point where its name server cannot answer all requests.

The VNX ports change from target only to target and initiator when RecoverPoint Splitter, MirrorView or SAN Copy are installed on the array. When the VNX becomes an initiator, it scans the SAN sending extensive name server inquiries to the Cisco switch. In large SAN environments, you can have so many initiators logged in that the Cisco switch cannot handle the volume of commands from the VNX and times out.
Fix: EMC Engineering has identified several efficiencies in fabric discovery for large SANs. Addressing these efficiencies will require complex architectural changes and will require significant testing. They will not be available until a future software release.
A short term workaround to avoid encountering this issue entails using directly connected MirrorView for replication instead.

RecoverPoint may still be used safely in Cisco SAN configurations with fewer than 100 host initiators.

In summary, a large number of host initiators attached to an EMC storage array combined with an EMC replication technology can result in a large number of name server requests being sent from the replication technology to the SAN switch. Depending on how the SAN switch responds to these requests, it is possible to saturate the switch's resources, as seen on Cisco MDS switches.

What causes the high CPU utilization to clear in 5 to 10 minutes?

The name server requests that EMC replication products send to a SAN switch must be answered within a predefined timeout value specified in EMC code. If the requests are not satisfied within the timeout, a retry is triggered and the process starts again. After a predefined number of retries, the operation is cancelled. Once this flood of name server requests (effectively a DDoS against the switch) ceases, the switch CPU utilization returns to normal.
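The timeout-and-retry behavior described above can be sketched in a few lines of Python. This is only an illustration of the general pattern, not EMC's code: the names RetryPolicy and send_with_retries, and the timeout and retry values used below, are hypothetical, since the actual values are not published in the sources quoted here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class RetryPolicy:
    timeout_s: float   # hypothetical per-request timeout
    max_retries: int   # hypothetical number of retries after the first attempt


def send_with_retries(response_time_s: float, policy: RetryPolicy) -> Tuple[bool, int, float]:
    """Return (answered, attempts, elapsed_seconds) for one name server query.

    response_time_s models how long the switch takes to answer. Every
    attempt that exceeds the timeout is abandoned and retried until the
    retry limit is reached, at which point the operation is cancelled.
    """
    attempts = 0
    elapsed = 0.0
    while attempts <= policy.max_retries:
        attempts += 1
        if response_time_s <= policy.timeout_s:
            return True, attempts, elapsed + response_time_s
        elapsed += policy.timeout_s  # waited the full timeout, then gave up
    return False, attempts, elapsed
```

With a policy of RetryPolicy(timeout_s=2.0, max_retries=3), a query the switch answers in 0.5 seconds succeeds on the first attempt, while a query that would take 5 seconds is attempted four times and cancelled after 8 seconds of waiting. The point is that a slow switch does not see one request per query; it sees every retry as well, until the initiator finally gives up.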

Are there any consequences of the high CPU utilization clearing in 5 to 10 minutes?

The primary consequence is that not all of the host initiators log back in to the storage array. Any initiator that fails to log back in results in a dead storage path and reduced storage path redundancy for that host. In the worst case, assuming path redundancy was already reduced (e.g. due to a reboot of an MDS switch for an upgrade), hosts could experience a complete storage outage as a result of all storage paths being dead.
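To make the redundancy argument concrete, here is a minimal sketch that sorts hosts by how many of their expected fabric paths are still alive. The function name classify_hosts, the host names, and the fabric labels "A" and "B" are all hypothetical; in practice the login state would come from the array or the switch name server rather than a hand-built dictionary.

```python
from typing import Dict, List, Set


def classify_hosts(live_paths: Dict[str, Set[str]],
                   expected_fabrics: Set[str]) -> Dict[str, List[str]]:
    """Group hosts by how many of their expected fabric paths are alive."""
    result: Dict[str, List[str]] = {"healthy": [], "degraded": [], "outage": []}
    for host in sorted(live_paths):
        alive = live_paths[host] & expected_fabrics
        if alive == expected_fabrics:
            result["healthy"].append(host)    # logged in on every fabric
        elif alive:
            result["degraded"].append(host)   # at least one dead path
        else:
            result["outage"].append(host)     # no live paths at all
    return result
```

For example, classify_hosts({"esx01": {"A", "B"}, "esx02": {"A"}, "esx03": set()}, {"A", "B"}) places esx01 under healthy, esx02 under degraded, and esx03 under outage. A host in the degraded group is one switch reboot away from the outage group, which is exactly the worst case described above.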

What are the fixes?

EMC RecoverPoint

The real culprit behind the massive number of name server requests and initiator logins in my case was the EMC RecoverPoint product. In order to fix the issue, EMC RecoverPoint was removed from the environment. To address the redundancy requirements of the project, other EMC solutions were used in moderation (e.g. MirrorView, SnapView) as well as third-party tools (e.g. Rsync). EMC is currently working on several modifications to the RecoverPoint product, including addressing this issue.

EMC CLARiiON/VNX

EMC stated that the underlying code responsible for name server requests and the timeout value for initiator logins had been in place for a long time and to change it would require significant work. In addition, it is believed other SAN switches would not experience the issue that the Cisco MDS experiences (see below for more details).

Cisco MDS

What makes the Cisco MDS switch different from other SAN switches? Most SAN switches process operations in parallel. Prior to MDS NX-OS code 5.0(7), all MDS operations were processed in serial with no throttling method in place. As should be expected, operations that are processed in serial take longer than operations that are processed in parallel, which exacerbates the timeout issue experienced on the EMC side.
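A rough back-of-the-envelope model shows why serial processing interacts so badly with a fixed timeout. The sketch below assumes that all requests arrive at once and that each takes the same time to service; the function name timed_out_requests and every number used with it are hypothetical and are not measurements from an MDS.

```python
def timed_out_requests(n_requests: int, service_time_ms: int,
                       timeout_ms: int, workers: int = 1) -> int:
    """Count requests whose completion time exceeds the timeout.

    All requests are assumed to arrive at once. With `workers` requests
    serviced at the same time, the i-th request (1-based) finishes after
    ceil(i / workers) service intervals.
    """
    timed_out = 0
    for i in range(1, n_requests + 1):
        intervals = (i + workers - 1) // workers  # integer ceiling division
        if intervals * service_time_ms > timeout_ms:
            timed_out += 1
    return timed_out
```

With 200 requests, a 50 ms service time, and a 5000 ms timeout, timed_out_requests(200, 50, 5000, workers=1) returns 100: half of the requests miss the timeout when processed one at a time. With workers=4 the same call returns 0. In the serial case every timed-out request is retried, which adds to the queue the switch is already struggling with.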

Due to the issue of host initiators not logging in and the underlying processes responsible for it, an enhancement request was made to change the MDS to parallel processing. Unfortunately, this change was not introduced in 5.0(7). Instead, a throttling mechanism was added to prevent the switch from becoming saturated. This fix does not address the issue of handling name server requests in a more timely manner.
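The release notes do not describe how the throttling in 5.0(7) works, so the following is not Cisco's implementation. It is a generic token bucket, one common way to throttle requests, shown only to illustrate what throttling does and does not buy you. The class name TokenBucket and its rate and capacity parameters are assumptions made for this example.

```python
class TokenBucket:
    """Admit at most `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.last = 0.0         # timestamp of the previous call, in seconds

    def allow(self, now: float) -> bool:
        # Refill tokens for the time elapsed since the last call.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True   # request is admitted
        return False      # request is rejected or deferred
```

With TokenBucket(rate=10, capacity=5), a burst of 8 requests at time 0 admits the first 5 and turns away the remaining 3; half a second later the bucket has refilled and requests are admitted again. A throttle like this protects the CPU from saturation, but the requests that were turned away still have to be retried by the sender, which is why throttling alone does not make name server responses any faster.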

Final Thoughts

No product is perfect and no amount of testing can ensure that all cases work as expected. The issue described above would only impact a small percentage of customers (mainly large ISPs and CSPs). The host initiator issue discovered has already resulted in one significant change in a vendor's product, and more changes are on the way. The backstory laid out above is a great example of why testing combinations of hardware is always important, especially for environments that must stay up and handle failures in an expected fashion.

NOTE: I have worked for EMC in the past and indirectly do now through VMware. In addition, I work very closely with Cisco. This article is not intended to put blame or criticism on any vendor. The goal is to educate people about issues that could be experienced, how they are discovered, and how they get resolved. I hope you find this information helpful.

© 2013 – 2014, Steve Flanders. All rights reserved.
