Over the last few weeks I was engaged in an issue where all VMs backed by an NFS datastore in an environment experienced several seconds to over two minutes of latency (i.e. an inability to write to the filesystem) at approximately the same time. The write delay was significant enough that it resulted in VMs crashing.
The configuration at a high level consisted of ESXi hosts with NFS datastores connected to a pair of VNXs. The VNXs consisted of SATA drives and, per EMC best practices, were configured as RAID 6. In addition, FAST VP was enabled, which requires the use of storage pools instead of RAID groups. It should be noted that storage pools are currently stated as the best-practice configuration, as they ease administration and configuration and allow for more advanced options such as FAST VP. The network between the devices converged on a pair of core switches.
Based on the symptoms, it seemed logical to rule out ESXi, as the hosts were spread across different switches, different vCenter Servers, and different versions of ESXi. Since both storage arrays were impacted at approximately the same time, it also seemed logical to rule out the storage arrays. This left the network, and specifically the core switches. The problem with that theory was that the core switches had been running stable, with the same software version and configuration, for some time.
So what was going on?
The root cause was slow IO to the backend disks due to the FLARE code implementation of a single-threaded RAID 6 backfill operation within a storage pool environment. The single-threaded nature of this algorithm causes other IO operations to queue up while the backfill operations are taking place. Those queued operations include host IOs as well as internal array TLU/DLU operations, which accumulate and eventually slow host IOs to greater than 60 seconds in some cases.
RAID 5 and RAID 6 constantly attempt to perform full stripe writes for better performance. When only partial cache pages are being written, a backfill operation is performed: READ operations are issued to the disk(s) to fill the rest of the cache page. With RAID 6, the CACHE driver hands this operation off to the RAID driver, which performs the READ operations single-threaded and locks the cache page until all reads complete. Under intense workloads that generate lots of backfill operations (only visible via Ring Buffer Analytics, or RBA, traces), slow backend writes can result.
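To make the cost of that serialized hand-off concrete, here is a minimal, purely illustrative Python sketch (a toy model with made-up latency numbers, not EMC's FLARE code): the reads needed to backfill a cache page are issued one after another, so the page stays effectively locked for one disk-read latency per missing block.

```python
# Toy model (not EMC code): serialized backfill reads while the cache page is held.
import time

DISK_READ_LATENCY_S = 0.005  # assumed ~5 ms per backend disk read (illustrative)

def read_block_from_disk(block_id):
    """Simulate one backend disk read needed to backfill the cache page."""
    time.sleep(DISK_READ_LATENCY_S)
    return b"\x00" * 4096  # placeholder block data

def backfill_single_threaded(missing_blocks):
    """Reads issued serially; the cache page is 'locked' for the whole loop."""
    start = time.time()
    data = [read_block_from_disk(b) for b in missing_blocks]  # one at a time
    return time.time() - start, data

if __name__ == "__main__":
    elapsed, _ = backfill_single_threaded(list(range(12)))  # 12 missing blocks
    print(f"serial backfill held the cache page for ~{elapsed * 1000:.0f} ms")
```

With 12 missing blocks the hold time is roughly 12x a single disk read, and under heavy load those holds stack up behind each other.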
The fix comes in Block OE Rev 32 and higher (also known as Inyo) code. With this code, RAID 6 backfill has been changed to match what is done for RAID 5 backfill: the operation is no longer handed off to the RAID driver. Instead, it remains with the CACHE driver, which issues the READ operations multi-threaded, in parallel. The result is much faster backfill operations and faster overall WRITE performance.
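Continuing the same toy model (again, an assumption-laden sketch rather than actual array code), issuing the backfill reads in parallel means the cache page is held for roughly one disk-read latency instead of one per missing block.

```python
# Toy model (not EMC code): backfill reads dispatched in parallel.
import time
from concurrent.futures import ThreadPoolExecutor

DISK_READ_LATENCY_S = 0.005  # assumed ~5 ms per backend disk read (illustrative)

def read_block_from_disk(block_id):
    time.sleep(DISK_READ_LATENCY_S)
    return b"\x00" * 4096

def backfill_multi_threaded(missing_blocks):
    """All backfill reads are dispatched concurrently."""
    start = time.time()
    with ThreadPoolExecutor(max_workers=len(missing_blocks)) as pool:
        data = list(pool.map(read_block_from_disk, missing_blocks))
    return time.time() - start, data

if __name__ == "__main__":
    elapsed, _ = backfill_multi_threaded(list(range(12)))
    print(f"parallel backfill held the cache page for ~{elapsed * 1000:.0f} ms")
```

In this model the same 12-block backfill completes in about one disk-read latency, which is the spirit of the Inyo change described above.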
Unfortunately, a Primus case does not exist for this issue; it is only listed in the FLARE 32 (Inyo) release notes as “RAID 6 performance improvements”. Please note that I am not advising that you upgrade to FLARE 32 (Inyo) because of the issue above. As with any upgrade, you should consult with your vendor's professional services team and perform extensive testing before rolling out any upgrade in a production environment.