VMware Storage vMotion and the story of bypassed ZIL

The title may not make the subject obvious, so to be clear: I want to talk about what happens during the storage vMotions common to most VMware environments that use NFS-exposed filesystems as datastores on ZFS-based storage.

Readers may already be familiar with the NFSv3 client configuration on ESX(i), in any recent version of VMware. The nature of this configuration is that O_SYNC and O_DSYNC semantics are tightly enforced by the client, which means that data being committed really must be on stable storage, not in RAM, cache, etc., before VMware will acknowledge that a write in fact completed successfully. For any system exposing ZFS filesystems to VMware, this means that synchronous writes will have to be contended with. Put simply, what makes write performance good, in any context, is the ability to coalesce many IOs into a small number of groups that can be written to disk as large sequential blocks. What makes write performance bad is not being able to do this.

When a client requires that all writes obey O_DSYNC and O_SYNC semantics, it is effectively saying: I cannot accept any IO optimization, and I will live with the side effects of that choice.
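To make those semantics concrete, here is a minimal sketch of what an NFS client like ESXi is effectively demanding on every write. The file path is just a throwaway temp file for illustration; the point is that with O_DSYNC set, each write() call returns only after the data has reached stable storage, so the kernel cannot simply buffer it in RAM and acknowledge early.

```python
import os
import tempfile

# Minimal illustration of O_DSYNC semantics (path is just a temp file).
# With O_DSYNC, each write() returns only once the data is on stable
# storage, not merely sitting in a volatile cache.
path = os.path.join(tempfile.mkdtemp(), "dsync-demo.bin")
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o600)
try:
    os.write(fd, b"acknowledged only after commit\n")
finally:
    os.close(fd)

with open(path, "rb") as f:
    print(f.read().decode(), end="")
```

On a local filesystem this stable-storage guarantee is the disk's problem; over NFSv3 it becomes the storage server's problem, which is exactly where ZFS and the ZIL enter the picture.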

Unfortunately, there is a lot of monkey business going on today with how this is handled, and when something unexpected happens on the storage while VMware is still writing to it, the result can range from, at best, a small loss of data and perhaps an unrecoverable VM to, at worst, a major outage.

ZFS has, in my opinion, one of the better ways of dealing with this stringent on-stable-storage consistency requirement. The ZIL, commonly known as the slog (separate log) or logzilla (yes, not a typo), is a mechanism by which synchronous writes are immediately committed and tracked, while asynchronous writes can be treated more efficiently. I will ignore for now the fact that there is always a log for every dataset in a ZFS pool, and instead focus on enterprise-grade or commercial configurations, which may or may not use dedicated devices for the log.

Typically, if a dedicated log is configured on a system, the log device must be extremely low-latency, high-IOPS, and capable of maintaining that low latency under completely random IO of mixed sizes. Few SSDs on the market are good enough for this, and the devices most commonly used are DRAM-based, which eliminates the wear-leveling, housekeeping, and other potential performance killers found in any SSD. DRAM-based devices are normally byte-addressed, which means there is no sweet spot for IO request size.
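As a sketch of what such a configuration looks like, here is how a mirrored pair of low-latency devices would be attached as a dedicated log on a ZFS system. The pool name and device names below are placeholders, not a prescription for any particular system:

```shell
# Pool name "tank" and device names are hypothetical; substitute your own.
# Attach a mirrored pair of low-latency devices as the dedicated log:
zpool add tank log mirror c1t0d0 c1t1d0

# Verify the log vdev appears under the pool and is healthy:
zpool status tank
```

Mirroring the log is deliberate: the ZIL exists precisely to survive a crash, so an unmirrored log device becomes a single point of risk for in-flight synchronous data.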

So this ZIL, or log device, behaves like a very-low-latency buffer for the completely random data being pushed by a handful, or potentially dozens, of ESX hosts. Because this IO is mixed together from many sources, it is always completely random. The ZIL is a stable storage device, designed so that a sudden crash will not cause any data loss. ZFS fully leverages this capability and is able to turn completely random write-me-right-now requests into effectively asynchronous IO, which can be coalesced more elegantly and written far more efficiently, saving you cost in a number of ways, including longer disk life.

However, there is a bit of a dark side to this story. When VMware performs a storage vMotion, effectively a relocation of vdisks between datastores (whether on the same physical storage system or a different one, no matter), all IO still has to obey the O_DSYNC and O_SYNC rules, yet now we have massive 64K blocks of random data mixed with IO from normal operations, which is likely of all sorts of varying sizes, plus megatons of metadata updates, which can be small 512-byte operations. If there is no specifically assigned log device, ZFS is smart enough to bypass the ZIL for large writes and instead send that data directly to its final location on disk. It so happens that VMware tends to use a 64K IO size when it does storage vMotions. The cutoff for data sent through the ZIL is zfs_immediate_write_sz, which defaults to 0x8000, or 32768 bytes, meaning we push all of that data directly to disk without stopping to coalesce it in any way. If it were not for VMware's O_DSYNC and O_SYNC restriction this would still be OK, preferred perhaps, but we still have to commit to stable storage immediately and only then report completion to VMware.
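For those curious where that cutoff lives, here is a sketch of inspecting and adjusting it on an illumos-derived system; this assumes root access, and the 64K value shown is only an example, not a recommendation:

```shell
# Inspect the current zfs_immediate_write_sz cutoff (printed in decimal)
# on a live illumos kernel:
echo zfs_immediate_write_sz/D | mdb -k

# Example: raise it to 64K (0x10000) on the running system:
echo zfs_immediate_write_sz/W0x10000 | mdb -kw

# Make the change persist across reboots via /etc/system:
echo 'set zfs:zfs_immediate_write_sz = 0x10000' >> /etc/system
```

On Linux with OpenZFS, the equivalent knob is the zfs_immediate_write_sz module parameter under /sys/module/zfs/parameters. Either way, tune it only with a capable dedicated log device in place and with an understanding of the double-write trade-off described below.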

That limit saves us from writing massive amounts of data twice, which is what actually happens when there is no dedicated log device and IOs are smaller than 32K. But we still have to contend with other operations that need to occur at low latency, unlike storage vMotions, which are the equivalent of a batch job and should in theory be latency-insensitive. It is something like a NASCAR track with caravans of semis hauling goods while the stock cars are competing for position.

This would not be such a problem if you did not care about your VMs remaining online and accessible while storage vMotions happen. Depending on the size of the system and the number of concurrent storage vMotions, it is not difficult to completely overwhelm the storage and bring your whole environment to its virtual knees.

The more interesting question is whether this is solvable, and the answer is yes! The right answer is a dedicated log device or devices; again, only purpose-built devices are acceptable here. For those who don't have commercially supported appliances such as RackTop's BrickStor line of systems, there are guides online for how to handle this, including disabling the synchronous behavior entirely, which the writer of this post highly discourages. We of course work with our customers to design and build configurations appropriate for the workload and environment, and we include redundant ZIL configurations in any typical VMware environment, as well as whatever tuning is necessary to get the most performance without any increase in risk to data integrity.
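For reference, the per-dataset sync property is what those "disable it" guides point at. A sketch follows, with a hypothetical dataset name; the discouraged setting is shown commented out, only so it can be recognized and avoided:

```shell
# Check the current setting on the dataset backing a VMware datastore:
zfs get sync tank/vmware

# Default behavior: honor the client's O_SYNC/O_DSYNC requests.
zfs set sync=standard tank/vmware

# Discouraged: acknowledge synchronous writes before they reach stable
# storage. A crash or power loss can silently discard acknowledged data.
# zfs set sync=disabled tank/vmware
```

With sync=disabled, VMware believes its writes are committed when they are not, which is exactly the failure mode described earlier: a small window of lost data at best, an unrecoverable VM or worse if the timing is unlucky.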