Dedicated ZFS Intent Log (aka SLOG/ZIL) and Data Fragmentation
A week or so ago I had a good discussion with a rather brilliant colleague of mine about how synchronous IO is written out to the pool. A bit of background is in order first; we covered this in an earlier post, so here's a quick refresher. Because of ZFS's transactional nature, synchronous IO is not treated the same way as asynchronous IO. Async IO is easy to handle, because the data does not have to be on disk before we can accept more IO. It is effectively buffered in the ARC (system memory) and written out to disk in a nice, orderly transaction group. This is sometimes called IO coalescing, and it is a good thing: it leads to more contiguous on-disk data and much smoother performance. Synchronous IO breaks this model. When a client expects data to be committed to disk immediately, we cannot simply place it in cache and tell the application that all is good and the data is safe. That semantic is at odds with the idea of transactions and coalesced IO.
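To make the coalescing argument concrete, here is a toy simulation (plain Python, not ZFS code; every name in it is mine, invented for illustration) of two writers sharing a simple bump allocator. When each synchronous write is allocated the moment it arrives, the two writers' blocks interleave on "disk"; when the same writes are buffered and flushed together as one transaction group, each writer's data lands in a single contiguous run:

```python
def extents(blocks):
    """Count contiguous runs in a sorted list of block numbers."""
    runs = 1
    for a, b in zip(blocks, blocks[1:]):
        if b != a + 1:
            runs += 1
    return runs

cursor = 0  # next free block in our toy pool

# Sync path: every write hits the allocator immediately, so two
# concurrent writers (A and B) interleave their allocations on disk.
a_sync, b_sync = [], []
for _ in range(8):
    a_sync.append(cursor); cursor += 1
    b_sync.append(cursor); cursor += 1

# Async path: the same writes are buffered in memory first, then flushed
# together as one transaction group, so each file's blocks stay contiguous.
a_async = list(range(cursor, cursor + 8)); cursor += 8
b_async = list(range(cursor, cursor + 8)); cursor += 8

print(extents(a_sync), extents(a_async))   # 8 1
```

Eight scattered extents versus one: a gross simplification of the real allocator, but it captures why losing the ability to coalesce hurts on-disk layout.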
It is no secret that overall storage performance degrades quite rapidly under a synchronous workload; this is very easy to observe. One good way to demonstrate it is to set a dataset's sync property to disabled. Performance improves by orders of magnitude almost immediately, because ALL IO is then treated as async IO. This is a dirty cheat, though one that may be very useful in some specific cases. Doing the opposite, setting the property so that ALL IO is treated as synchronous, gives a very good example of how quickly one can overrun a system that seems to run like a champ under normal circumstances. To be clear, this all assumes that no dedicated ZIL device is present in the pool.
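For reference, toggling this behavior looks something like the following; the dataset name tank/data is just a placeholder:

```shell
# WARNING: sync=disabled acknowledges synchronous writes before they are
# on stable storage; a power loss can discard data the application was
# told was safe. Use only where that risk is acceptable.
zfs set sync=disabled tank/data   # treat ALL IO as async (the dirty cheat)

zfs set sync=always tank/data     # treat ALL IO as sync (the stress demo)

zfs set sync=standard tank/data   # default: honor what applications request
zfs get sync tank/data            # confirm the current setting
```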
The unintended side-effect of not having a dedicated ZIL is fragmentation. My colleague and I both had a look at the source to get some idea of exactly what happens when IO is synchronous, and we realized that it is not treated in any special way: it is written out to disk using the same methods for finding free space as data written during a normal transaction commit. The bits are written out immediately and then wrapped into a transaction group to maintain transactional consistency. All data has to belong to a transaction, which is what enables the other great features of ZFS, such as snapshots, clones, the ability to roll the pool back to an earlier transaction number, and so on. However, the lost efficiency in IO coalescing leads to added fragmentation, made worse by smaller block sizes, or by small record sizes on filesystems exposed through NFS, for example. The smaller the chunks of data, the more fragmented the pool will become over time. I will readily admit that we have not studied this closely, so I cannot provide solid empirical evidence and numbers to back it up; a proper study is very much needed. We know this happens, but just how bad it is we cannot easily tell. The degree of fragmentation will also vary, and the factors are many.
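If you want a rough sense of how fragmented a pool has become, newer OpenZFS releases expose a fragmentation metric (strictly speaking, an estimate of how fragmented the pool's free space is, not the data itself); this assumes your platform is recent enough to have it:

```shell
# Pool-wide fragmentation estimate, shown as a percentage in the FRAG
# column; requires a reasonably recent OpenZFS.
zpool list -o name,size,cap,frag tank

# The same value is available as a pool property.
zpool get fragmentation tank
```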
Having a dedicated ZFS Intent Log protects against this fragmentation by avoiding writes to the main pool outside of transaction flushes. We want as much order in synchronizing transactions to the disks as possible: disks do not like to seek, and consistent transactions reduce seeking significantly. The point to take away here is that there are no formal tools to defragment a fragmented pool. There are ways to do it, but they are neither straightforward nor something we can easily control. We always recommend deploying a dedicated ZIL, for several reasons; this is just one more reason why the added cost of a dedicated ZIL is well worth it. I created an interesting diagram which shows various parts of the code where a lot of the ZIL magic happens, with the function zil_create at its center: zil_create Butterfly Graph.
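For completeness, attaching a dedicated log device looks something like this; the device names are placeholders, and mirroring the SLOG is common practice so that a single device failure cannot lose in-flight synchronous writes:

```shell
# Add a mirrored dedicated intent log (SLOG) to pool "tank".
zpool add tank log mirror /dev/disk1 /dev/disk2

# Verify placement: the devices appear under a separate "logs" section.
zpool status tank
```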