IO Queue Hypotheses and Review of Collected Data

This is the second post in a mini-series of sorts, which I decided was better than writing a massive thesis on the subject of queueing, in particular as it relates to storage. While I am collecting observations using tools which are specific to BrickstorOS and RackTop products, my goal is to make this completely generic and applicable to any storage system, or for that matter any system that has and uses storage, which of course is pretty much any computing device. Of course, this may be less applicable to your smart phone than your computer which has actual disk drives or more likely SSDs at this point. As such, your mileage may vary.

In my last post I gave a summary of queueing and shared my assumptions and expectations around natural characteristics of queueing and what changes we should be expecting as queue sizes increase, generally speaking as a result of the underlying system getting busier. What I am hoping to do with this post is to expand a little bit on a few details around my experiments and share a few measurements that I collected/observed as I have been doing this hypothesis validation. My hope is that I can describe in some detail what I would expect to see as well as whether or not my expectations are being reflected in the data.

Let’s make sure we understand a little bit about configuration. I am using fio, which is a load generation tool to generate a very synthetic, workload. What this means is that it really does not model the much more organic IO patterns resulting from usage of shared storage systems. But, for this research I did not feel the actual IO patterns really matter all that much, what matters more is understanding behavior under fairly constant and steady load first, then drawing conclusions in attempt to reinforce these conclusions using information collected from other systems that have organically generated IO load instead of a load generator like fio. The hardware underlying this test is a fairly powerful system with pooled storage. There are 14 drives in total. Again, for the purposes of my research this does not really matter. The more disks the more IO could be done, but I do not expect that characteristics of software and logic in the drives would change in any way. There is of course a difference between pooled storage and single disk, so I will refrain from any attempts to extrapolate my observations to a single drive.

Unlike with traditional shared storage where we generally assume a network accessible system, my testing is all local, this reduces complexity that networks contribute and makes it easier to observe the system without having to think about the layers networking adds. I am collecting all data using dtrace, which is a profiling machinery built into our storage system. Dtrace exists in public space as well, and all scripts I used are portable, with some level of effort, at least to other systems that have dtrace capability and use ZFS as the filesystem.

Here I just want to share some data directly out of dtrace, and these are histograms flipped sideways, which works better with the unix shell. These data are all textual of course, which inherently limits some of the presentation capabilities.

First of all, I wanted to make sure IOs issued were of the expected kind. What I mean by this is that I expected there to be mostly Async Writes and Sync Reads. An Async Read is basically a prefetch operation, in other words read-ahead or caching. Usually when IO is highly random read caching is not meaningful. In this case IO is highly random, by design. This first histogram shows that most data was assigned value 3, which means Async Write and second largest is 0, which is Sync Read. So far so good.

           Distribution of IOs by ZIO priority [p01]
           SyncRD => 0 | SyncWR => 1 | AsyncRD => 2 | AsyncWR => 3 | Scrub => 4
           value  ------------- Distribution ------------- count
             < 0 |                                         0
               0 |@@@@@@@                                  187433
               1 |                                         2424
               2 |                                         8898
               3 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        902556
               4 |                                         0

Next, I wanted to know how many items are in the Run Queue of each drive. This should be roughly identical for each drive in the pool, give or take. If there are problematic drives, this will certainly be skewed, otherwise we should assume roughly uniform distribution. Specific to ZFS there is a pipeline process which handles IOs taking into account their priority. Each IO can have one of a five different priorities. Sync and Async Write, Sync and Async Read and Scrub are priorities that we would expect to see. These queues are ordered from most to least critical and each time queues are processed, IOs are retrieved from most critical queue first and least critical queue last.

These queues are balanced in such a way to make sure that they all get the right level of attention, and there is a mechanism to bias in favor of any one of these. One might choose to bias if their system must give preference to sync writes over anything else, for whatever reason, etc. Queues have a minimum and maximum number of IOs that could be issued from each to a device. As IOs are issued from these queues, they reduce in size for brief moments and then grow again as new IOs come in. In general I expect to see a pretty wide distribution queued IO counts, even under fairly steady load.

Measuring size of queue for each individual drive reveals that commonly only a single item is queued, and this makes sense intuitively. Disk drives really can only do one thing at a time, and as such it makes it less practical to stuff the queue when device is the controlling factor. Sure drives support deep queues now-a-days, but we can generally fill these queues faster than drives can process them, and so keeping drive queue fairly low makes sense. This also allows us to spend time and order the wait queue so as to reduce seeking as much as possible, which improves completion latency overall.

           Run Queue Size (by VDEV) Avg: 1 Max: 12 StdDev: 2
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@                               23910
               1 |@@@@@@@@@@@@@@@@@@@@@                    47621
               2 |@@@@@@                                   13941
               4 |@                                        3341
               8 |@                                        3014
              16 |                                         0

Based on configuration of the system, we should not see much greater than 1000 active IOs at any given time. The way we measure number of IOs at any moment is by counting IOs across the priority queues in ZFS and summing them up. Below is a distribution resulting from this counting. As we can see there are periods when there aren’t any IOs queued, and then we see IOs that fall into the 1024 bucket, which really means they are all values that did not fit into former bucket, which is 512. In other words 1024 bucket accounts for measurements >512 and 1024< .

Average seems to be at 320, but this average is skewed by the longer tail to the left of the mean. What is evident here is that the average falls between 256 and 512 buckets, and there is only one more non-zero bucket above 512, yet there are 9 buckets with non-zero values below 256. This average value is less meaningful because it is getting skewed by large observations.

           Wait Queue [p01] Size Avg: 320 Max Size: 1396 StdDev: 354
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@                                      68887
               1 |@                                        16158
               2 |@                                        25974
               4 |@                                        39928
               8 |@@                                       60728
              16 |@@@                                      87116
              32 |@@@@                                     111287
              64 |@@@@                                     99211
             128 |@@@@                                     113562
             256 |@@@@@@@                                  181826
             512 |@@@@@@@@@                                235202
            1024 |@@                                       61432
            2048 |                                         0

I did not expect to see very normal looking distribution, but when standard deviation is larger than the mean, we cannot make various assumptions that we could make about the data if the distribution appeared normal. For example, we cannot with any confidence say that 95% of data is within 2σ of the mean value. This histogram does suggest that there is a lot of variability, and this is probably in large part due to dynamic nature of systems.

I expected this distribution to be a little more normal, with peak somewhere closer to the middle of the data, but clearly here the peak is actually in the tail. My reasoning here is based on notion that as drives get busier their performance is at some point negatively affected, which likely results in fairly frequent reduction in performance until we reach the maximum possible size of queue, which on histogram above is represented by the 1024 bucket. Most of the time it looks like we have between 512 and 1024 IOs in wait queue.

I expected that IO is naturally bursty, in that we are unable to sustain a long run queue for more than brief periods of time. This largely depends on the type of IO, IO size, cache success, etc. Run Queue Count to Run Queue Latency Relationship chart was produced to support or to invalidate my thinking. What this graph seems to suggest is that Run Queue goes down in size with increasing latency. Increasing latency should mean that less and less is getting done, because each operation demands more time. If less and less is getting done, run queue size should decrease and this should result in wait queue size growing.

More basically, one of the questions I asked is whether I can rely on metrics like the mean, and on normality of the data and the answer is it depends. I found that as expected, once enough data points of count measures have been collected, CLT (Central Limit Theorem) kicks in and even though distribution within each sample interval is quite different from other intervals, averaging out thousands of such samples results in a fairly normal looking distribution. Some data appear to actually be fairly normally distributed. Approximately Normal Distribution of IO Counts shows this by comparing the normal curve with our data. The histogram is a bit jagged, but data is quite obviously supportive of normal distribution. The Latency Overlay on Distribution of IO Counts chart also supports the argument that most work is done when latency is lowest, and we can see that here with Red and Olive datapoints being two lowest latency quantiles accounting for large portion of recorded observations.

More robust methods are needed to correctly make sense of some of the data. This feels like a good learning and tool validation. Visualizing this data, going beyond just the means and ranges helps to get a better sense for distributions and idiosyncrasies of the measurements.

There is much more to be said, and my goal for this post was to share my thinking process and give myself an opportunity to think through what I should be doing next as well as give some details about what I am seeing so far. In future posts I will get into much more detail about what I believe the data is telling us.