Why I Don’t Trust Averages

I had a realization recently, after years of looking at performance data in VMware: the latency reported for a datastore, for example, is averaged across every IO issued to that datastore, regardless of IO size. At first glance this does not look like a flawed statistic. After all, we want to know what the latency is, and we tend to think not in terms of individual IOs but in terms of time series, with one point value per interval. For example, our resolution may be 10 seconds, which gives us 6 samples for every minute. Each sample is an average over ALL IOs that occurred in that 10-second interval, and there could easily have been a million of them.
Naturally, when we aggregate things we lose information. A good real-world example is counting cars on the road. We can simply count every vehicle passing a certain point, then divide that number by the length of the sampling period, and we have an average number of cars per minute, per second, etc. What we don’t have is the type or size of each vehicle, only a generic number of vehicles. In reality we could have sedans, light trucks, buses, commercial tractor trailers, etc. Think of these as possible categories of vehicle. Often it is less useful to know how many vehicles in total passed some point during a sample period, and more useful to know how many of each kind passed. We may be trying to measure the stress that a given point in the road sustains, and knowing how many heavy vehicles, those over 3 tons for example, pass during any sample window is a key metric, because we know that these exert the greatest stress on the surface. What this turns into is known as a distribution. We distribute the count of all vehicles across categories of vehicle, which ends up being far more granular than a single metric: “total count.” For example, we may determine that over 1 minute we had:

light truck/suv => 10
sedan => 35
bus => 1
freight liner => 4
Clearly, this is more insightful than 50 vehicles counted during a 1 minute sample.
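
The vehicle example can be sketched in a few lines of Python. This is only an illustration: the `observed` list is a made-up stand-in for whatever a roadside counter would record, built from the same numbers as the list above.

```python
from collections import Counter

# Hypothetical one-minute sample: one entry per vehicle that passed the counter.
observed = (
    ["light truck/suv"] * 10
    + ["sedan"] * 35
    + ["bus"] * 1
    + ["freight liner"] * 4
)

# The summary statistic: a single number with no notion of vehicle type.
total = len(observed)

# The distribution: the same vehicles, counted per category.
by_category = Counter(observed)

print(f"total vehicles: {total}")
for category, count in by_category.most_common():
    print(f"{category:>16} => {count}")
```

Both results come from the same observations. The total simply throws away the category that the distribution keeps.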

This notion is fairly universal, and completely applicable to our VMware problem. VMware currently reports just a single average latency for any given period of time, and this hides a very important fact: we don’t know how that latency is distributed across all the IOs that occurred during the period. At any moment, multiple virtual machines running on hosts in your virtual environment are issuing IOs of mixed sizes, with different latency requirements, and all are competing for a slice of IO time from the same shared storage. Some may be applications doing database searches or updates, some may be running index scans, some could be reading through ASCII data, and so on. At the same time others may be moving gigabytes of data, in other words batch work that we think of in terms of throughput, not latency. Anything related to a database lookup or update, on the other hand, is viewed as latency-sensitive, and we expect it to happen quickly. But how do we know how many of these different kinds of IOs we had, and whether latency for the small, latency-sensitive IOs was indeed lower than that of the large, throughput-biased IOs? We cannot tell. This data has been lost in the summarization and aggregation of information. Yet it is vital, and would tell us much more than a general latency statistic.
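
To make this concrete, here is a minimal sketch with invented numbers. The `samples` list, the two IO sizes, and the latencies are all hypothetical and chosen only to show the effect. They are not measurements from VMware or any other system.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (io_size_bytes, latency_ms) pairs from one 10-second interval:
# many small, latency-sensitive IOs and a few large, throughput-oriented ones.
samples = [(4096, 0.5)] * 900 + [(131072, 20.0)] * 100

# What a single per-interval latency statistic would report.
overall_avg = mean(latency for _, latency in samples)

# The same samples, averaged separately for each IO size.
by_size = defaultdict(list)
for size, latency in samples:
    by_size[size].append(latency)
per_size_avg = {size: mean(latencies) for size, latencies in by_size.items()}

print(f"overall average latency: {overall_avg:.2f} ms")
for size, avg in sorted(per_size_avg.items()):
    print(f"{size:>7} bytes: {avg:.2f} ms over {len(by_size[size])} IOs")
```

With these made-up inputs the single average describes neither group: it is several times higher than what the small IOs experienced and far lower than what the large ones did.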

We should never assume that every IO completes with about the same latency as the average reports. This is simply not the case, ever. What we want is some confidence that latency-critical IOs complete with lower latency than less sensitive ones. Some vendors won’t be able to tell you this, VMware being a good example, though the data is certainly available in the system; it is just not presented. Our BrickStor appliance can offer a granular breakdown of this information, and I have found that at times it is invaluable for troubleshooting performance issues.

To drive this point home, take a look at the distribution below. The first line is a summary statistic.

Total IOs: 1982721   Total KBytes: 79032802   Avg. bytes/IO: 40817
IO size (bytes):
           value  ------------- Distribution ------------- count
              64 |                                         0
             128 |                                         14
             256 |                                         0
             512 |                                         0
            1024 |                                         0
            2048 |                                         0
            4096 |@@@@@@@@@                                426155
            8192 |@@@                                      153245
           16384 |@@                                       110063
           32768 |@@@@@@@                                  347320
           65536 |@@@@@@@@@@@@@@@@@@                       903297
          131072 |@                                        42837
          262144 |                                         0
This was captured on a test system running a load of mixed IO sizes. The value column lists IO-size buckets, and each IO is counted in the bucket closest to its size; the 4096-byte bucket, for example, holds IOs larger than 2048 bytes but smaller than 8192. Notice that we had a number of IOs in the 128KB range, the 131072-byte bucket, yet the summary statistic reports an average of roughly 40KB. If all we had was the summary, the fact that some IOs are at least 3 times larger than the average would have been unknowable. Yet it is this piece of data that could be key to solving a performance problem, and we might never know.
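
As a rough sketch of how one might interrogate such a histogram, the snippet below copies the bucket counts from the output above into a dictionary and asks two questions the summary line cannot answer. The variable names are mine, and the only inputs are the counts and the reported average shown above. Note that the bucket counts add up to slightly more than the "Total IOs" figure, so the sketch uses the bucket sum as its denominator.

```python
# Bucket counts copied from the distribution above (bucket in bytes -> IO count).
histogram = {
    64: 0, 128: 14, 256: 0, 512: 0, 1024: 0, 2048: 0,
    4096: 426155, 8192: 153245, 16384: 110063, 32768: 347320,
    65536: 903297, 131072: 42837, 262144: 0,
}
reported_avg = 40817  # "Avg. bytes/IO" from the summary line

# Sum of the bucket counts, used as the denominator for the shares below.
total = sum(histogram.values())

# Which bucket saw the most IOs?
busiest = max(histogram, key=histogram.get)

# How many IOs landed in buckets labeled larger than the reported average?
above_avg = sum(count for size, count in histogram.items() if size > reported_avg)

for size, count in histogram.items():
    if count:
        print(f"{size:>7} {count:>8} {100 * count / total:5.1f}%")
print(f"busiest bucket: {busiest} bytes")
print(f"IOs in buckets above the average: {100 * above_avg / total:.1f}%")
```

The busiest bucket is not the one nearest the reported average, and a large share of the IOs sit in buckets above it, which is exactly the kind of detail a single average hides.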

I find that people often don’t have the respect for distributions that they should. If some readers hold a similar view, I hope this article changes their minds, or at least gives them more reason to consider how useful distributions can be.