How Not to Evaluate Storage
Over the years it has become clear to me that most people do not do a good job of evaluating shared storage systems, whether a SAN, big or small, a NAS appliance, or whatever else it might be. I want to discuss some of the things we do that we should not, and some of the things we don’t do that we really should.
Let me start with the fact that we tend to really like using load profiling tools, the most common of which for those using Windows is IOMeter. There is nothing inherently wrong with these tools. They honestly generate the kind of work that we ask them to do. The problem is that we never ask for anything that is in any way representative of the real world. What we all have a tendency to do is open up one of the profiles, tweak some knobs and slide some sliders to their midpoint, as though we have a good idea what they mean, and off to the races we go. Used this way, these tools say NOTHING about what we actually do in our environment. What these tools are good for is temporal analysis of a system already in production, to gauge relative performance at different times of day, week, month, etc. Most environments will have swings between busy and mostly idle periods, like patching weeks, holiday weeks, etc. Analyzing relative differences is something these tools can help us do. For example, they may clue us into the fact that between 5PM and 10PM our storage system has its worst response time, for whatever reason. Beyond this, these tools lack any practical purpose.
Conclusion to be drawn here is, unless your job is to run test after test after test and measure long-term stability and response quality of your storage system, don’t do this.
Some argue that they can faithfully reproduce their entire environments with a load generation tool. I will argue that as long as you have more than ONE thing accessing your shared storage concurrently, you will not model the workload well. If you are lucky and understand very well what all your systems do, and when, you may have something that can approximate steady state, but beyond that, forget about it, you are playing a game you cannot win. Reality is, most environments are complex, with hundreds or thousands of systems doing different things, often in different ways. Avoid doing this, it won’t buy you much.
VMware environments, or really any multi-tenant hosting environment, are a particularly difficult beast, as they mix and blend IO from many, many Virtual Machines, with perhaps a half-dozen operating systems, different applications running on each, with different degrees of load at different times of day and night, and all this could be hitting a single storage system. Modeling what happens in these environments is close to impossible. At any moment someone may decide to do a backup of a multi-terabyte database and your entire performance profile drastically changes. We have all seen those sudden IO increases and consequent Everest-like latency spikes, and wondered what happened there!?
Somehow when we evaluate products we tend to think that synthetic tools will help, and of course, we will try to run some profile which we ran before on our existing system, and we get some result and it might really look good. We compare this against the result we got from running the same test on our existing production system and can see significantly better numbers. First reaction is, great, this is a significant improvement over my existing system. But we forget that when we ran this load and got numbers from our existing system, it already had thousands of other things to do. It likely had a lot of “other” stuff in the cache, its queue was filling with work from production clients, and we could have been running this test at the same time as someone was doing some heavy IO work, which we did not even know was happening. Point is, when we test a system that does JUST ONE THING against one doing hundreds, it is like trying to compare apples and potatoes. Don’t do this.
Speaking of doing just one thing… When we are asked to evaluate a shared storage product, regardless of vendor, more often than not we will set up a “test machine”, one that acts as a client, and we will use it to do all our tests. But is that what we actually do in our production? I am fairly sure most will say no. Again, this is a shared system, meaning by design it is intended to support multiple clients. We cannot make the assumption that if doing some unit of work takes “X” time, then doing the same work from two clients will take 2 * “X” time, and from 4 clients 4 * “X” time. Shared storage systems do not work this way. Scaling is not linear. They are designed to scale, and are more efficient with more clients doing work concurrently, especially when the work is steadily fed to these systems, vs. everything happening at once and then hours of practically nothing happening. Backups might be a good example of the latter. So do not test from a single machine. If you are testing a shared storage system, be it with NFS, iSCSI, SMB, etc., try to have a number of clients doing different things and test some practical thing, such as perhaps a dump of a large database, or some analytical process that is typically very heavy and takes a long time on your current system, etc.
Do not assume that you can compare your production system and your test system side-by-side. Instead, try to identify some reproducible unit of work that will exercise both the data and metadata capabilities of a storage system, and scale this workload by increasing the number of concurrent workers. For example, say you have a large database query that you know poses a problem on your existing production system. Take this query and try running it concurrently on a number of clients connected to the evaluation system. Repeat this exercise a number of times, each time adding more workers.
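To make the idea of "same unit of work, more workers each round" a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `unit_of_work` function is just a placeholder that writes and reads back a file, and the workers are threads on one machine. In a real evaluation you would swap in the actual job you care about and spread the workers across several client machines, for the reasons discussed in this article. Also keep in mind that the read-back here will likely be served from the client's own page cache.

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor


def unit_of_work(directory, worker_id, size_bytes=1 << 20):
    """Placeholder job: write a file, fsync it, read it back, delete it.

    Hypothetical stand-in for the real work (a query, a dump, ...).
    Returns the elapsed time in seconds.
    """
    path = os.path.join(directory, f"worker-{worker_id}.dat")
    payload = os.urandom(size_bytes)
    start = time.perf_counter()
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())  # push the data past the client's write buffer
    with open(path, "rb") as f:
        while f.read(64 * 1024):
            pass
    os.remove(path)
    return time.perf_counter() - start


def run_scaling_test(directory, worker_counts=(1, 2, 4, 8), size_bytes=1 << 20):
    """Repeat the same unit of work, adding concurrent workers each round."""
    results = []
    for n in worker_counts:
        wall_start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=n) as pool:
            durations = list(
                pool.map(lambda i: unit_of_work(directory, i, size_bytes), range(n))
            )
        results.append({
            "workers": n,
            "wall_seconds": time.perf_counter() - wall_start,
            "mean_task_seconds": sum(durations) / n,
            "max_task_seconds": max(durations),
        })
    return results


if __name__ == "__main__":
    # Point this at a directory on the storage under test; a temporary
    # directory is used here only so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as d:
        for row in run_scaling_test(d):
            print(row)
```

What you want to watch is how the per-task time changes as the worker count grows, not the absolute numbers from any single round.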
Avoid the temptation to use just a single client. Try, as much as possible, to add clients. With a single client, at some point the client itself becomes your bottleneck, because most often a shared storage system can handle more IO than any one client can supply. What in effect happens is you starve the storage system, yet think that it is the bottleneck, where in reality the bottleneck is the client. Stop adding clients when you begin to see a drop in overall responsiveness and perhaps a steep increase in the duration of the work you are running. If you are evaluating two or more products side-by-side, try the same process on all of them and see which fares best. Vendors may know some things about why they did or did not perform well on a given test, and you should never assume that one test fairly characterizes a system. Ask the vendor questions. If they cannot give you a reasonable answer, think hard about whether their product makes sense for you.
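One way to decide when to stop adding clients is to look at the numbers from each round and find the point where aggregate throughput stops improving or the duration of the work jumps. The sketch below is only one possible rule of thumb, not a standard method: the function name, the 1.5x slowdown threshold, and the sample measurements are all made up for illustration.

```python
def find_saturation_point(measurements, max_slowdown=1.5):
    """Return the last client count before the system appears saturated.

    measurements: list of (clients, seconds_per_unit_of_work), sorted by
    client count, where every client performs one unit of work per round.
    A round counts as saturated when aggregate throughput drops compared
    to the previous round, or when the duration grows by more than
    max_slowdown times in a single step (an arbitrary threshold).
    """
    for (prev_clients, prev_secs), (clients, secs) in zip(measurements, measurements[1:]):
        prev_throughput = prev_clients / prev_secs
        throughput = clients / secs
        if throughput < prev_throughput or secs > max_slowdown * prev_secs:
            return prev_clients
    # Never saturated within the tested range.
    return measurements[-1][0]


# Invented numbers: the work takes about as long up to 8 clients,
# then more than doubles in duration at 16.
sample = [(1, 10.0), (2, 10.5), (4, 11.0), (8, 13.0), (16, 30.0)]
print(find_saturation_point(sample))  # prints 8
```

With numbers like these, 8 clients is about where I would stop and start asking why 16 behaves so differently.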
Devise a series of tests, testing various things that you feel are common in your environment. Focus on latency response in some tests, and on the ability to move large amounts of data in others. There is no doubt that some systems will do better with one type of work and some with the other. Try to repeat tests multiple times, especially those that do a good bit of reading. Caches are there for a reason: recently created data is touched frequently while it is fresh, and as it ages accesses to it slowly dwindle. In real production environments caching is vital and you need to know whether some systems are better at it than others. If you see some system do better after the second or third run of the test, or as you keep increasing concurrency, cache is likely affecting results, and this is good. Don’t try to eliminate cache from your tests, because caching is something some vendors heavily design into their systems, because they understand how shared storage environments behave the majority of the time.
Everything happens at once, sometimes. The term I have now heard a number of times is “IO blender effect”. What this means is we get some completely unpredictable mix of IO sizes, sequential streams, random accesses and everything in between. In reality no matter how sequential your workload is, as soon as you have two or more of these streams, your workload might as well be 100% random. All shared storage systems must be champs at dealing with completely random, mixed size IO, and we need to test this.
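A toy illustration of the blender effect, under a simplifying assumption: four clients each read their own region strictly sequentially, and their requests arrive at the storage system in round-robin order. Real arrival order is far messier, and the function names and block numbers here are invented, but the point survives. The sketch measures what fraction of requests land right after the previous block, which is roughly what "sequential" means from the storage side.

```python
def sequential_stream(start_block, length):
    """Block numbers one client requests when reading sequentially."""
    return [start_block + i for i in range(length)]


def round_robin_blend(streams):
    """Interleave equal-length streams one request at a time.

    Assumes strict round-robin arrival, which is a simplification.
    """
    blended = []
    for requests in zip(*streams):
        blended.extend(requests)
    return blended


def sequential_fraction(blocks):
    """Fraction of requests that immediately follow the previous block."""
    if len(blocks) < 2:
        return 1.0
    hits = sum(1 for prev, cur in zip(blocks, blocks[1:]) if cur == prev + 1)
    return hits / (len(blocks) - 1)


# Four clients, each reading 1000 blocks from widely separated regions.
streams = [sequential_stream(base, 1000)
           for base in (0, 1_000_000, 2_000_000, 3_000_000)]

print(sequential_fraction(streams[0]))                  # prints 1.0
print(sequential_fraction(round_robin_blend(streams)))  # prints 0.0
```

Each client on its own is perfectly sequential, yet not a single request in the blended stream follows the one before it.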
Think about what happens when you reboot a VMware host and with it tons of VMs all begin starting up at once. Yes, this is a worst case scenario, and yes, this is usually mitigated by staggered start-up, etc., but assume you will have to deal with this someday. If you are buying a shared storage system, chances are good there will be times when everything happens at once. Maybe you host VMs on it, and maybe patches are pushed to hundreds of machines all at once. It happens. Try to evaluate how the system responds when this happens. Does it hit a cliff and everything begins to time out, or is the decline gradual? Don’t do this with just one client, but have a fair number of clients doing work concurrently. The goal is to reproduce the production environment, not some contrived test which never happens in reality.
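If you want to script an "everything at once" moment, one approach is to hold all the workers at a barrier and release them together, then count how many operations blew past the timeout you care about. This is a rough Python sketch under a few assumptions: `run_burst` and its `task` argument are hypothetical names, the workers are threads on one machine (in practice you would coordinate several client machines), and `task` should be replaced with real IO against the system under test.

```python
import threading
import time


def run_burst(task, workers, timeout_seconds):
    """Release all workers at the same instant and record their latencies.

    task: callable taking the worker index; a stand-in for real IO work.
    timeout_seconds: latency beyond which an operation counts as timed out.
    """
    barrier = threading.Barrier(workers)
    latencies = [0.0] * workers

    def worker(index):
        barrier.wait()  # everyone starts together
        start = time.perf_counter()
        try:
            task(index)
        finally:
            latencies[index] = time.perf_counter() - start

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    return {
        "workers": workers,
        "max_latency": max(latencies),
        "over_timeout": sum(1 for lat in latencies if lat > timeout_seconds),
    }


if __name__ == "__main__":
    # Repeat with larger bursts and watch whether over_timeout creeps up
    # gradually or jumps from zero to nearly everything.
    for n in (4, 16, 64):
        print(run_burst(lambda i: time.sleep(0.01), workers=n, timeout_seconds=1.0))
```

A gradual rise in the count as the burst grows is the graceful decline; a jump from zero to most of the workers is the cliff.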
It is not easy to do this stuff right, but try to focus on measuring useful work, while simulating your production environment as closely as possible. There is not one thing we can do to have a good and meaningful evaluation of a system, but these dos and don’ts should help.