The Right Architecture, Hardware, and Operations Plan
Several days ago a cloud provider in Australia, Dimension Data, had an outage lasting over 24 hours that was attributed to multiple issues with its EMC storage. Almost all technology will eventually suffer failures and outages, so it would be unreasonable not to plan ahead for them. This is why it’s so important to select hardware that meets your performance and availability requirements without over-complicating the situation. Our private cloud, myRacktop, is operated specifically for enterprise businesses, with customers across the world, including Australia. MyRacktop is powered entirely by our own products, so we never have an issue getting support or replacement hardware. In my opinion, though, that is not what is really at issue here. The big story is not that the equipment failed; that much is expected. The issue is that the product failure had a domino effect, or created a situation that couldn’t be quickly mitigated or recovered from.
When reviewing an architecture, look for redundancy and availability handled at different levels. Critical applications and services that cannot suffer any downtime should run actively in two locations on separate equipment, with availability handled at the application level. Further redundancy may be implemented at each location at the compute and storage level, if desired. A common example of this practice is email with Microsoft Exchange Server. You can run Exchange so that it operates from two separate data centers, and if one goes down completely users will not be affected. Alternatively, if you want availability or recovery from the failure of a shared storage or compute node, you can consider an N+1 redundancy architecture. If one of the nodes fails and is going to be offline for a prolonged period of time, a spare node can take over its services, either automatically or through a manual procedure. An example would be having a total of 5 compute nodes available while being able to handle the entire workload on 4 of them. Solutions such as VMware will automatically load balance and move VMs to recover from a failed compute node as long as there are enough compute resources available. This is where, as with storage, it doesn’t pay to make each node too big.
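The N+1 rule described above can be checked with a few lines of arithmetic. The following is only a minimal sketch in Python, assuming capacity and workload are measured in the same abstract units; the function name `survives_node_failure` and the sample numbers are hypothetical and are not taken from any product.

```python
def survives_node_failure(node_capacities, workload, failures=1):
    """Return True if the workload still fits after losing the largest nodes.

    Assumes the worst case: the `failures` biggest nodes are the ones that fail.
    """
    keep = max(0, len(node_capacities) - failures)
    remaining = sorted(node_capacities)[:keep]  # drop the largest nodes
    return sum(remaining) >= workload


# Five equal compute nodes, with the workload sized to fit on four of them.
cluster = [100, 100, 100, 100, 100]
print(survives_node_failure(cluster, workload=400))  # True
print(survives_node_failure(cluster, workload=450))  # False
```

Dropping the largest node first is a deliberately conservative choice: if the cluster passes this check, it also survives the loss of any smaller node.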
Too Big To Fail
If you make each compute node very large, you must keep a large amount of excess capacity sitting idle or underutilized. Using smaller nodes leaves less capacity unused during normal operations. Think about having the capacity to always handle 100% of the workload in the case of a node failure. If it takes either 2 really large nodes that cost $50K each or 10 small nodes at $10K each to do 100% of the work, it will be a lot more cost-efficient to use smaller nodes in an N+1 architecture, because you need an extra node to handle the failure of a node. Two large nodes means you need 3, which will cost you $150K, as opposed to $110K for 11 smaller nodes. This same logic holds true for storage. This is why the scale-out approach, as opposed to scale-up, is becoming more popular in mainstream architectures. It is also typically true that, even with hyper-converged architectures, a failure will cause some interruption in service to the customer or user. By using smaller building blocks, fewer people are affected when that inevitable failure occurs.
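The cost comparison above can be worked through explicitly. This is a minimal sketch in Python using the prices from the example; the function names `n_plus_one_cost` and `idle_fraction` are hypothetical helpers introduced only for illustration.

```python
def n_plus_one_cost(nodes_for_full_load, cost_per_node, spares=1):
    """Total purchase cost when `spares` extra nodes are kept for failover."""
    return (nodes_for_full_load + spares) * cost_per_node


def idle_fraction(nodes_for_full_load, spares=1):
    """Share of total capacity that sits unused during normal operations."""
    return spares / (nodes_for_full_load + spares)


large = n_plus_one_cost(2, 50_000)   # 2 large nodes + 1 spare = 3 nodes
small = n_plus_one_cost(10, 10_000)  # 10 small nodes + 1 spare = 11 nodes

print(large, small)                  # 150000 110000
print(round(idle_fraction(2), 3))    # 0.333
print(round(idle_fraction(10), 3))   # 0.091
```

With large nodes, roughly a third of the purchased capacity is held in reserve; with small nodes it is about 9%, which is where the $40K difference comes from.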
Turn-key Hybrid Cloud For Business
At RackTop, we designed a cloud that can handle workloads and failures within data centers and across geographies. Drawing on our experience, we have built a cloud that is robust and that our customers can leverage in a unique hybrid architecture that is simpler to manage and faster to implement than the alternatives. We counsel our clients on how to architect their IT services so that they can take advantage of the redundancy, resiliency, and availability within the technology. It is incumbent on every IT professional to implement an architecture and strategy that leverages the strengths of the technology and mitigates the risks. That requires a top-down approach in which the business need drives the technology solution. We have successfully implemented thousands of services at varying availability levels to meet the needs of clients both operationally and financially.