Guest Bloggers: Bill Marvin and Chris Seib, Co-Founders of InstaMed
Last night, I turned on iTunes Match for the first time and streamed music from iCloud while making dinner. Playing music from the cloud worked great, but it made me wonder: what would happen if the cloud went down and my music became unavailable, for five minutes or for five hours? I'd be annoyed and inconvenienced, quickly forgetting both my recent delight and how I used to do things before. The bottom line is that, since only my music would be affected, it wouldn't be a big deal. Today, however, consumers and businesses are moving all kinds of data to the cloud, from MP3s and pictures to mission-critical data. And the cloud is not just being used for data storage and retrieval; it's also being used to support business functions such as CRM, accounting, processing and cash flow. These functions, especially cash flow, are mission critical to any business.
What if an error caused your cloud-based system to go down for an hour? Maybe that's an inconvenience; maybe it means some lost revenue or extra labor costs. But what if it went down for a few days? In most cases, an outage like that would affect your business in a material way. While moving to the cloud greatly enhances the way we use data and conduct business, it also presents new risks to consumers and businesses.
In the following post, my co-founder and CTO, Chris Seib, discusses the best practices every business should follow to ensure "True Availability" when relying on the cloud or on vendor solutions based in the cloud.
– Bill Marvin, President & CEO, InstaMed
Achieving True Availability through Best Practices
In the wake of the data center outages caused by a recent storm in Virginia, along with other data center failures, it's important to recognize what went wrong and which best practices could have prevented long-term disruptions, so that your business can ensure continuity and true availability for its processes and functions. It's easy to underestimate the cost of a critical vendor being down until it is too late. Worse yet, many vendors talk about reliability but may take shortcuts to save costs, which can have a very significant impact on your business. Here are some best practices and tips you can use in discussions with your current or potential vendor partners.
Local High Availability & Fault Tolerance
Most downtime is caused by hardware failures rather than natural disasters. In fact, two to four percent of data-center-grade hard drives can be expected to fail each year (nearly four times the rate manufacturers claim). A private cloud data center must be architected at every layer with this in mind so that these failures cause as little disruption as possible. This is often referred to as High Availability or Fault Tolerance.
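To make the two-to-four percent figure concrete, here is a back-of-the-envelope sketch. It assumes drive failures are independent and every drive shares the same annual failure rate (AFR); the 500-drive fleet is hypothetical, and real rates vary with drive model, age and workload.

```python
# Back-of-the-envelope drive failure estimates.
# Assumptions: failures are independent and every drive has the same
# annual failure rate (AFR). The 500-drive fleet below is hypothetical.


def expected_failures(drive_count: int, afr: float) -> float:
    """Expected number of drives that fail in one year."""
    return drive_count * afr


def prob_any_failure(drive_count: int, afr: float) -> float:
    """Probability that at least one drive fails in one year."""
    return 1.0 - (1.0 - afr) ** drive_count


if __name__ == "__main__":
    fleet = 500  # hypothetical fleet size
    for afr in (0.02, 0.04):
        print(f"AFR {afr:.0%}: ~{expected_failures(fleet, afr):.0f} failures/year, "
              f"P(at least one) = {prob_any_failure(fleet, afr):.4f}")
```

Under these assumptions a 500-drive fleet should expect roughly 10 to 20 drive failures a year, and at least one failure is a near certainty. That is why every layer has to tolerate losing a drive without customer impact.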
Power & Cooling Best Practices
A private cloud data center should have complete power redundancy. In most cases, this means two separate, high-priority feeds from the local power company, battery and generator backups, and high-end electrical equipment to ensure seamless switching between these sources. Many vendors have only a simple, low-end Uninterruptible Power Supply (UPS), which may supply just minutes of backup. It's crucial to have multiple generators backed by fuel supply contracts so that a data center can run indefinitely.
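The layering described above can be expressed as a rough check: batteries must carry the load until the generators start, and the generators then need enough fuel for the rest of the outage. This is a simplified sketch; the `PowerBackup` fields and every number in it are hypothetical, and it assumes contracted fuel deliveries always arrive in time.

```python
from dataclasses import dataclass


@dataclass
class PowerBackup:
    """Hypothetical description of a facility's backup power chain."""
    ups_minutes: float              # battery runtime at the current load
    generator_start_minutes: float  # time for generators to start and take the load
    generators: int                 # number of generators installed
    fuel_hours_on_site: float       # generator runtime on stored fuel
    refuel_contract: bool           # assumption: contracted deliveries arrive in time


def survives_outage(p: PowerBackup, outage_hours: float) -> bool:
    """Rough check: can the backup chain carry the load for the whole outage?"""
    if outage_hours * 60 <= p.ups_minutes:
        return True   # batteries alone ride through a short interruption
    if p.generators == 0:
        return False  # UPS-only site: the load drops when the batteries run out
    if p.ups_minutes < p.generator_start_minutes:
        return False  # batteries die before the generators pick up the load
    if p.refuel_contract:
        return True   # fuel deliveries let the generators run indefinitely
    return p.fuel_hours_on_site >= outage_hours  # limited by stored fuel


if __name__ == "__main__":
    ups_only = PowerBackup(ups_minutes=10, generator_start_minutes=1,
                           generators=0, fuel_hours_on_site=0, refuel_contract=False)
    full = PowerBackup(ups_minutes=15, generator_start_minutes=1,
                       generators=2, fuel_hours_on_site=48, refuel_contract=True)
    print(survives_outage(ups_only, 5), survives_outage(full, 72))  # False True
```

In this sketch a UPS-only site fails any outage longer than its battery runtime, and a generator site without a fuel contract fails once the outage outlasts its stored fuel.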
Adequate cooling is just as critical in a private cloud data center. Cooling systems must be completely redundant and highly fault tolerant. Many data centers have only a single air conditioning unit, which is often insufficient during a heat wave.
- Tip: Ask your vendors: When were your backup power supplies last tested? How much downtime should we expect if the power company has a complete blackout? How long can you keep services up during a complete blackout? Ask them to prove it by sharing test results and allowing you to tour their data center facilities.
Hardware Best Practices
When it comes to data center hardware, the rule of thumb is to always have one more "active" unit than you need (IT people call this N+1). If you need one firewall, you should have two, and they need to be configured for zero-downtime failover.
Furthermore, there should be no single point of failure; every component must be redundant. Storage area networks should have redundant drives, hot spares and multiple controllers. Every layer of the system must be covered, and individual components should be highly available in order to avoid downtime.
Many vendors claim to have redundancy, but they still have single points of failure that can be exposed and cause extended downtime.
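One way to check a redundancy claim is to list every component with the number of units the load actually needs (N) and the number installed, then flag anything short of N+1. The sketch below does exactly that; the inventory and component names are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical inventory format: component -> (units needed to carry the load, units installed)
Inventory = Dict[str, Tuple[int, int]]


def components_without_spare(inventory: Inventory) -> List[str]:
    """Return components that fall short of N+1, i.e. have no spare beyond the load."""
    return sorted(
        name
        for name, (needed, installed) in inventory.items()
        if installed < needed + 1
    )


if __name__ == "__main__":
    inventory = {
        "firewall": (1, 2),
        "core switch": (1, 1),             # single point of failure
        "SAN controller": (1, 2),
        "air conditioning unit": (2, 2),   # no spare during a heat wave
    }
    print(components_without_spare(inventory))  # ['air conditioning unit', 'core switch']
```

The check is only as good as the inventory: a component left off the list, such as a single power distribution path, is exactly the kind of hidden single point of failure the audit is meant to expose.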
Having multiple pieces of hardware is all well and good, but what truly matters is that these pieces are interchangeable with no customer impact, otherwise known as immediate failover. Many vendors claim to have standby servers or equipment, but it can take hours or days for that equipment to come online.
- Tip: Ask your vendors to prove that they have complete redundancy of all components and that they regularly test the failover.
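The difference between immediate failover and a cold standby can be shown with a small sketch: traffic always goes to the first healthy node, so losing the primary requires no manual step, and a drill simply simulates that loss and confirms a standby takes over. The class, node names and health check are hypothetical; in practice this job is done by load balancers or clustering software.

```python
from typing import Callable, List, Optional


class FailoverGroup:
    """Interchangeable nodes listed in order of preference (primary first)."""

    def __init__(self, nodes: List[str], is_healthy: Callable[[str], bool]):
        self.nodes = nodes
        self.is_healthy = is_healthy  # hypothetical health check supplied by the caller

    def active_node(self) -> Optional[str]:
        """Route to the first healthy node; None means a full outage."""
        for node in self.nodes:
            if self.is_healthy(node):
                return node
        return None


def failover_drill(nodes: List[str]) -> bool:
    """Simulate losing the primary and confirm a standby takes over automatically."""
    failed = {nodes[0]}
    group = FailoverGroup(nodes, lambda node: node not in failed)
    return group.active_node() not in (None, nodes[0])


if __name__ == "__main__":
    print(failover_drill(["firewall-a", "firewall-b"]))  # True
    print(failover_drill(["firewall-a"]))                # False: no standby
```

Running a drill like this on a schedule, against real equipment rather than a simulation, is what turns a redundancy claim into evidence.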
At a private cloud data center, it's important to have proactive monitoring and alerting in place, staffed by adequately trained IT professionals who are familiar with the applications and services. This helps ensure that any issue or degradation is identified early and resolved quickly, before customers are affected. Issues will happen; hard drives fail and network problems are common, but in almost all cases there are early warning signs.
- Tip: Ask your vendors to describe their data center monitoring and alerting procedures.
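To illustrate what catching an early warning sign can mean, here is a minimal alerting sketch that raises a warning not only when a metric crosses a threshold but also when it keeps rising below that threshold. The thresholds, the three-sample window and the use of a drive's reallocated-sector count are illustrative assumptions, not settings from any particular monitoring product.

```python
from typing import List


def check_metric(samples: List[float], warn: float, crit: float, window: int = 3) -> str:
    """Classify a monitored metric, flagging steady growth before it hits a threshold."""
    if not samples:
        return "NO DATA"  # a silent sensor is itself worth an alert
    latest = samples[-1]
    if latest >= crit:
        return "CRITICAL"
    if latest >= warn:
        return "WARNING"
    recent = samples[-window:]
    if len(recent) == window and all(b > a for a, b in zip(recent, recent[1:])):
        return "WARNING (rising trend)"  # early warning sign below the threshold
    return "OK"


if __name__ == "__main__":
    # Hypothetical daily reallocated-sector counts for one drive
    print(check_metric([0, 0, 0], warn=10, crit=50))     # OK
    print(check_metric([0, 1, 3, 6], warn=10, crit=50))  # WARNING (rising trend)
    print(check_metric([0, 4, 12], warn=10, crit=50))    # WARNING
```

The trend rule is deliberately simple; its point is that a drive whose error count climbs every day deserves attention before it fails, not after.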
– Chris Seib, Co-Founder & CTO, InstaMed
Click here to read True Availability: Part 2, featuring more best practices for disaster recovery, business continuity and security.