ElasticVapor: Failure as a Service

This weekend we had one of those "I should have known better" moments. For the last few years we've hosted our primary and secondary DNS servers at The Planet. Around 5pm on Saturday our data center literally blew up. Even though most of our application servers are hosted at Amazon EC2 or in-house, this one relatively minor point of failure managed to take down our entire IT infrastructure. We had assumed that the chances of both DNS servers going offline at the same time were slim, and up until this weekend that assumption had held.

This disaster is especially difficult for me since I spend my days pitching the merits of geographically redundant cloud computing, which I call "failure as a service". The concept goes like this: if you assume you may lose any of your servers at any point in time, you'll design a more fault-tolerant environment. For us that means making sure our application components are always replicated on more than one machine, preferably geographically dispersed. This way we can lose groups of VMs, physical machines, data centers, or whole geographic regions without taking down the overall cloud. In a lot of ways this approach is similar to the architecture of a P2P network, or even a modern botnet, both of which rely heavily on decentralized command and control.
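To make the idea concrete, here is a minimal sketch of the kind of audit I have in mind: list every component, note where each replica lives, and flag anything that sits entirely in one place. The component names, host names and site labels below are made up for illustration; only the fact that both name servers lived at The Planet comes from our real setup.

```python
def single_site_components(deployment):
    """Return the components whose replicas all live in a single site."""
    at_risk = []
    for component, replicas in deployment.items():
        sites = {site for _host, site in replicas}
        if len(sites) < 2:
            at_risk.append(component)
    return sorted(at_risk)


def survivors_after_losing(deployment, lost_site):
    """Return the components that keep at least one replica if a site is lost."""
    return sorted(
        component
        for component, replicas in deployment.items()
        if any(site != lost_site for _host, site in replicas)
    )


# Hypothetical inventory: component -> list of (host, site) replicas.
deployment = {
    "web": [("web-1", "ec2"), ("web-2", "in-house")],
    "database": [("db-1", "ec2"), ("db-2", "in-house")],
    "dns": [("ns1", "the-planet"), ("ns2", "the-planet")],
}

print(single_site_components(deployment))                # ['dns']
print(survivors_after_losing(deployment, "the-planet"))  # ['database', 'web']
```

Two replicas are not enough on their own; what matters is that they do not share a failure domain.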

As early users of Amazon EC2 we quickly learned about failure. We would routinely lose EC2 instances, and it became almost second nature to design for this type of transient operating environment. To make matters worse, for a long time EC2 had no persistent storage available: if you lost an instance, its data was lost too. So we created our own Amazon S3 based disaster recovery system, which we called ElasticDrive.

ElasticDrive allows us to mount Amazon S3 as a logical block device that looks and acts like a local storage system. This means we always have a "worst case scenario" remote backup for exactly this type of event, and luckily for us we lost no data because of it. What we did lose was time: our time on a Sunday afternoon, fixing something that shouldn't have even been an issue.

In our case our application servers, databases and content had been designed to be distributed, but our key point of failure was our use of a single data center to host both of our name servers. When the entire data center went offline, so did our DNS servers, and so did our 200+ domains. If we had made one small but critical change (adding a redundant remote name server), our entire IT infrastructure would have continued to work uninterrupted. Instead, when I awoke Sunday morning I found, to my surprise, that everything from email to our web sites to even our network monitoring system had stopped working.
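A check like the following would have caught our mistake. It is only a rough sketch: it groups name server addresses by network prefix and complains if they all land in the same one. The host names and addresses are placeholders from the reserved documentation ranges, the /24 prefix is my own assumption, and a shared prefix is only a crude stand-in for "same data center" (two different networks can still sit in the same building).

```python
import ipaddress


def networks_for(nameservers, prefix_len=24):
    """Group name servers by the network (of the given prefix length) they sit in."""
    groups = {}
    for name, ip in nameservers.items():
        # strict=False lets us pass a host address and get its enclosing network.
        network = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        groups.setdefault(str(network), []).append(name)
    return groups


def is_network_redundant(nameservers, prefix_len=24):
    """True if the name servers are spread over at least two networks."""
    return len(networks_for(nameservers, prefix_len)) >= 2


# Placeholder data: both name servers in one network, as ours were.
before = {
    "ns1.example.com": "192.0.2.10",
    "ns2.example.com": "192.0.2.11",
}
# The "one small but critical change": a third name server somewhere else.
after = dict(before, **{"ns3.example.net": "198.51.100.5"})

print(is_network_redundant(before))  # False
print(is_network_redundant(after))   # True
```

In practice the address list would come from resolving the NS records for each domain; I've left that step out so the sketch runs without touching the network.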

I should also note that Amazon has recently worked to overcome some of the early limitations of EC2 with the inclusion of persistent storage options, as well as something they call Amazon EC2 Availability Zones. They describe availability zones as: "The ability to place instances in multiple locations. Amazon EC2 locations are composed of regions and availability zones. Regions are geographically dispersed and will be in separate geographic areas or countries. Currently, Amazon EC2 exposes only a single region. Availability zones are distinct locations that are engineered to be insulated from failures in other availability zones and provide inexpensive, low latency network connectivity to other availability zones in the same region. Regions consist of one or more availability zones. By launching instances in separate availability zones, you can protect your applications from failure of a single location."
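Using zones this way can be as simple as dealing instances out round-robin, so that no single location holds all the copies of anything. The sketch below only works out the placement; it does not call the EC2 API, and the instance and zone names are invented for the example.

```python
from itertools import cycle


def spread_across_zones(instances, zones):
    """Assign instances to zones round-robin and return zone -> instances."""
    if not zones:
        raise ValueError("at least one zone is required")
    placement = {zone: [] for zone in zones}
    # cycle() repeats the zone list for as long as there are instances to place.
    for instance, zone in zip(instances, cycle(zones)):
        placement[zone].append(instance)
    return placement


# Hypothetical names, for illustration only.
placement = spread_across_zones(["app-1", "app-2", "app-3"], ["zone-a", "zone-b"])
print(placement)  # {'zone-a': ['app-1', 'app-3'], 'zone-b': ['app-2']}
```

With at least two zones and at least two instances, losing any one zone still leaves something running.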

Well, Amazon, if you were looking for a "use case", look no further, because I'm your guy.

I've learned a valuable, if painful, lesson. No matter how much planning you do, nothing beats a geographically redundant configuration.
----
If anyone is interested in learning more about the issues at The Planet (9,000 servers offline):
http://tech.slashdot.org/article.pl?sid=08/06/01/1715247

Or about EC2 Availability Zones:

http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1347

Monday, June 2, 2008

Failure as a Service - Cloud Redundancy

Reuven Cohen ~ @ruv