Friday, January 15, 2010

Oversubscribing the Cloud

There's been a bit of a debate raging over whether or not Amazon EC2 has been oversubscribed and is suffering from performance problems because of it. The discussion started when Alan Williamson wrote a blog post on Tuesday saying he was experiencing growing performance problems while running a large EC2 deployment for one of his customers. The post accused Amazon of oversubscribing their environment, which meant he needed to buy larger instances to maintain the same level of performance, in turn increasing his client’s costs.

The debate hits at the heart of the complexities involved in trying to deploy cost-effective, revenue-generating, public-use infrastructure-as-a-service platforms. I've been saying this for a while -- one of the hardest parts of creating a public cloud service is estimating your customers' demand while trying to remain competitive, which really means having prices that are on par with or better than Amazon EC2.

Amazon was quick to respond saying “We do not have over-capacity issues. -- When customers report a problem they are having, we take it very seriously. Sometimes this means working with customers to tweak their configurations or it could mean making modifications in our services to assure maximum performance.”

The problem with Amazon's vague response is that it does very little to address a potentially major issue. In a sense they're saying we'll help you (if you're big enough) while providing no real insight into how their cloud is built, deployed or run. They do imply there are issues, but not ones relating to over-capacity; the fault lies with how their customers are deploying on EC2, not with how the cloud itself is deployed or run. On one hand Amazon has stated they don't have "over-capacity issues", but on the other hand they are far from saying that they don't oversubscribe their environment. Let's be realistic, how else do you expect Amazon to achieve their ridiculously low price points? The very fact they can offer EC2 at such a low cost is to me indirect proof they do oversubscribe their environment. And hell, why not oversubscribe? In fact I'll go as far as to say that it is a good thing.

Amazon isn't alone in using oversubscribing or overbooking techniques for their service. The concept is common within a variety of industries where multiple users share a common resource. These resources can range from hotel rooms to airline seats to more technical commodities such as bandwidth, storage, shared servers or even energy. The oversubscription model depends on the ratio of the commodity allocated to the commodity actually available, which in turn is estimated on a per-user / per-usage basis. The key is to have a well-defined model which accounts for a standard deviation (or how much variation there is from the "average" usage). Sized with enough margin, this typically preserves the quality of service for any particular user. Underlying the oversubscription model is the fact that statistically few users will attempt to utilize their full allotment of resources simultaneously. This allows you to offer more resources than you actually have available. The concept applies well to public cloud infrastructure environments, and is probably the most important aspect of any competitive pricing model.
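To make the statistics a little more concrete, here is a minimal sketch of that kind of model. It is purely illustrative and says nothing about how Amazon sizes EC2. The assumptions are mine: every user's demand is independent, it is expressed as a fraction of one full allotment, and the aggregate is approximated by a normal distribution. The function names and the sample numbers (1,000 users, 25% average usage, a standard deviation of 0.2, a three-sigma margin) are hypothetical.

```python
import math
from statistics import NormalDist


def required_capacity(n_users, mean_use, std_use, k_sigma):
    """Capacity, in units of one full allotment, that covers the expected
    aggregate demand plus k_sigma standard deviations of headroom.

    Assumes independent users, so the aggregate standard deviation grows
    with sqrt(n_users) rather than with n_users.
    """
    return n_users * mean_use + k_sigma * math.sqrt(n_users) * std_use


def overflow_probability(n_users, mean_use, std_use, capacity):
    """Approximate chance that aggregate demand exceeds capacity,
    using a normal approximation of the summed per-user demand."""
    total = NormalDist(mu=n_users * mean_use,
                       sigma=math.sqrt(n_users) * std_use)
    return 1.0 - total.cdf(capacity)


def oversubscription_ratio(n_users, capacity):
    """Allotments sold per unit of capacity actually deployed."""
    return n_users / capacity


if __name__ == "__main__":
    users, mean_use, std_use = 1000, 0.25, 0.2   # hypothetical inputs
    capacity = required_capacity(users, mean_use, std_use, k_sigma=3)
    print(f"capacity needed: {capacity:.1f} allotments")
    print(f"oversubscription ratio: {oversubscription_ratio(users, capacity):.2f}x")
    print(f"overflow probability: "
          f"{overflow_probability(users, mean_use, std_use, capacity):.4%}")
```

The independence assumption is the weak point. If customers' peaks are correlated (everyone gets busy at the same hour), the real variation is larger than this sketch suggests and the safe ratio is lower.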

But there are problems with the oversubscription model. The problem occurs because there seems to be a non-linear relationship between the amount of capacity you have and the amount of customer demand. Or to put it another way, just adding more servers as customer demand increases doesn't automatically guarantee the same level of service across your cloud deployment, something Amazon's recent dramatic growth & performance issues seem to prove.

This brings us to the concept of quotas. Have you ever wondered why, when you sign up for an "unlimited" cloud infrastructure service such as EC2, you are given an initial allotment of servers? For EC2 it's something like 20 instances. The reason is simple: the hardest part of an oversubscription model is capacity planning. That is, a quota system is an extremely important part of any cloud capacity / resource planning you will be doing when launching and running your own public cloud service.

As an example, for the Enomaly ECP our quota system was developed to provide a predetermined level of deviation across a real or hypothetical pool of customers. Yes, it was developed to allow our hosting / cloud service provider customers to oversubscribe their environments. But it also allows for a variety of pricing & costing schemes to be implemented: models such as tiers of usage, quality of service tiers, and even the ability to provide additional quota increases for "good behavior", like when you receive an automatic increase to the credit limit on your credit card. Without this type of quota functionality, it is practically impossible to adequately run a revenue-positive public cloud service.
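For readers who think better in code, the quota idea can be sketched roughly as follows. To be clear, this is not how ECP or EC2 implements quotas; it is a toy model, and every name in it (`Account`, `review_quota`, the three-period rule, the step of 10, the ceiling of 100) is a hypothetical chosen for illustration. The point is only that a hard per-account cap bounds the worst-case demand a provider has to plan for, and that the cap can be raised gradually for accounts in good standing.

```python
from dataclasses import dataclass


@dataclass
class Account:
    name: str
    quota: int             # maximum instances this account may run
    running: int = 0       # instances currently running
    good_periods: int = 0  # consecutive billing periods in good standing (hypothetical metric)


def can_launch(account, count):
    """True if launching `count` more instances stays within the quota."""
    return account.running + count <= account.quota


def launch(account, count):
    """Launch instances, refusing any request that would exceed the quota."""
    if not can_launch(account, count):
        raise ValueError(f"{account.name}: quota of {account.quota} exceeded")
    account.running += count


def review_quota(account, periods_required=3, step=10, ceiling=100):
    """Raise the quota by `step` after enough good periods, up to `ceiling`,
    much like an automatic credit-limit increase. Returns the current quota."""
    if account.good_periods >= periods_required and account.quota < ceiling:
        account.quota = min(account.quota + step, ceiling)
        account.good_periods = 0  # start counting toward the next increase
    return account.quota


def worst_case_demand(accounts):
    """Upper bound on simultaneous instances: every account at its quota."""
    return sum(a.quota for a in accounts)
```

With quotas in place, `worst_case_demand` gives the planner a known ceiling to oversubscribe against, instead of an open-ended "unlimited" promise.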

So the real question we need to ask Amazon is -- are their oversubscription models keeping up with the growth and scale of the underlying platform? Prove it.
