Sunday, March 1, 2009

Navigating the Fog -- Billing, Metering & Measuring the Cloud

It's that dreaded time of the month again, the time that we, the 400,000+ Amazon Web Services customers, await with a mix of anticipation and horror. What I'm talking about is the Amazon Web Services billing statement, sent at the beginning of each month. A surprise every time. In honor of this monthly event, I thought I'd take a minute to discuss some of the hurdles, as well as the opportunities, in billing, metering and measuring the cloud.

I keep hearing that one of the biggest issues facing IaaS users today is a lack of insight into costing, billing and metering. The AWS side of the problem is straightforward enough: unlike some other cloud services, Amazon has decided not to offer any kind of real-time reporting or API for billing on its cloud services (EC2, S3, etc.). There are some reporting features for DevPay and the Flexible Payments Service (Amazon FPS), as well as an Account Activity page, but who has time for a dashboard when what we really want is a real-time API?

To give some background: when Amazon launched S3, and later EC2, the reasoning was fairly straightforward; they were new services still in beta. So, without official confirmation, the word was that a billing API was coming soon. But three years later, there is still no billing API. So I have to ask, what gives?

Other cloud services have done a great job of providing a real-time view of what the cloud is costing you. One of the best examples is GoGrid's myaccount.billing.get API call and widget, which offer a variety of metrics through their open source GoGrid API.
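To make the wish concrete, here is a minimal sketch of what a consumer of a real-time billing API might do with the response. The JSON-like line-item shape below is invented for illustration; neither AWS nor GoGrid is documented here as returning exactly this structure.

```python
# Hypothetical sketch: rolling up line items from a real-time billing API.
# The line-item shape is an assumption made up for this example.

def summarize_billing(line_items):
    """Sum per-service costs from a list of billing line items.

    Each line item is a dict like:
      {"service": "EC2", "usage": 720.0, "unit": "instance-hours", "cost": 72.0}
    """
    totals = {}
    for item in line_items:
        totals[item["service"]] = totals.get(item["service"], 0.0) + item["cost"]
    totals["TOTAL"] = sum(v for k, v in totals.items() if k != "TOTAL")
    return totals

sample = [
    {"service": "EC2", "usage": 720.0, "unit": "instance-hours", "cost": 72.00},
    {"service": "S3",  "usage": 50.0,  "unit": "GB-months",      "cost": 7.50},
    {"service": "S3",  "usage": 1e5,   "unit": "requests",       "cost": 1.00},
]
print(summarize_billing(sample))
```

With an endpoint like this, the month-end "surprise" becomes a dashboard query you can run any hour of the day.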

Billing APIs aside, another major problem remains for most cloud users: a basis for comparing the quality and cost of cloud compute capacity between providers. This brings us to the problem of metering the cloud, which Yi-Jian Ngo at Microsoft pointed out last year. In his post he stated that "Failing to come up with an appropriate yardstick could lead to hairy billing issues, savvy customers tinkering with clever arbitrage schemes and potentially the inability of cloud service providers to effectively predict how much to charge in order to cover their costs."

Yi-Jian Ngo couldn't have been more right in pointing to Wittgenstein's Rule: "Unless you have confidence in the ruler's reliability, if you use a ruler to measure a table, you may as well be using the table to measure the ruler."

A few companies have attempted to define cloud capacity. Notably, Amazon's Elastic Compute Cloud service uses an EC2 Compute Unit as the basis for its EC2 pricing scheme (alongside bandwidth and storage). Amazon states that it uses a variety of measurements to provide each EC2 instance with a consistent and predictable amount of CPU capacity, and that the amount of CPU allocated to a particular instance is expressed in terms of EC2 Compute Units. Amazon explains that it uses several benchmarks and tests to manage the consistency and predictability of performance from an EC2 Compute Unit: one EC2 Compute Unit provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor, which they say is equivalent to an early-2006 1.7 GHz Xeon processor. Amazon makes no mention of how it arrives at these benchmarks, and users of EC2 are given no real insight into the numbers. Currently there are no standards for cloud capacity, and therefore no effective way for users to compare providers and make the best decision for their application demands.
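Even with an opaque unit, you can still do a first-order price comparison by normalizing hourly cost by advertised compute units. The instance figures below are rough 2009-era EC2 numbers used purely for illustration; check current pricing before relying on them.

```python
# Illustrative only: rough 2009-era EC2 figures, not authoritative pricing.
instances = {
    "m1.small":  {"ecu": 1.0, "price_per_hour": 0.10},
    "c1.medium": {"ecu": 5.0, "price_per_hour": 0.20},
}

def cost_per_ecu_hour(spec):
    # Price normalized by advertised compute units -- crude, since the ECU
    # itself is not an open benchmark, but useful for relative comparison.
    return spec["price_per_hour"] / spec["ecu"]

for name, spec in instances.items():
    print(f"{name}: ${cost_per_ecu_hour(spec):.3f} per ECU-hour")
```

The catch, of course, is that this only compares Amazon against Amazon; without an open definition of the unit, the same arithmetic cannot be extended honestly to another provider.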

An idea I suggested in a post last year was to create an open, universal compute unit that could be used for an "apples-to-apples" comparison between cloud capacity providers. My rough concept was a Universal Compute Unit specification and benchmark test, based on integer operations, that could form an approximate indicator of the likely performance of a given virtual application within a given cloud such as Amazon EC2 or GoGrid, or even a virtualized data center such as VMware. One potential point of analysis could be an effective instruction rate derived by multiplying instructions per cycle by clock speed (measured in cycles per second). This could be defined more precisely within the context of both a virtual machine kernel and standard single- and multicore processor types.
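The arithmetic behind the idea can be sketched in a few lines. The baseline here (1 UCU = a 1.7 GHz core sustaining one instruction per cycle) is an assumption invented for this example, not part of any specification.

```python
# Sketch of the proposed Universal Compute Unit (UCU) idea:
# effective throughput = instructions per cycle (IPC) x clock speed.
# Baseline of 1 UCU = 1.7e9 instructions/sec is a made-up reference point.

BASELINE_OPS_PER_SEC = 1.7e9  # hypothetical "1 UCU" reference

def ucu(ipc, clock_hz):
    """Rate a (virtual) core in UCUs relative to the baseline."""
    return (ipc * clock_hz) / BASELINE_OPS_PER_SEC

# A 2.0 GHz virtual core sustaining 1.7 instructions per cycle:
print(round(ucu(1.7, 2.0e9), 2))  # 2.0 on this made-up scale
```

The point is not these particular numbers but that the formula, baseline and benchmark workload would all be published, so any provider's claimed rating could be re-measured independently.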

My other suggestion was a Universal Compute Cycle (UCC), the inverse of the Universal Compute Unit. The UCC would be used when direct access to the system or operating system in the cloud is not available, as with Google's App Engine or Microsoft Azure. The UCC could be based on cycles per instruction, that is, the number of clock cycles that elapse while an instruction executes. This allows an inverse calculation to be performed to determine the UCU value, as well as providing a secondary level of performance evaluation and benchmarking.

I'm not the only one thinking about this. One company trying to address this need is Satori Tech, with a capacity measurement metric they call the Computing Resource Unit ("CRU"). They claim the CRU allows dynamic monitoring of available and used computing capacity on physical servers and virtual pools/instances, uniform comparison of capacity, usage and cost efficiency in heterogeneous computing environments, and abstraction away from operating details for financial optimization. Unfortunately, the CRU is a patented, closed format available only to Satori Tech customers.

And before you say it: I know that the UCU, UCC or CRU could be "gamed" by unsavory cloud providers attempting to pull an "Enron." This is why we would need an auditable specification, including a "certified measurement," to back this kind of cloud benchmarking. A potential avenue is IBM's new "Resilient Cloud Validation" program, which I've come to appreciate lately. (Sorry about my previous lipstick-on-a-pig remarks.) The program will allow businesses that collaborate with IBM on a rigorous, consistent and proven program of benchmarking and design validation to use the IBM "Resilient Cloud" logo when marketing their services. Certification programs of this kind may serve as the basis for a level playing field among cloud providers, although I feel a more impartial body, such as the IEEE, might be better suited to handle the certification process.
