I've been in New York City for an astounding 16 meetings in two days. My meetings ranged from a show-and-tell with a few cloud startups to VIP passes to the Kanye West concert last night at Madison Square Garden. (Yes, cloud computing geeks count as VIPs now.) It certainly has been an interesting few days.
A particularly interesting discussion earlier today was with the director of grid infrastructure for a major Wall Street bank. The conversation ranged from network optimization and the pros and cons of map/reduce to the importance of utilization. During our discussion I couldn't help but think that the traditional single-tenant grid infrastructure is dead and that the future lies in the use of flexible and adaptive compute clouds.
Why? It's all about utilization rates, which don't seem to paint an accurate picture of a grid's computational performance. Typically, when a bank attempts to justify why virtualization isn't useful in its grid deployments, it points to its utilization numbers. The common perception is that if the grid is running at 95% utilization, then virtualization isn't going to improve overall performance, so why bother? But it would seem that utilization numbers don't effectively show the overall system efficiencies or, more importantly, the inefficiencies. What they do seem to show is the utilization of CPU resources; they do little to address areas such as network shaping and I/O optimization, which appear to have a dramatic impact on overall grid performance.
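To make the gap concrete, here is a rough Python sketch of how a headline utilization number and the share of time spent on actual computation can diverge. It assumes the reported utilization counts a core as busy for as long as a job holds it, including time spent waiting on the network or disk. The `JobTiming` class, both function names, and all of the timings are hypothetical, made up purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class JobTiming:
    """Hypothetical per-job timings, in seconds."""
    compute_s: float  # time spent doing real CPU work
    network_s: float  # time spent waiting on the network
    io_s: float       # time spent waiting on disk

    @property
    def held_s(self) -> float:
        # Total time the job occupies its core, waiting included.
        return self.compute_s + self.network_s + self.io_s


def reported_utilization(jobs, cores, window_s):
    """Fraction of core-time occupied by jobs (the headline number)."""
    return sum(j.held_s for j in jobs) / (cores * window_s)


def compute_efficiency(jobs, cores, window_s):
    """Fraction of core-time spent on actual computation."""
    return sum(j.compute_s for j in jobs) / (cores * window_s)


# Assumed numbers: four jobs on four cores over a 10 second window.
jobs = [JobTiming(compute_s=6.0, network_s=2.5, io_s=1.0) for _ in range(4)]
utilization = reported_utilization(jobs, cores=4, window_s=10.0)
efficiency = compute_efficiency(jobs, cores=4, window_s=10.0)
print(f"reported utilization: {utilization:.0%}")
print(f"compute efficiency:   {efficiency:.0%}")
```

With these made-up timings the grid reports 95% utilization while only 60% of the core-time goes to computation; the rest is network and I/O wait that the headline number hides.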
One of the more exciting aspects of cloud / virtualized grid deployments is the way you can consolidate or cluster workloads into parallel per-machine processes. Sometimes it makes more sense to put 4 VMs on a quad-core machine than it does to spread them across 4 physical machines. This can be particularly important in the rendering of many smaller jobs that relate to one another. Think risk analysis, where a big limitation may be how quickly you can reassemble the completed jobs. The improvement per job may be only a few milliseconds or less, but the savings from consolidating a job onto multiple VMs on a single server could be a really big deal when multiplied across a grid of 35,000 machines. This type of optimization could mean seconds off the overall risk analysis time and potentially millions of dollars in new investment opportunities.
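The consolidation idea can be sketched in a few lines of Python. This is only a toy greedy placement, not how any particular grid scheduler works: `spread`, `consolidate`, `hosts_per_group`, the job names, and the host names are all hypothetical. It compares spreading related jobs across machines with packing them onto the cores of one machine, and counts how many hosts the results must be gathered from.

```python
from collections import defaultdict


def spread(jobs, hosts):
    """Round-robin placement that ignores how jobs relate to each other."""
    return {job_id: hosts[i % len(hosts)] for i, (job_id, _) in enumerate(jobs)}


def consolidate(jobs, hosts, cores_per_host):
    """Greedy placement that keeps related jobs on the same host when it can."""
    groups = defaultdict(list)
    for job_id, group in jobs:
        groups[group].append(job_id)

    free = {h: cores_per_host for h in hosts}
    placement = {}
    for members in groups.values():
        for job_id in members:
            # Hosts already running a sibling job that still have a free core.
            sibling_hosts = [placement[m] for m in members
                             if m in placement and free[placement[m]] > 0]
            if sibling_hosts:
                host = sibling_hosts[0]
            else:
                open_hosts = [h for h in hosts if free[h] > 0]
                if not open_hosts:
                    raise RuntimeError("no free cores left")
                # Start the group on the emptiest host so siblings can follow.
                host = max(open_hosts, key=lambda h: free[h])
            placement[job_id] = host
            free[host] -= 1
    return placement


def hosts_per_group(jobs, placement):
    """Number of distinct hosts each group's results must be gathered from."""
    seen = defaultdict(set)
    for job_id, group in jobs:
        seen[group].add(placement[job_id])
    return {group: len(h) for group, h in seen.items()}


# Assumed workload: two groups of four related jobs, four quad-core hosts.
jobs = [(f"a{i}", "a") for i in range(4)] + [(f"b{i}", "b") for i in range(4)]
hosts = ["h0", "h1", "h2", "h3"]

spread_span = hosts_per_group(jobs, spread(jobs, hosts))
packed_span = hosts_per_group(jobs, consolidate(jobs, hosts, cores_per_host=4))
print("spread:      ", spread_span)
print("consolidated:", packed_span)
```

In the spread case each group's results come back from four machines over the network; in the consolidated case each group sits on a single quad-core host, so reassembly can stay local. Whether that actually saves milliseconds depends on the workload, which is exactly the kind of thing worth measuring.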
The problem with traditional grid workload schedulers is that they don't distinguish between a physical and a virtual machine. The big opportunity for grid and cloud computing is not just the ability to optimize for scale, but to adaptively optimize for system metrics you never knew you had, until now.