At 9:41am PDT on July 20th, something strange started happening with Amazon's Simple Storage Service (S3). The service, used by hundreds of thousands of users around the globe and by millions more through end-user web applications, was no longer responding. It was later determined that servers within Amazon S3 were having problems communicating with each other. In responding to this incident, Amazon for the first time shed some light on S3's inner workings, including the use of a gossip protocol that quickly spreads server state information throughout the S3 system.
According to Amazon's public statement, this gossip protocol allows Amazon S3 to quickly route around failed or unreachable servers, among other things. When one server connects to another as part of processing a customer's request, it starts by gossiping about the system state. Only after gossip is completed will the server send along the information related to the customer request. On that day, Amazon S3 began to see a large number of servers that were spending almost all of their time gossiping and a disproportionate number of servers that had failed while gossiping. To fix the problem, Amazon had to perform a full system restart.
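To make the gossip-before-request pattern concrete, here is a minimal sketch in Python. It is not Amazon's implementation, which has not been published. The `Server` class, the versioned `view` dictionary and the "highest version wins" merge rule are all assumptions made for illustration.

```python
class Server:
    """Hypothetical server that gossips before handling a request."""

    def __init__(self, name):
        self.name = name
        # Assumed state format: peer name -> (version, status).
        self.view = {name: (1, "up")}

    def mark(self, peer, status):
        """Record a new local observation about a peer."""
        version, _ = self.view.get(peer, (0, "unknown"))
        self.view[peer] = (version + 1, status)

    def gossip_with(self, other):
        """Merge both views; the entry with the higher version wins."""
        merged = dict(self.view)
        for peer, entry in other.view.items():
            if peer not in merged or entry[0] > merged[peer][0]:
                merged[peer] = entry
        self.view = dict(merged)
        other.view = dict(merged)

    def send_request(self, other, payload):
        """Gossip first; only then pass along the customer request."""
        self.gossip_with(other)
        return other.handle(payload)

    def handle(self, payload):
        return f"{self.name} handled {payload}"


a, b = Server("a"), Server("b")
a.mark("c", "down")                  # a has noticed that c is unreachable
reply = a.send_request(b, "GET key")
# b now knows c is down, and learned it before it saw the request
```

Because every request waits on `gossip_with`, a server that spends almost all of its time gossiping has very little time left for customer requests, which matches the symptom described in the statement.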
This raises an interesting question about the use of federated network protocols within cloud services. At eNomaly we have been big fans of XMPP for federated, multi-cloud communications (Wide Area Cloud) within our Enomalism cloud platform. XMPP is interesting because it natively solves a number of federation problems within a tried and tested framework. One of the biggest benefits of a gossip protocol lies in the robust spread of information and the exponential rate at which it shares that information across a large number of machines.
One example, provided by Wikipedia, is a search in a network of 25,000 machines: a gossip protocol can find the best match after about 30 rounds of gossip, 15 to spread the search string and 15 more to discover the best match. A gossip exchange could occur as often as once every tenth of a second without imposing undue load, so this form of network search could cover a big data center in about 3 seconds.
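The arithmetic behind those figures can be checked with a short Python sketch. If each round roughly doubles the number of machines that have heard the rumor, reaching 25,000 machines takes about log2(25,000) ≈ 14.6, or 15 rounds. The simulation below is an assumption on my part rather than anything Wikipedia or Amazon specifies: it uses a simple synchronous push-pull model in which every machine contacts one random peer per round, and the function names are made up for illustration.

```python
import math
import random


def doubling_rounds(n):
    """Rounds needed if the informed population doubles each round."""
    return math.ceil(math.log2(n))


def gossip_rounds(n, seed=0):
    """Simulate push-pull gossip; return rounds until all n nodes know."""
    rng = random.Random(seed)
    informed = [False] * n
    informed[0] = True          # one machine starts with the rumor
    count = 1
    rounds = 0
    while count < n:
        snapshot = informed[:]  # decisions use the state at round start
        for node in range(n):
            peer = rng.randrange(n)
            if snapshot[node] or snapshot[peer]:
                # After an exchange, both ends know the rumor.
                for member in (node, peer):
                    if not informed[member]:
                        informed[member] = True
                        count += 1
        rounds += 1
    return rounds


machines = 25_000
spread = doubling_rounds(machines)      # 15
total_rounds = 2 * spread               # spread the query, then collect
exchanges_per_second = 10               # one exchange every tenth of a second
search_seconds = total_rounds / exchanges_per_second   # 3.0

simulated = gossip_rounds(machines, seed=1)
print(spread, total_rounds, search_seconds, simulated)
```

In this model the simulated spread usually completes in fewer rounds than the doubling estimate, because machines can both push and pull in the same round. The exact count depends on the random seed and on the gossip variant chosen.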
I wonder what others are doing to address federation issues within large-scale cloud deployments. And how can we avoid a full system reboot in a worst-case scenario?