ElasticVapor: Amazon's S3 Gossip Protocol

Sunday, July 27, 2008

At 9:41am PDT on July 20th something strange started happening with Amazon's Simple Storage Service (S3). The service, used by hundreds of thousands around the globe and by millions more through end-user web applications, was no longer responding. Later it was determined that servers within Amazon S3 were having problems communicating with each other. In responding to this incident, Amazon for the first time shed some light on its inner workings, including the use of a gossip protocol which quickly spreads server state information throughout the S3 system.

According to their public statement, this gossip protocol allows Amazon S3 to quickly route around failed or unreachable servers, among other things. When one server connects to another as part of processing a customer's request, it starts by gossiping about the system state. Only after gossip is completed will the server send along the information related to the customer request. On that day Amazon S3 began to see a large number of servers that were spending almost all of their time gossiping and a disproportionate number of servers that had failed while gossiping. In order to fix the problem they needed to perform a full system restart.
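To make the "gossip first, request second" ordering concrete, here is a minimal Python sketch. It is not Amazon's implementation, which has not been published; the `Node` class, its versioned `view` table, and the "higher version wins" merge rule are all assumptions made purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical server that keeps a versioned view of peer health."""
    name: str
    # peer name -> (version, status); a higher version is treated as newer
    view: dict = field(default_factory=dict)

    def observe(self, peer, status):
        """Record a locally observed status change for a peer."""
        version, _ = self.view.get(peer, (0, None))
        self.view[peer] = (version + 1, status)

    def gossip_with(self, other):
        """Exchange state both ways; the newer version of each entry wins."""
        merged = dict(self.view)
        for peer, entry in other.view.items():
            if peer not in merged or entry[0] > merged[peer][0]:
                merged[peer] = entry
        self.view = dict(merged)
        other.view = dict(merged)

    def send_request(self, other, payload):
        """Gossip first, and only then pass along the customer request."""
        self.gossip_with(other)
        return other.handle(payload)

    def handle(self, payload):
        return f"{self.name} handled {payload}"


a, b = Node("a"), Node("b")
a.observe("c", "unreachable")          # a notices that c is down
reply = a.send_request(b, "GET key")   # b learns about c before serving
```

After the call, `b.view` holds the entry for `c` even though `b` never contacted `c` itself. Because the gossip step sits in front of every request, a node that gets stuck gossiping never reaches `handle`, which is roughly the failure mode the statement describes.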

This brings up an interesting question about the use of federated network protocols within cloud services. At eNomaly we have been big fans of using XMPP for federated, multi-cloud communications (Wide Area Cloud) within our Enomalism cloud platform. XMPP is interesting because it natively solves a number of federation problems within a tried and tested framework. One of the biggest benefits of a gossip protocol lies in the robust spread of information and the exponential rate at which it shares that information across a large number of machines.

One example, provided by Wikipedia, is a network of 25,000 machines, in which a gossip-based search can find the best match after about 30 rounds of gossip: 15 to spread the search string and 15 more to discover the best match. A gossip exchange could occur as often as once every tenth of a second without imposing undue load, hence this form of network search could search a big data center in about 3 seconds.
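The arithmetic behind those figures can be checked with a short Python sketch. It assumes an idealized model in which the number of informed machines doubles every round; real randomized gossip wastes some exchanges on machines that already know, so it needs somewhat more rounds. The function name `rounds_to_reach` is hypothetical.

```python
def rounds_to_reach(n_machines):
    """Rounds until everyone is informed if the informed set doubles each round."""
    informed = 1
    rounds = 0
    while informed < n_machines:
        informed *= 2
        rounds += 1
    return rounds


n = 25_000
spread_rounds = rounds_to_reach(n)   # spread the search string
total_rounds = 2 * spread_rounds     # plus the same again to collect the best match
rounds_per_second = 10               # one exchange every tenth of a second
seconds = total_rounds / rounds_per_second

print(spread_rounds, total_rounds, seconds)
```

Since 2^14 = 16,384 is below 25,000 and 2^15 = 32,768 is above it, the spread takes 15 rounds, giving 30 rounds in total and about 3 seconds at ten rounds per second.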

I wonder what others are doing to address federation issues within large-scale cloud deployments. And how can we avoid a full system reboot in a worst-case scenario?

Reuven Cohen ~ @ruv

An instigator, part time provocateur, bootstrapper, amateur cloud lexicographer, and purveyor of random thoughts, 140 characters at a time.

Reuven is an early innovator in the cloud computing space as the founder of Toronto-based Enomaly in 2004 (acquired by Virtustream in 2012). Enomaly was among the first to develop a self-service infrastructure as a service (IaaS) platform (ECP), circa 2005, as well as SpotCloud (2011), the first commodity-style cloud computing spot market.

Today he leads Citrix (NASDAQ: CTXS) world wide advocacy efforts with a particular focus on increasing the volume, reach and influence of Citrix's extensive portfolio of technology solutions used by more than 260,000 customers and 100 million end users across the globe.

Reuven writes "The Digital Provocateur" column for Forbes Magazine. He is the co-founder of CloudCamp (100+ cities around the globe), an unconference where early adopters of cloud computing technologies exchange ideas and the largest of the 'barcamp' style of events. He is also the co-host of the DigitalNibbles Podcast, sponsored by Intel.
