ElasticVapor: Offline Cloud: Google says sorry for outage

It appears to be the summer of the cloud outage. After a significant Gmail outage earlier this month, Google has come out with a number of improvements to their customer service and SLA. In an interesting turn of events, they seem to be taking a page from the Amazon Web Services playbook by offering a cloud dashboard to provide users with up-to-the-minute system status information. It's nice to see Google starting to pay attention to their "paying customer" base.

I should also note Microsoft has done a particularly good job with their new cloud dashboard.

Here is the email Google sent to "paying" Google Apps users.

We're committed to making Google Apps Premier Edition a service on which your organization can depend. During the first half of August, we didn't do this as well as we should have. We had three outages - on August 6, August 11, and August 15. The August 11 outage was experienced by nearly all Google Apps Premier users while the August 6 and 15 outages were minor and affected a very small number of Google Apps Premier users. As is typical of things associated with Google, these outages were the subject of much public commentary.

Through this note, we want to assure you that system reliability is a top priority at Google. When outages occur, Google engineers around the world are immediately mobilized to resolve the issue. We made mistakes in August, and we're sorry. While we're passionate about excellence, we can't promise you a future that's completely free of system interruptions. Instead, we promise you rapid resolution of any production problem; and more importantly, we promise you focused discipline on preventing recurrence of the same problem.

Given the production incidents that occurred in August, we'll be extending the full SLA credit to all Google Apps Premier customers for the month of August, which represents a 15-day extension of your service. SLA credits will be applied to the new service term for accounts with a renewal order pending. This credit will be applied to your account automatically so there's no action needed on your part.

We've also heard your guidance around the need for better communication when outages occur. Here are three things that we're doing to make things better:

1. We're building a dashboard to provide you with system status information. This dashboard, which we aim to make available in a few months, will enable us to share the following information during an outage:

a. A description of the problem, with emphasis on user impact. Our belief is during the course of an outage, we should be singularly focused on solving the problem. Solving production problems involves an investigative process that's iterative. Until the problem is solved, we don't have accurate information around root cause, much less corrective action, that will be particularly useful to you. Given this practical reality, we believe that informing you that a problem exists and assuring you that we're working on resolving it is the useful thing to do.

b. A continuously updated estimated time-to-resolution. Many of you have told us that it's important to let you know when the problem will be solved. Once again, the answer is not always immediately known. In this case, we'll provide regular updates to you as we progress through the troubleshooting process.

2. In cases where your business requires more detailed information, we'll provide a formal incident report within 48 hours of problem resolution. This incident report will contain the following information:

a. business description of the problem, with emphasis on user impact;
b. technical description of the problem, with emphasis on root cause;
c. actions taken to solve the problem;
d. actions taken or to be taken to prevent recurrence of the problem; and
e. time line of the outage.

3. In cases where your business requires an in-depth dialogue about the outage, we'll support your internal communication process through participation in post-mortem calls with you and your management team.

Once again, thanks for your continued support and understanding.

Sincerely,
The Google Apps Team

Wednesday, August 27, 2008

Reuven Cohen ~ @ruv