Wednesday, August 27, 2008

Major Storage Issues at FlexiScale

I've been holding off on reporting this (sorry, Tony), but now that it has become public knowledge I feel it's appropriate to post. Over the last 24 hours, FlexiScale has been having some serious problems with its storage systems. Typically these types of problems relate to some runaway process, as in Amazon S3's outage last month, where their gossip protocol was to blame. But in the case of FlexiScale the problem appears to be human error, made worse by a poor disaster recovery process. It seems that one of their administrators mistakenly deleted one of the main storage volumes. Now, more than 12 hours later, FlexiScale users have read-only access to the storage platform but no read-write access. Simply put, they have to rebuild their arrays, but don't have the space to do so.

Here's what Tony Lucas of FlexiScale had to say.

As some of you are aware, we have been having issues with I/O (disk speed) in recent weeks. We identified short term and long term measures to eliminate these problems. The short term measures involved reorganising how data was stored across our storage network in a more efficient manner, and the long term measure was to increase the overall I/O capacity of the platform.

As a preparatory step to adding additional capacity, one of our engineers was reorganising the data structure on the storage network and, whilst cleaning up the snapshots we use as our backup process, accidentally deleted one of the main storage volumes. This caused an immediate outage to a large amount of our customers.

We immediately took action to take the entire disk structure offline (which caused the remaining customers to be taken offline) as it was the only way to preserve the integrity of the data on the system. Work then commenced with our storage vendor to restore this data.

Although we have now successfully gained read-only access to everyone's data, a bug in the storage platform's operating system has prevented us from providing read-write access to it. This was discovered at 11pm last night, just when we thought we were about to bring the entire disk structure back online.

After consulting with our storage vendor it was agreed the most sensible option would be to copy the entire volume to a new disk structure (still maintaining its integrity and structure), from where we could re-mount it correctly. Unfortunately, due to its size, we didn't have spare capacity on the platform to create a complete duplicate of it.

An investigation of other ways of restoring the data then was undertaken but all options were considered too risky, and although downtime is a major problem for everyone, we felt the integrity of the data was the most important factor.

The decision was then taken to get additional capacity in from the storage vendor as soon as possible so that we could then increase the capacity to a sufficient level to allow us to copy the volume and successfully restore it. We originally thought we would be able to get this today, but unfortunately it will not arrive until mid-morning tomorrow, although we have done (and will continue to do) everything we can to speed this up.

At this time we are assisting customers who need access to specific files to get this, and we will continue this as long as we can into the night as resources allow.

Tomorrow morning once the storage arrives and is online, we will copy the data across and then begin to restart the entire platform as quickly as possible, but as the system wasn't designed to restart everything at once, this will take time.

We will be offering credits against our SLA, which will be determined once everyone is back up and running, as I'm sure you can appreciate all resources are being focused on that at this moment.

I, and all my staff are well aware of the potential impact this will be causing to you our customers, and we are doing everything we can to help in that respect. We will also be undertaking an investigation to ensure additional safeguards are put in place to prevent this happening again.

Sincerely,
Tony Lucas
Chief Executive Officer
XCalibre/FlexiScale
