Saturday 22 February 2014

Cloud Users: Don't put your eggs in one basket

In my previous blog post I briefly mentioned that a premature cloud account removal meant that the Drizzle project lost a lot of data, including backups.  I'll go into a bit more detail here so cloud users can learn from our mistakes.  I will try to avoid using cloud puns here as much as possible :)

As with everything in life, clouds fail.  No cloud I have seen so far can claim 100% uptime since reaching GA.  However much projects like OpenStack simplify things for the operators of clouds, they are still complex architectures and things can fail.

But there is one kind of human failure I have seen several times, from multiple cloud vendors, and it is the one that crippled Drizzle: account deletion.

Most of Drizzle's websites and testing framework were running from a cloud account using compute resources.  Backups were made and automatically uploaded to the cloud file storage on the same account for archiving.  This was our mistake (and we have definitely learnt from it).  We knew that at some point in the future the cloud accounts used would be migrated to a different cloud and the current cloud account terminated.  Unfortunately the cloud account used was terminated prematurely.  This meant that all compute instances and file storage were instantly flushed down the toilet.  All our sites and backups were destroyed along with them.

This is not the only time I have seen this happen.  There have been two other instances I know of in the last year where an accidental deletion of a cloud account has meant that all data including backups were destroyed.  Luckily in both those cases the damage was relatively minor.  I actually also lost a web server due to this problem around the same time as Drizzle was hit.

The OpenStack CI team do something quite clever but relatively simple to mitigate these problems and keep running.  They use multiple cloud vendors (last I checked it was HP Cloud and Rackspace).  When your commit is being tested in Jenkins it goes to a cloud compute instance in whichever cloud is available at the time.  So if a vendor goes down for any reason the CI can still continue.
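To make the idea concrete, here is a minimal sketch of "use whichever cloud is available".  It is not how the OpenStack CI is actually implemented; the provider names and health-check functions are hypothetical placeholders that you would replace with real checks against your own vendors.

```python
from typing import Callable, Optional, Sequence, Tuple

# A provider is a (name, health_check) pair. Both are hypothetical here:
# the health check stands in for whatever "is this cloud usable right now?"
# test makes sense for your setup (an API ping, a quota query, etc.).
Provider = Tuple[str, Callable[[], bool]]


def pick_provider(providers: Sequence[Provider]) -> Optional[str]:
    """Return the name of the first provider whose health check passes."""
    for name, is_available in providers:
        try:
            if is_available():
                return name
        except Exception:
            # A health check that blows up counts as "unavailable";
            # move on to the next provider instead of failing the job.
            continue
    # Every provider was down (or the list was empty).
    return None
```

The order of the list acts as your preference order, so the job only moves to the second vendor when the first one is unreachable.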

I highly recommend a few things to any users of the cloud.  You should:

  1. Make regular offsite backups (and verify them)
  2. If uptime is important, use multiple cloud providers
  3. Use Salt, Ansible or similar technology so that you can quickly spin your cloud instances up again to your requirements at a moment's notice
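As a rough illustration of the "and verify them" part of point 1, the sketch below copies a backup file to a second location and checks that the copy's SHA-256 checksum matches the original.  The function names are my own, and the `offsite_dir` path is an assumption standing in for storage that lives outside the cloud account being backed up (a different provider, or a machine you control).

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(source: Path, offsite_dir: Path) -> Path:
    """Copy `source` into `offsite_dir` and confirm the copy is intact.

    `offsite_dir` is a placeholder for storage on a different account
    or provider than the one holding `source`.
    """
    offsite_dir.mkdir(parents=True, exist_ok=True)
    destination = offsite_dir / source.name
    shutil.copy2(source, destination)

    # Verify: a backup you have never checked is only a hope.
    if sha256_of(source) != sha256_of(destination):
        raise IOError(f"checksum mismatch for {destination}")
    return destination
```

A checksum only proves the copy is identical to the original; it is still worth test-restoring a backup from time to time to prove it is actually usable.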

Patrick Galbraith, a Principal Engineer who works with me at HP's Advanced Technology Group, is currently working on a way to enhance libcloud to work better with HP Cloud so that it is easy to use multiple clouds seamlessly.  We are also working on several enhancements to Salt and Ansible, both very promising technologies when it comes to cloud automation.

The way I see it, no one should be putting all their cloud eggs in one basket.
