
Arnon Rotem-Gal-Oz is the director of technology research for Amdocs. Arnon has more than 20 years of experience developing, managing and architecting large distributed systems using varied platforms and technologies. Arnon is the author of SOA Patterns from Manning Publications.

Amazon’s EC2 & EBS outage

05.03.2011

Unless you have been living under a rock, you've probably heard of, if not felt, Amazon's outage (you can read more about it all over the web, e.g. Cade Metz at The Register or Julianne Pepitone at CNN Money; [edit Apr 26] see also Todd Hoff's list of posts on the subject). This incident is very interesting in the technical sense, as well as very disturbing.

First off, it is a major event, since Amazon controls almost 60% of the Infrastructure-as-a-Service market (per the WSJ), and an incident like this is bad publicity for the whole cloud concept. After all, if Amazon is floundering, what does that mean for Rackspace and the other IaaS players, not to mention PaaS players like Google and Microsoft (since PaaS solutions require more complicated software and tighter integration with end solutions)?

It is also worth mentioning that this isn't the first major outage for Amazon. One notable "availability event" occurred in 2008, when S3 had a major problem for about half a day, and there were a few minor problems with EC2 in 2009. What stands out here is that the availability zones feature, which was supposed to isolate this type of breakdown to smaller areas, broke down, and only sites that had data-center redundancy (like Netflix, for example) managed to handle the interruption, while sites like Foursquare, Reddit and even, it seems, a company monitoring cardiac arrests were all harmed.

Another alarming behavior on Amazon's part is the lack of transparency and poor crisis management in handling the outage. For example, Keith Smith, CEO of BigDoor (one of the startups affected by the outage), writes:

“Starting at 1:41 a.m. PST, Amazon’s updates read as if they were written by their attorneys and accountants who were hedging against their stated SLA rather than being written by a tech guy trying to help another tech guy.”


When a big supplier fumbles, a lot of companies are affected, and it is bound to get big press. Also, systems, especially complex ones, have bugs, and it's understandable that things may break from time to time (and I am sure that Amazon, which evidently has top talent, will find a way to prevent this from recurring). However, cloud providers need to understand that "with great power comes great responsibility", and the technical offering needs to be strengthened with great support and transparency.

On the technical level, it also means that while moving to the cloud carries a lot of benefits and can save a lot on operations costs, the responsibility for our application's uptime is still ours. Depending on the cost of failure (on a scale from a few dollars to people's lives), we should also architect for disaster regardless of vendor claims. When we build closed systems we may look at the Mean Time Between Failures (MTBF and MTBCF) advertised by hardware manufacturers, but we would also add software-based reliability mechanisms; for the cloud, that may mean cross-data-center (region) deployments, cross-cloud-provider deployments, or even an on-premise backup. These are the same measures you'd take when dealing with the electric company: if it is important enough, you install UPSs, generators, alternate sites and whatnot. You just have to figure out how important business continuity is.
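To make that concrete, here is a minimal sketch in Python of the kind of client-side failover you could layer on top of a cross-region or cross-provider deployment. The health-check URLs below are hypothetical placeholders, not real services, and in practice you would more likely do this with DNS failover or a load balancer; the point is simply to never assume a single region is up.

import urllib.request

# Hypothetical endpoints for the same application deployed in several places;
# the hostnames below are placeholders, not real services.
ENDPOINTS = [
    "https://app.us-east.example.com/health",    # primary region
    "https://app.eu-west.example.com/health",    # cross-region backup
    "https://app.onpremise.example.com/health",  # on-premise fallback
]

def pick_live_endpoint(timeout_seconds=2):
    """Return the first deployment whose health check answers, or None."""
    for url in ENDPOINTS:
        try:
            with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
                if response.status == 200:
                    return url
        except Exception:
            continue  # this region (or provider) looks down, try the next one
    return None  # total outage: page a human, serve a static page, etc.

if __name__ == "__main__":
    live = pick_live_endpoint()
    print("routing traffic to:", live or "no healthy deployment found")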

I guess we'll all wait and see how this unfolds and what the after-effects of this outage on cloud computing will be. I personally think that moving to the cloud is still a good move in many cases; however, this incident helps drive home that the responsibility for our application's wellness is still ultimately ours.

Edit Apr. 24th:

I just read a post on Coding Horror which refers to a year-old post on Netflix's blog called "5 Lessons We've Learned Using AWS [Amazon Web Services]". Netflix, in case you're wondering, survived Amazon's outage, and indeed, in lesson #3 they explain that if you want to survive failures you have to plan and constantly test for them:

3. The best way to avoid failure is to fail constantly. We’ve sometimes referred to the Netflix software architecture in AWS as our Rambo Architecture. Each system has to be able to succeed, no matter what, even all on its own. We’re designing each distributed system to expect and tolerate failure from other systems on which it depends. If our recommendations system is down, we degrade the quality of our responses to our customers, but we still respond. We’ll show popular titles instead of personalized picks. If our search system is intolerably slow, streaming should still work perfectly fine.
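Netflix's post describes the pattern but doesn't show code; the following is a minimal sketch in Python of the fallback idea lesson #3 is talking about. The function names are made-up stand-ins rather than Netflix's real services: when the personalized path fails, answer with something cheaper instead of failing the whole page.

import random

# Stand-ins for real remote services; the names and behavior are made up
# purely to illustrate the fallback pattern.
def personalized_picks(user_id):
    """Simulate a recommendations service that is sometimes unavailable."""
    if random.random() < 0.3:
        raise ConnectionError("recommendations service unavailable")
    return ["personalized title %d for user %s" % (i, user_id) for i in range(5)]

def popular_titles():
    """Cheap, always-available fallback: unpersonalized popular titles."""
    return ["popular title %d" % i for i in range(5)]

def homepage_rows(user_id):
    """Degrade gracefully: fall back to popular titles if recommendations fail."""
    try:
        return personalized_picks(user_id)
    except (ConnectionError, TimeoutError):
        return popular_titles()

if __name__ == "__main__":
    print(homepage_rows(user_id=42))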
Published at DZone with permission of Arnon Rotem-Gal-Oz, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Comments

Jason Marshall replied on Tue, 2011/05/03 - 11:38am

I really don't understand how everyone talks about Cloud Computing as decentralization.

If you can't trust the Amazon Cloud, can you trust some rinky-dink shop up the street? I would not bet my job/company on it. Which means that between three and half a dozen large cloud providers will be used by 90% of the customer base. That's not decentralization, that's centralization. Consolidation.

Arnon Rotem-gal-oz replied on Tue, 2011/05/03 - 12:04pm

@Jason I don't think cloud computing has to do with centralization or decentralization. While naturally it is about building ever-larger data centers for the cloud providers, the companies using the cloud don't care one way or the other. We do care about outages, QoS, latencies, etc., but we don't care how we get them. In other words, it is about utility computing: you want another light bulb, you flick a switch; you want another instance, you flick a switch... In this respect it is interesting to re-read Nicholas Carr's 2003 article "IT Doesn't Matter", which talks about the commoditization of IT.
