
I am the API Evangelist, not in the sense that I’m evangelizing a single API to you, but in the sense that APIs are important for everyone to be aware of. I’m paying attention not just to the technical side, but also to the business and politics of the web API movement. I share my insights by blogging on the business of APIs at apievangelist.com and the politics of APIs at apivoice.com, and you can find more information about me at kinlane.com. Kin is a DZone MVB, is not an employee of DZone, and has posted 72 posts at DZone.

Army of Monkeys Trying to Take Down Netflix

07.22.2011
Netflix wrote an interesting post about making its systems redundant and more fault-tolerant while running in the cloud. Since no single component in the cloud can guarantee 100% uptime, Netflix has to design a cloud architecture where individual components can fail without affecting the availability of the entire system.

To do this, Netflix uses techniques like graceful degradation on dependency failures, as well as node-, rack-, datacenter/availability-zone-, and even regionally redundant deployments. However, designing a fault-tolerant architecture is not enough; they also have to test it regularly to know whether they can survive an outage.
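To make "graceful degradation on dependency failures" concrete, here is a minimal Python sketch. It is not Netflix's code; the names `with_fallback`, `fetch_recommendations`, and `POPULAR_TITLES` are hypothetical, chosen only to illustrate the idea of serving a degraded default when a dependency fails.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def with_fallback(primary: Callable[[], T], fallback: T) -> T:
    """Return primary() if it succeeds, otherwise the degraded default."""
    try:
        return primary()
    except Exception:
        # The dependency failed; degrade instead of propagating the error.
        return fallback


def fetch_recommendations() -> List[str]:
    # Hypothetical dependency call that is currently failing.
    raise ConnectionError("recommendation service unavailable")


# Assumed static default shown when personalization is unavailable.
POPULAR_TITLES = ["Title A", "Title B"]

titles = with_fallback(fetch_recommendations, POPULAR_TITLES)
```

In this sketch the page still renders with generic titles rather than failing outright, which is the point of degrading gracefully.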

So Netflix built Chaos Monkey, a tool that randomly disables production instances, to make sure Netflix can survive this common type of failure without any customer impact. Netflix runs it during regular business hours in a very controlled way, to learn how their network responds as they work to make it more resilient.
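The core idea, picking a random production instance to disable but only during business hours, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual Chaos Monkey: the business-hours window, the instance IDs, and the `terminate` callback are all hypothetical stand-ins for a real scheduler and cloud API.

```python
import random
from datetime import datetime
from typing import Callable, List, Optional


def in_business_hours(now: datetime) -> bool:
    # Assumed window: Monday to Friday, 09:00 to 17:00 local time.
    return now.weekday() < 5 and 9 <= now.hour < 17


def pick_victim(instances: List[str], now: datetime,
                rng: random.Random) -> Optional[str]:
    """Choose one instance at random, or None outside business hours."""
    if not instances or not in_business_hours(now):
        return None
    return rng.choice(instances)


def run_once(instances: List[str], now: datetime, rng: random.Random,
             terminate: Callable[[str], None]) -> Optional[str]:
    """Terminate at most one randomly chosen instance and report which."""
    victim = pick_victim(instances, now, rng)
    if victim is not None:
        terminate(victim)  # hypothetical hook; a real tool would call a cloud API
    return victim


instances = ["i-001", "i-002", "i-003"]  # hypothetical instance IDs
terminated: List[str] = []
weekday_morning = datetime(2011, 7, 20, 10, 0)  # a Wednesday
victim = run_once(instances, weekday_morning, random.Random(7), terminated.append)
```

Passing the clock, the random source, and the terminate hook in as arguments keeps the run "controlled": the same logic can be exercised safely against a fake before it touches production.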

Inspired by the success of the Chaos Monkey, they’ve started creating a whole new line of simians that induce various kinds of failures, or detect abnormal conditions, and test Netflix’s ability to survive them:

  • Latency Monkey – Induces artificial delays in the Netflix RESTful client-server communication layer to simulate service degradation.
  • Conformity Monkey – Discovers instances that don’t adhere to best-practices and shuts them down.
  • Doctor Monkey – Taps into health checks that run on each instance as well as monitors other external signs of health to detect unhealthy instances.
  • Janitor Monkey – Ensures that the Netflix cloud environment is running free of clutter and waste, by searching for unused resources and disposing of them.
  • Security Monkey – An extension of Conformity Monkey. It finds security violations or vulnerabilities, such as improperly configured AWS security groups, and terminates the offending instances.
  • 10-18 Monkey – Detects configuration and run time problems in instances serving customers in multiple geographic regions, using different languages and character sets.
  • Chaos Gorilla – Similar to Chaos Monkey, but simulates an outage of an entire Amazon availability zone.
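Two of these simians are easy to illustrate with a rough Python sketch. The first function wraps a call with an artificial delay, in the spirit of Latency Monkey; the second checks whether enough capacity remains if any one zone disappears, in the spirit of Chaos Gorilla. Everything here is an assumption made for illustration (`inject_latency`, `get_title`, `survives_zone_outage`, the zone names, and the capacity numbers); none of it comes from Netflix's tools.

```python
import time
from functools import wraps
from typing import Callable, Dict


def inject_latency(delay_seconds: float,
                   sleep: Callable[[float], None] = time.sleep):
    """Decorator that delays every call by delay_seconds before running it."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            sleep(delay_seconds)  # simulated service degradation
            return func(*args, **kwargs)
        return wrapper
    return decorator


def survives_zone_outage(instances_by_zone: Dict[str, int],
                         required: int) -> Dict[str, bool]:
    """For each zone, report whether losing it leaves at least `required` instances."""
    total = sum(instances_by_zone.values())
    return {zone: total - count >= required
            for zone, count in instances_by_zone.items()}


# Record the delays instead of really sleeping, so the example runs instantly.
recorded_delays = []


@inject_latency(0.25, sleep=recorded_delays.append)
def get_title(title_id: int) -> dict:
    # Hypothetical client call to a RESTful service.
    return {"id": title_id, "name": "Example"}


response = get_title(42)
zone_report = survives_zone_outage({"zone-a": 4, "zone-b": 4, "zone-c": 2}, required=7)
```

With these made-up numbers, losing zone-c would be survivable, while losing zone-a or zone-b would leave too few instances, which is the kind of gap a zone-level test is meant to surface.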

As Netflix grows at an unprecedented pace, testing for all types of failure scenarios will be a never-ending job. They are in seriously uncharted territory, given the number of instances they run and the bandwidth they consume while running completely in the Amazon cloud.

The Netflix simian army can be a model for cloud fault tolerance that the rest of us can learn from. Let’s just hope the Netflix simian army never develops a mind of its own and comes after the rest of us.

Published at DZone with permission of Kin Lane, author and DZone MVB.
