The news this past week has brought endless images of devastation from all across the metropolitan region.
More than once, in conversations about the recovery efforts, I’ve commented, “That’s similar to what I do.” Web operations is all about disaster recovery and crisis management in the datacenter. If you saw Con Edison down in the trenches, you might not know how the power gets to your building, or what all those pipes down there do, but you know when it’s out. You know when something is out of order.
That’s why datacenter operations can learn so much about crisis management from the handling of Hurricane Sandy.
1. Run Fire Drills
Nothing can substitute for real-world testing. Run your application through its paces: pull the plugs, pull the power. You need to know what’s going to go wrong before it happens. Put your application on life support and see how it handles. Fail over to backup servers, and restore the entire application stack and its components from backups.
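To make the drill concrete, here is a minimal sketch in Python. It is a toy model, not a real drill harness: the `Server` class, `serve_with_failover`, and `run_fire_drill` are all hypothetical names invented for illustration, and "pulling the plug" is simulated by flipping a flag. The idea it shows is to take each server down in turn and record whether the application still answers.

```python
from dataclasses import dataclass


@dataclass
class Server:
    """Hypothetical stand-in for a real server; 'healthy' simulates power."""
    name: str
    healthy: bool = True

    def handle(self, request: str) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


def serve_with_failover(servers, request):
    """Try each server in order and return the first successful response."""
    for server in servers:
        try:
            return server.handle(request)
        except ConnectionError:
            continue  # this one is down, fall through to the next
    raise RuntimeError("all servers are down")


def run_fire_drill(servers, request="GET /health"):
    """Pull the plug on each server in turn; record whether the app survived."""
    results = {}
    for victim in servers:
        victim.healthy = False  # simulate the outage
        try:
            serve_with_failover(servers, request)
            results[victim.name] = True
        except RuntimeError:
            results[victim.name] = False
        finally:
            victim.healthy = True  # plug it back in before the next round
    return results
```

Running `run_fire_drill([Server("primary"), Server("backup")])` reports that the application survives the loss of either machine. Run it against a single server and the drill exposes the single point of failure, which is exactly what you want to learn before the storm rather than during it.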
2. Let the Pros Handle Cleanup
This week Fred Wilson blogged about a small data room his family managed for their personal photos, videos, music and so forth. He ruminated on what would have happened to that home datacenter had he been living there when Sandy struck.
It’s a story many of us can relate to, and it points to obvious advantages of moving to the cloud. Handing things over to the pros means basic best practices will be followed. EBS storage, for example, is redundant, so a single hard drive failure won’t take you out. What’s more, S3 stores redundant copies of your data across multiple facilities.
Web operations teams do what Con Edison does, but for the interwebs. We drill down into the bowels of our digital city, find the wires that are crossed, and repair them. Crisis management rules the day. I can admire how quickly they’ve brought NYC back up and running after the wrath of Hurricane Sandy.
3. Have a Few Different Backup Plans
Watching New Yorkers find alternate means of transportation into the city has been nothing short of inspirational. Trains not running? A bus service takes its place. L trains not crossing the river? A huge stream of bikes takes to the Williamsburg Bridge to get workers where they need to go.
Deploying on Amazon can be a great cloud option, but consider using multiple cloud providers to give you even more redundancy. Don’t put all your eggs in one basket.
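As a rough illustration of the eggs-in-several-baskets idea, the sketch below writes one backup to every configured provider and complains if too few copies land. Everything here is an assumption made for the example: `InMemoryProvider` is a hypothetical stand-in for a real cloud storage client, and `replicate_backup` is not any vendor's API. In practice each provider's `upload` would wrap that vendor's own SDK.

```python
class InMemoryProvider:
    """Hypothetical stand-in for a cloud storage client; keeps blobs in a list."""

    def __init__(self, name, available=True):
        self.name = name
        self.available = available
        self.blobs = []

    def upload(self, data: bytes) -> None:
        if not self.available:
            raise ConnectionError(f"{self.name} is unreachable")
        self.blobs.append(data)


def replicate_backup(data: bytes, providers, min_copies: int = 2):
    """Upload data to every provider and return the names that succeeded.

    Raises RuntimeError if fewer than min_copies uploads succeed.
    """
    succeeded = []
    for provider in providers:
        try:
            provider.upload(data)
            succeeded.append(provider.name)
        except ConnectionError:
            # One provider's outage should not stop the others.
            continue
    if len(succeeded) < min_copies:
        raise RuntimeError(
            f"only {len(succeeded)} of {min_copies} required copies were stored"
        )
    return succeeded
```

With three providers and one of them unreachable, the backup still lands in two places and the call succeeds. With only two providers and one down, it raises, which is the signal that your backup plan has quietly become a single basket.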
4. Keep Open Lines of Communication
While recovery continued apace, city dwellers below 34th Street looked to text messages and old-school radios to get news and updates. When would power be restored? Does my building use gas or steam to heat? Why are certain streets coming back online, while others remain dark?
During an emergency like this one, it becomes obvious how important lines of communication are. So too in datacenter crisis management: key people from business units, operations teams, and dev must all coordinate. Orchestrating that is an art all by itself.