Discovering Facebook Systems and Practices - Operating At Scale
Reliability is important to Facebook, but failures will occur, and the Operations and Infrastructure Engineering teams need to respond to them quickly. Facebook is a fast-paced environment, and the principle of moving fast applies not only to its engineering practices but also to how failures are fixed. Systems, processes and culture all work together to make this happen.
This talk will highlight some of the systems and practices that Facebook employs to manage systems and software at scale. I will use a few case studies to describe how these are built and provide guidelines for how others can build their own systems and operations teams that scale with infrastructure growth. The talk will touch on concepts such as automation, communication, monitoring, incident management, infrastructure design and code releases.