
Mitch Pronschinske

A Peek at Google’s Production Distributed Systems Tracing Infrastructure

11.08.2011
If we can learn something from the stories about the technologies and processes that major tech companies use, we might be able to bring some of those best-of-the-best practices back to our own neck of the woods.  We're fascinated when we get a chance to hear how eBay is using the cloud, how Twitter is using Java and Lucene, or how Netflix and Facebook are using Apache Cassandra.  Many of these companies have built a reputation for unparalleled reliability, and Google is probably one of just a few whose infrastructure seems too big to fail.

If you're in the business of software deployment (especially the DevOps folks), you'll definitely want to get your hands on Google's recent paper on Dapper, a Large-Scale Distributed Systems Tracing Infrastructure.  Here's the abstract:

Modern Internet services are often implemented as complex, large-scale distributed systems. These applications are constructed from collections of software modules that may be developed by different teams, perhaps in different programming languages, and could span many thousands of machines across multiple physical facilities. Tools that aid in understanding system behavior and reasoning about performance issues are invaluable in such an environment.

Here we introduce the design of Dapper, Google’s production distributed systems tracing infrastructure, and describe how our design goals of low overhead, application-level transparency, and ubiquitous deployment on a very large scale system were met. Dapper shares conceptual similarities with other tracing systems, particularly Magpie [3] and X-Trace [12], but certain design choices were made that have been key to its success in our environment, such as the use of sampling and restricting the instrumentation to a rather small number of common libraries.

The main goal of this paper is to report on our experience building, deploying and using the system for over two years, since Dapper’s foremost measure of success has been its usefulness to developer and operations teams. Dapper began as a self-contained tracing tool but evolved into a monitoring platform which has enabled the creation of many different tools, some of which were not anticipated by its designers. We describe a few of the analysis tools that have been built using Dapper, share statistics about its usage within Google, present some example use cases, and discuss lessons learned so far.
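To make the sampling idea from the abstract a bit more concrete, here's a minimal sketch of how a shared RPC library might make a single sampling decision at the root of a request and then carry that decision along with the trace context, so only a small fraction of requests ever pay the cost of recording span data. This is purely an illustration: the names, structure, and the fixed 1-in-1024 rate below are assumptions for the example, not Dapper's actual API.

```python
import random
import uuid
from dataclasses import dataclass
from typing import Optional

SAMPLE_RATE = 1.0 / 1024  # assumed rate for illustration only

@dataclass
class TraceContext:
    trace_id: str
    parent_span_id: Optional[str]
    sampled: bool

def start_root_trace() -> TraceContext:
    # The instrumented RPC library makes the sampling decision once,
    # at the root of the request.
    return TraceContext(
        trace_id=uuid.uuid4().hex,
        parent_span_id=None,
        sampled=random.random() < SAMPLE_RATE,
    )

def child_context(parent: TraceContext, span_id: str) -> TraceContext:
    # Downstream calls inherit the parent's decision via RPC metadata,
    # so a request is either traced end to end or not at all.
    return TraceContext(parent.trace_id, span_id, parent.sampled)

def record_span(ctx: TraceContext, name: str, duration_ms: float) -> None:
    # Unsampled requests skip the logging work, which is what keeps
    # the overhead low enough for ubiquitous deployment.
    if ctx.sampled:
        print(f"trace={ctx.trace_id} span={name} "
              f"parent={ctx.parent_span_id} took {duration_ms:.1f}ms")
```

Because the decision rides along with the trace context rather than being re-rolled at every hop, a sampled request is recorded coherently across every service it touches, while everything else stays essentially free.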


I recommend you take some time to read the full paper and be amazed by this inside story of how Google has been deploying and using the system for the past two years.