We are building an instrumentation platform that runs across dozens of datacenters to provide operational visibility for internal systems and applications. This platform must remain up as much as possible and allow support and operations staff to understand and diagnose problems quickly. They must be able to ask questions like "what machines and applications are publishing metrics?", "what systems appear to be offline?", "what order did these errors occur in?", all without consulting every datacenter. Furthermore, they must be able to change configuration quickly, with confidence that every affected system will receive and act upon it.
To help with these problems, we are implementing several recently developed protocols for cluster membership, epidemic broadcast, and monotonic time. Respectively, these protocols allow us to know what nodes are peers, to disseminate configuration and status information, and to agree on roughly relative orders of events. Best of all, they are all synchronization-free, meaning we can achieve our goals while remaining highly available. In this talk, we'll discuss the protocols we chose, challenges to implementing them, and some preliminary results from deploying the protocols across our infrastructure.