This is a guest post by Patrick Eaton, Software Engineer and Distributed Systems Architect at Stackdriver.
Stackdriver provides intelligent monitoring-as-a-service for cloud-hosted applications. Behind this easy-to-use service is a large distributed system for collecting and storing metrics and events, monitoring and alerting on them, analyzing them, and serving up all the results in a web UI. Because we ourselves run in the cloud (mostly on AWS), we spend a lot of time thinking about how to deal with faults in the cloud. We have developed a framework for thinking about fault mitigation for large, cloud-hosted systems. We affectionately call this framework the “Four Hamiltons” because it is inspired by an article from James Hamilton, a Vice President and Distinguished Engineer at Amazon Web Services.
The article that led to this framework is called “The Power Failure Seen Around the World.” In it, Hamilton analyzes the causes of the power outage that interrupted Super Bowl XLVII in early 2013, and writes:
As when looking at any system faults, the tools we have to mitigate the impact are: 1) avoid the fault entirely, 2) protect against the fault with redundancy, 3) minimize the impact of the fault through small fault zones, and 4) minimize the impact through fast recovery.