Lessons from Chaos Monkey: Embracing Chaos to Bring Order to Service Disruptions


When an outage hits, organizations must be prepared to survive with as little disruption as possible. Mike Loukides, VP of emerging tech content, O’Reilly Media, talks about Netflix’s counterintuitive solution to improving resiliency among its systems and operations teams. 

As much effort as businesses put into making their systems reliable, disruptions can, do, and will happen. Service disruptions are an unavoidable truth even for the most well-prepared operations teams. Whether because of a cloud outage, a ransomware attack, or some other failure, services will go down at some point, resulting in lost business, significant recovery costs, and damage to a company’s reputation.

Managing an outage efficiently is much easier if the operations team already has experience handling outages. This is the idea behind chaos engineering, which breaks parts of a software system to give teams real-world practice in fixing them. In the face of a system failure, it ensures that crisis mode isn’t panic mode. If the team has seen a failure before, they can fix it confidently and quickly.

By practicing chaos engineering, organizations build resiliency that allows them to recover from an outage with only a relatively small degradation in service. The team has already practiced restoring service while systems are failing, making chaos engineering the difference between hoping you can deal with an outage and knowing you can. Such was the case with Netflix, which invented Chaos Monkey – and the discipline of chaos engineering – as a way to test the resilience of its IT infrastructure by intentionally disabling computers in its production network.

By delving more into Netflix’s experiences with embracing chaos, organizations can better understand why the approach of “breaking” things to learn improves teams’ capability to identify and eliminate problems before they manifest as outages.

Breaking to Learn: Netflix’s Counterintuitive Solution to Improving Resiliency

Chaos engineering is rooted in a realistic view of the dependability of networks and systems. Assuming that networks are reliable and secure, that there are no limitations on bandwidth, and that latency isn’t an issue is a big mistake. Realistically, developers must assume that networks are unreliable and prone to flaking out at any moment.

Netflix was aware of the vulnerable nature of networked systems when it moved its operations to Amazon Web Services and – unlike many companies that moved to the cloud – developed a plan to account for it. By asking themselves how they could build a reliable software system in an unreliable environment, Netflix’s engineers came up with a counterintuitive solution. Enter Chaos Monkey, a software tool that did the equivalent of unplugging servers, breaking network connections, and shutting down databases.
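
To make the idea concrete, here is a minimal sketch of what a Chaos-Monkey-style tool might look like, assuming an AWS account with boto3 credentials and a hypothetical chaos-eligible tag marking instances the team has agreed may be killed. It is an illustration of the core idea, not Netflix’s actual implementation.

```python
# A minimal sketch of a Chaos-Monkey-style instance killer. Assumes AWS
# credentials are configured for boto3 and that a hypothetical
# "chaos-eligible" tag marks instances that are fair game.
import random

import boto3

ec2 = boto3.client("ec2")


def pick_victim():
    """Return the ID of a random running instance opted in to chaos testing."""
    response = ec2.describe_instances(
        Filters=[
            {"Name": "tag:chaos-eligible", "Values": ["true"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    instances = [
        inst["InstanceId"]
        for reservation in response["Reservations"]
        for inst in reservation["Instances"]
    ]
    return random.choice(instances) if instances else None


if __name__ == "__main__":
    victim = pick_victim()
    if victim:
        print(f"Terminating {victim} - the system should survive this.")
        ec2.terminate_instances(InstanceIds=[victim])
```

The opt-in tag is the important design choice: chaos is deliberate and scoped, not reckless, so teams control which parts of the fleet are exposed to it.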

Chaos Monkey made it clear to developers that they could not assume that the network was reliable and that everything would just work. It built failure into the system, and that delivered valuable lessons to the operations team. In the best cases, performance would degrade but the system would keep running, and the team got indispensable practice in diagnosing and fixing the problem. In other cases, performance would degrade to an intolerable extent, and developers gathered data that they could use to make the systems more reliable. Both of these scenarios are important. Systems crash and networks fail. When this inevitably happens, operations teams should have a well-oiled game plan for handling the outage, regardless of its severity.

Real-world tests such as these demonstrate to developers that they can’t assume reliability: even when everything was working, Chaos Monkey generated failures. The fact is that redundancy sounds a lot easier than it is. It’s very easy to build a system that claims to be redundant but that fails in the real world for reasons that aren’t foreseeable – or that is actually more prone to failure than a simpler one. And in a system of Netflix’s size, components will fail frequently. At its core, Chaos Monkey forced developers to build systems and procedures that were genuinely reliable.

Another key takeaway from Chaos Monkey is that operations teams should not be fixing outages by hand; this isn’t scalable. Chaos engineering helps developers build systems that don’t need manual intervention – instead, they can detect outages and recover independently. While the use of Chaos Monkey was indeed chaotic, it was undoubtedly successful. In fact, it’s one of the self-reported reasons Netflix was able to weather the Amazon outage of April 2011 with only minor service degradation, when many companies were taken offline.
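
What “recover independently” can look like in its simplest form is a watchdog that detects a failed health check and restarts the service itself rather than paging a human. The sketch below assumes a hypothetical health endpoint and systemd unit name; both are placeholders, not part of Netflix’s tooling.

```python
# A minimal sketch of self-healing: probe a health endpoint and restart the
# service on failure instead of waiting for manual intervention. The URL and
# unit name are illustrative placeholders.
import subprocess
import time
import urllib.request

HEALTH_URL = "http://localhost:8080/healthz"  # hypothetical health endpoint
SERVICE = "my-service"                        # hypothetical systemd unit


def healthy() -> bool:
    """Return True if the service answers its health check."""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False


while True:
    if not healthy():
        # Recover automatically; print so the team can review the incident later.
        print(f"{SERVICE} failed its health check - restarting")
        subprocess.run(["systemctl", "restart", SERVICE], check=False)
    time.sleep(10)
```

Production systems push this logic into orchestrators and load balancers, but the principle is the same: detection and recovery are automated, and humans review afterward.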

Netflix has since built on Chaos Monkey by creating the Simian Army, a collection of services that inject different kinds of failures into its systems, such as variations in latency, security problems, and even more widespread outages. At its most extreme, Chaos Gorilla simulates an outage of an entire AWS availability zone. It’s a drastic form of testing, but the alternatives are worse – in other words, if the Simian Army doesn’t break things, an organization’s cloud service will eventually break them.
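
Latency injection, one of the failure types mentioned above, is easy to illustrate. Here is a minimal sketch in the spirit of the Simian Army’s Latency Monkey: a wrapper that randomly delays a call to a dependency. The probabilities, delays, and function names are made up for illustration.

```python
# A minimal sketch of latency injection: with some probability, artificially
# delay the wrapped call to simulate a slow or flaky dependency.
import functools
import random
import time


def inject_latency(probability=0.1, max_delay=3.0):
    """Randomly delay the wrapped call to simulate a slow dependency."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if random.random() < probability:
                time.sleep(random.uniform(0.1, max_delay))
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_latency(probability=0.2, max_delay=5.0)
def fetch_recommendations(user_id):
    ...  # the real network call would go here
```

If callers of fetch_recommendations time out, retry sensibly, and degrade gracefully under these injected delays, there is evidence they will do so when a real dependency slows down.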


The Benefits of Organizational Chaos 

Now that we’ve peeked behind the curtain, let’s take a step back to digest how chaos engineering can benefit organizations beyond Netflix. Chaos engineering, simply defined, means writing software that automatically “breaks” random parts of a system at random times. This can mean disabling servers, turning off databases, or shutting down network connections. Basically, it aims to create almost any kind of problem in a software production system at unexpected times.
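
That definition fits in a few lines of code. The sketch below picks a random running Docker container at a random time and stops it; it assumes Docker is installed and that everything on the host is fair game, whereas a real tool would respect an opt-in list and a schedule.

```python
# A minimal sketch of the definition above: at random intervals, pick a
# random container and stop it. Assumes Docker is available on the host.
import random
import subprocess
import time


def running_containers():
    """Return the names of all currently running Docker containers."""
    out = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.split()


while True:
    time.sleep(random.randint(60, 3600))  # strike at an unpredictable moment
    containers = running_containers()
    if containers:
        victim = random.choice(containers)
        print(f"Chaos: stopping {victim}")
        subprocess.run(["docker", "stop", victim], check=False)
```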

This is a good thing, as it simulates what would happen in the real world. Systems are never tested as well as they should be, and real-world problems don’t arrive on a schedule. Chaos engineering creates unexpected issues with real consequences to ensure systems can respond adequately – not by simulating problems and guessing what can go wrong, but by breaking things and seeing whether the system survives.

Perhaps even more critically, chaos engineering tests the operations team’s ability to bring systems back online after they’re down. For example, if your organization falls victim to a ransomware attack, do you know whether your backups were done properly? Can you restore from backups and get back online? For too many companies, the answer to those questions is: “We hope so.” The point of chaos engineering is that you don’t have to hope. Organizations can be confident in their ability to recover because sometime in the recent past, an automated tool (such as the original Chaos Monkey) disabled all the file servers, and the operations team had to bring everything back online.
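
Turning “we hope so” into “we know” can start with a routine restore drill: extract the most recent backup into scratch space and check that the files actually come back. The archive path and expected file below are hypothetical placeholders for whatever backup tooling is in use.

```python
# A minimal sketch of a restore drill: a backup you cannot restore is not a
# backup. Paths are illustrative placeholders.
import pathlib
import subprocess
import tempfile

BACKUP_ARCHIVE = "/backups/latest.tar.gz"  # hypothetical backup location
MUST_EXIST = "db/users.sql"                # a file the restore must produce


def restore_drill() -> bool:
    """Restore the latest backup into scratch space and verify its contents."""
    with tempfile.TemporaryDirectory() as scratch:
        result = subprocess.run(
            ["tar", "-xzf", BACKUP_ARCHIVE, "-C", scratch],
            capture_output=True,
        )
        if result.returncode != 0:
            return False
        return (pathlib.Path(scratch) / MUST_EXIST).is_file()


if __name__ == "__main__":
    print("restore drill passed" if restore_drill() else "RESTORE DRILL FAILED")
```

Run on a schedule, a drill like this surfaces a broken backup pipeline weeks before a ransomware attack does.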

A failure shouldn’t be an emergency; it should be something the team can deal with without thinking twice. And the only way to train a team to handle failures without going into crisis mode is to ensure that everyone has real-world experience handling failures. For those interested in learning more, Casey Rosenthal and Nora Jones, who pioneered chaos engineering while working together at Netflix, wrote a practical guide that shows engineers how this discipline enables their organizations to navigate complexity.

Practice the Unexpected

We know that components will fail. Outages in complex cloud infrastructures are becoming more common – AWS, the largest cloud provider and a generally reliable one, suffered three outages in December 2021 alone. Meanwhile, ransomware and other cyberattacks are increasing in frequency and scope.

This means that operations teams not only have to expect the unexpected – they have to practice for it. Chaos engineering helps developers flush out bugs, discover failure modes that weren’t anticipated, simulate what happens when a complex distributed system meets the real world, and design systems that can repair themselves.

An organization will never be able to build systems that won’t fail. That’s impossible. However, it can build teams that know how to handle failures, major or minor. The best way to do this is to give those teams practice with real-world failures.

