Chaos Engineering Aims to Avert Disasters


Modern web services like WhatsApp and Uber have built their business models on reliability and stability. Keeping these services in “uptime,” running without problems, is such a priority that an entire support industry, known as “chaos engineering,” has been born to search for potential problems and identify solutions.

Chaos engineering deliberately injects faults into a system to see how it responds, so that teams can prepare for, and minimize, downtime and outages before such problems occur for real.

Its value lies in letting engineers study, in a controlled environment, how a single element can damage a system, and in helping them isolate the role of an individual component in a chain of events that brings a system down.

Analyzing the Domino Effect

A minor disruption in an availability zone belonging to Amazon Web Services could cripple a service. And while developers can call on tools to identify problems, the tools on offer don’t systematically probe a system to work out an appropriate response.

To fill that gap, chaos engineers simulate events, such as congestion in the database or problems with storage, and then design ways to respond.

The approach departs radically from traditional code testing: it attempts to identify places where breakdowns could occur without running the test on the system itself.

To limit the negative impact of testing, a chaos engineer would simulate how the loss of a database might affect Airbnb rather than actually shutting the database down on the live site.
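A simulated database loss of this kind can be sketched in a few lines of Python. This is an illustration only, not how any company named here runs its experiments: `FlakyDatabase`, `ListingService` and `run_experiment` are hypothetical names, and the stale-cache fallback is an assumed design choice.

```python
import random


class DatabaseUnavailable(Exception):
    """Raised by the simulated database when a fault is injected."""


class FlakyDatabase:
    """In-memory stand-in for a database that fails a set fraction of reads."""

    def __init__(self, rows, failure_rate, seed=0):
        self.rows = dict(rows)
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)  # seeded so runs are repeatable

    def get(self, key):
        # Inject the fault: fail with probability `failure_rate`.
        if self.rng.random() < self.failure_rate:
            raise DatabaseUnavailable(key)
        return self.rows[key]


class ListingService:
    """Hypothetical service that serves a stale cached copy when the database is down."""

    def __init__(self, db):
        self.db = db
        self.cache = {}

    def lookup(self, key):
        try:
            value = self.db.get(key)
            self.cache[key] = value
            return value, "fresh"
        except DatabaseUnavailable:
            if key in self.cache:
                return self.cache[key], "stale"
            return None, "failed"


def run_experiment(failure_rate, requests=1000):
    """Count how requests are answered at a given injected failure rate."""
    db = FlakyDatabase({"listing-1": "Loft in Lisbon"}, failure_rate)
    service = ListingService(db)
    outcomes = {"fresh": 0, "stale": 0, "failed": 0}
    for _ in range(requests):
        _, status = service.lookup("listing-1")
        outcomes[status] += 1
    return outcomes


if __name__ == "__main__":
    for rate in (0.0, 0.3, 1.0):
        print(rate, run_experiment(rate))
```

Comparing the counts at different failure rates shows whether the fallback keeps customers served. At a rate of 1.0 the cache is never filled, so every request fails, which is the kind of gap such an experiment is meant to surface.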

Chaos engineering also aims to generate new knowledge about how elements of a system work together in a variety of conditions in ways that might not have been anticipated.

“We also want to explore things like a large increase in traffic, race conditions, byzantine failures (poorly behaved nodes generating faulty responses, misrepresenting behavior, producing different data to different observers, etc.), and unplanned or uncommon combinations of messages,” says Ali Basiri, a chaos engineering expert. “Failure testing breaks a system in some preconceived way, but doesn’t explore the wide-open field of weird, unpredictable things that could happen.”

Cloud Behaviour Increases Calls for Chaos Engineers

The need for chaos engineering is growing as web service companies increasingly spread their systems across multiple data centers to reduce storage costs.

It’s a new field, inspired by in-house chaos engineering done at Netflix in 2015. Gremlin, a startup founded by Kolton Andrus, is one of the leading chaos engineering firms, raising $18 million in 2018 to continue developing its engineering team.

“In the modern cloud era — where systems are distributed, containerized, and highly ephemeral — it’s become nearly impossible to have a complete understanding of system behavior without doing the kind of proactive testing Gremlin offers,” says Tomasz Tunguz, a venture capitalist with Redpoint, noting that companies like Twilio, Expedia and Under Armour are all using chaos engineering to manage their systems.

Examples of real-world events that chaos engineers might test for include hardware failures, running out of memory or storage, spikes in traffic or the unavailability of the service itself. The testing should ideally occur while the web service is still being developed, or soon after, to head off performance problems before they occur.

Is Poor Design Creating its Own Industry?

Critics are expressing concern that the development of a field such as chaos engineering may distract from the necessary focus on designing a solid web service up front.

“The typical counterarguments are that the principle is a band-aid for applications that were poorly planned and architected in the first place, or that it’s another buzzword-laden excuse to invent shiny new tools that no one knew they needed,” says Chris Ward in DevOps Zone.

Ward points out that chaos engineering is most effective when used to measure the impact of specific events, such as the projected drop in business that would come from PayPal being out of service for a specific period of time.

Businesses employing chaos engineering to test web service performance also need to ensure that they have the tools to analyze the results. They should also assess how those results tie back to the original hypothesis of an experiment.

Above all, chaos engineering tests should never themselves trigger chaos for the customers they are intended to protect, a practice known in the industry as minimizing the “blast radius.”

“Experimenting in production has the potential to cause unnecessary customer pain,” caution the writers of the ‘Principles of Chaos’ site, an effort to establish principles for the field. “While there must be an allowance for some short-term negative impact, it is the responsibility and obligation of the chaos engineer to ensure the fallout from experiments are minimized and contained.”
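One way to keep the blast radius small is to expose only a small share of users to an experiment and to halt it automatically once errors climb. The Python sketch below illustrates that idea under stated assumptions; it is not taken from the ‘Principles of Chaos’ site or from any vendor’s tooling. `in_experiment` and `ExperimentGuard` are hypothetical names, and the thresholds are arbitrary.

```python
import hashlib


def in_experiment(user_id, percent):
    """Deterministically place roughly `percent`% of users in the experiment."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket from 0 to 99 for each user
    return bucket < percent


class ExperimentGuard:
    """Halts an experiment once the observed error rate exceeds a limit."""

    def __init__(self, max_error_rate, min_samples=50):
        self.max_error_rate = max_error_rate
        self.min_samples = min_samples  # wait for enough data before judging
        self.total = 0
        self.errors = 0
        self.halted = False

    def record(self, ok):
        """Record one request outcome; return False once the experiment must stop."""
        self.total += 1
        if not ok:
            self.errors += 1
        if (self.total >= self.min_samples
                and self.errors / self.total > self.max_error_rate):
            self.halted = True
        return not self.halted


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    exposed = [u for u in users if in_experiment(u, 5)]
    print(f"{len(exposed)} of {len(users)} users exposed to the experiment")
```

Hashing the user ID rather than picking users at random on each request keeps the same customers in or out of the experiment for its whole duration, which makes the results easier to tie back to the original hypothesis.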