Chaos Engineering: Test System Resilience Before a Disaster Strikes


As companies worldwide increasingly move to microservices in search of greater scalability and flexibility, their systems are becoming more complex. The more complex a system becomes, the more likely it is that one or more of its components will fail. The combination of cloud computing, microservice architecture, and bare-metal infrastructure has created many moving parts and potential points of failure, many of which are hard to predict. When a new software service is launched into this environment of unknowns, the chances of a breakdown or failure increase significantly. Chaos engineering aims to reduce this unpredictability by putting the complexity to the test. By proactively experimenting to see how a system responds to a failure condition, chaos engineering gives you insight into how system components can fail before an actual outage occurs.

Why Do Containers Need Chaos Engineering?

Because the Kubernetes platform and its workloads are constantly changing, you cannot simply tune your environment, set it, and then forget about it. You cannot assume that if a container goes down, it is going to come right back up by itself; you may have to triage it manually. You also cannot assume that there won’t be any log or disk failures. Outages beyond your control could occur at any time; your dependencies could experience incidents, or your key services could start responding slowly. Minor disruptions in one area can be magnified or have long-standing side-effects on other systems in a network. Service disruptions can negatively affect developer productivity, customer trust in your business, and possibly your bottom line.  

According to Gremlin’s Vice President of Marketing Aileen Horgan, “testing based on chaos engineering techniques is gaining traction in Kubernetes environments because of the dependencies that exist between microservices deployed on Kubernetes clusters.” Chaos engineering practices that are applied to Kubernetes clusters help minimize the downtime of your services and bolster overall security and privacy.

A system’s resilience is how well it can recover from a disruptive event. Chaos engineering helps you understand the resilience of your entire system, not just its components, in the environment where it will run. Chaos engineering subjects a system to the actual failures and dependency disruptions it is likely to encounter in production, such as servers that crash or dependencies that fail. The aim is not to break the system but to make sure that your infrastructure, services, and systems are highly resilient and reliable. Chaos engineering not only validates your application’s fault-handling process but also gauges the resilience of Kubernetes clusters and related infrastructure components during deployment as well as in production. You can test the reliability of Amazon Elastic Kubernetes Service (EKS), Azure Kubernetes Service (AKS), and Google Kubernetes Engine (GKE) to determine how they react when one or more containers are turned off. This will give you an idea of how their autoscaling works and, more generally, how they interact with the rest of your system’s dependencies.

Which Tests to Carry Out in Chaos Engineering?

When deciding which chaos engineering tests to apply to Kubernetes, you should focus on the most common weak points found in Kubernetes implementations. Sample tests would include:

  • Disabling network nodes to observe how your applications recover when IP addresses change or networks become faulty
  • Simulating server downtime to observe how your containers react and how they automatically reallocate resources
  • Simulating varying levels of global traffic to observe the Kubernetes load balancing process
  • Simulating internal DNS failures to reveal how Kubernetes pods react and how traffic changes
  • Introducing drive latency to observe how Kubernetes handles resource balancing 
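
As a rough illustration of the second test above (simulated server downtime), here is a minimal sketch in Python. It does not talk to a real Kubernetes cluster: `ToyCluster`, its naive least-loaded scheduler, and the node and pod names are all hypothetical stand-ins, used only to show the shape of an experiment that injects a fault and then checks whether the workload was reallocated.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyCluster:
    """In-memory stand-in for a cluster: node name -> pods running on it."""
    nodes: Dict[str, List[str]] = field(default_factory=dict)

    def fail_node(self, name: str) -> List[str]:
        """Inject the fault: remove a node and return the pods it was running."""
        return self.nodes.pop(name)

    def reschedule(self, pods: List[str]) -> None:
        """Naive scheduler: place each orphaned pod on the least-loaded node."""
        for pod in pods:
            if not self.nodes:
                raise RuntimeError("no healthy nodes left to schedule onto")
            target = min(self.nodes, key=lambda n: len(self.nodes[n]))
            self.nodes[target].append(pod)

    def running_pods(self) -> int:
        """Count pods across all healthy nodes."""
        return sum(len(pods) for pods in self.nodes.values())


cluster = ToyCluster({"node-a": ["web-1", "web-2"], "node-b": ["web-3"], "node-c": []})
before = cluster.running_pods()          # observe the state before the fault
orphans = cluster.fail_node("node-a")    # simulate server downtime
cluster.reschedule(orphans)              # let the toy scheduler react
after = cluster.running_pods()           # observe the state after recovery
```

In a real experiment, the comparison of `before` and `after` would be replaced by observations from your monitoring system, and the scheduler's behavior would be the thing under test rather than something you write yourself.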

How to Avoid Chaos Engineering Chaos?

Depending on the scope of your testing, chaos engineering can cause unexpected, unwanted outcomes. For example, a test you thought would bring down only one container can bring down several on the same host. Successful chaos engineering practices work to minimize these outcomes so that even though your customers may experience temporary blips, they experience no significant disruption.

Chaos engineering involves causing one or more of your system components to fail on purpose to find out where the system’s weaknesses lie before they have a chance to cause a major outage. The goal is to fix any problems you find before they cause an unexpected failure. To do this safely, it is essential to follow some best practices, which are discussed in the next section.

Best Practices for Chaos Engineering

1. Use game days in production: A game day is a coordinated simulation of an outage or incident to validate that the system can handle the issue correctly. With fault injection tooling, teams can choreograph faults that represent a hypothetical scenario in a controlled manner. This typically includes validation of monitoring systems and human processes that come into play during such incidents.

2. Prevent unintended consequences: Safeguards should be put in place to ensure that faults introduced in a pre-production environment will not affect production or impact real customer traffic. First, the blast radius of a fault scenario should be contained to minimize the impact on other components and your customers. Second, the ability to inject faults should be restricted to approved personnel to prevent unintentional consequences or attacks by hackers. Finally, have a failsafe mechanism in place to ensure that an experiment can be aborted and rolled back whenever needed. Chaos engineering services provide the ability to revert systems to their original state in time so that real users are not unduly affected.
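
The failsafe idea (abort and roll back whenever needed) might be sketched in Python as follows. This is an illustrative pattern only, not the API of any chaos engineering tool: `injected_fault`, `run_experiment`, `ExperimentAborted`, and the sample readings are hypothetical names and values chosen for the example.

```python
from contextlib import contextmanager


class ExperimentAborted(Exception):
    """Raised when a safety check says the experiment must stop."""


@contextmanager
def injected_fault(inject, rollback):
    """Apply a fault and always undo it, even if the experiment aborts."""
    inject()
    try:
        yield
    finally:
        rollback()


def run_experiment(inject, rollback, checks, is_safe):
    """Run `checks` observation rounds; abort as soon as is_safe() is False."""
    with injected_fault(inject, rollback):
        for round_number in range(checks):
            if not is_safe():
                raise ExperimentAborted(
                    f"safety check failed in round {round_number}"
                )
    return "completed"


# Hypothetical system state and per-round error-rate readings.
state = {"latency_ms": 20}
readings = iter([0.01, 0.02, 0.30])


def inject():
    state["latency_ms"] = 500  # the fault: added latency


def rollback():
    state["latency_ms"] = 20   # restore the original state


def is_safe():
    return next(readings) < 0.05  # abort once the error rate reaches 5%


try:
    outcome = run_experiment(inject, rollback, checks=3, is_safe=is_safe)
except ExperimentAborted as exc:
    outcome = f"aborted: {exc}"
```

Putting the rollback in a `finally` clause means the original state is restored whether the experiment completes, aborts on a safety check, or fails for an unrelated reason.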

3. Perform continuous testing throughout your container lifecycles: A key element of successful chaos engineering is monitoring and testing throughout development, deployment, and release cycles. To achieve this goal, it is recommended that the process be integrated in DevOps value chains. It is critical that testing is performed continuously throughout your container lifecycles so that all the required checks are in place.

4. Have monitoring / observability in place: As stated by Charity Majors, co-founder of Honeycomb: “chaos engineering without observability is just chaos.” At the start, you need to determine the system’s steady state, which serves as the baseline for gauging how the system reacts to a chaos engineering experiment. In the absence of proper monitoring / observability, you will have little to no knowledge of how your systems / services are performing or how they react when you run the experiment. Proper monitoring keeps you informed of when you need to stop the experiment and restore normal production operation; a rising percentage of dropped customers is one such signal.

You also need to monitor system metrics such as how your resources (CPU, I/O, memory, and disk) handle user loads in general and the experiment in particular. You should also measure the availability of your services and applications against your company’s KPIs, and stay informed about any customer complaints, especially while running a test in production. You can use these metrics to determine how your Kubernetes implementation typically behaves. Chaos testing will reveal how these metrics change. Only when you know how your system should behave can chaos engineering verify that it behaves accordingly.
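
One simple way to express a steady state and compare an experiment against it is sketched below. The choice of mean latency as the metric, the 10% tolerance, and the sample values are assumptions made for illustration; your own KPIs and thresholds would take their place.

```python
from statistics import mean


def steady_state(samples):
    """Summarize baseline samples (taken before the experiment) as their mean."""
    return mean(samples)


def within_tolerance(baseline, observed, tolerance=0.10):
    """True if the observed mean deviates from baseline by at most
    `tolerance`, expressed as a fraction of the baseline."""
    return abs(mean(observed) - baseline) <= tolerance * baseline


# Hypothetical latency samples in milliseconds.
baseline_latency = steady_state([100, 110, 90, 100])
ok_run = within_tolerance(baseline_latency, [105, 108, 102])   # small drift
bad_run = within_tolerance(baseline_latency, [150, 170, 160])  # large drift
```

A check like `within_tolerance` can double as the stop condition for an experiment: when it returns `False`, the system has left its steady state and the experiment should be halted.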

5. Automate chaos engineering experiments to run continuously: Running experiments manually is labor-intensive and cannot be sustained for long periods of time. As such, it is better to automate your tests and run them continuously. Automation should cover both the orchestration of experiments and the analysis of their results, so that both are carried out in a systematic manner.

6. Determine a blast radius that causes minimum disruption: The blast radius is the actual impact the chaos engineering experiment will have on your entire system. Under no circumstances should you start your chaos engineering experiment in production before determining the blast radius and how it will affect your development or QA environment. You should confine your blast radius so that you’re not running your experiment against your entire infrastructure or at maximum capacity.
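
Confining the blast radius can be as simple as limiting which hosts an experiment is allowed to touch. The sketch below assumes a hypothetical `pick_targets` helper that bounds the selection by both a fraction of the fleet and a hard cap; the host names and numbers are made up for the example.

```python
import math
import random


def pick_targets(hosts, fraction, max_targets, seed=None):
    """Choose a random subset of hosts, bounded by a fraction and a hard cap."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not hosts or max_targets < 1:
        return []
    # At least one host, at most `max_targets`, otherwise `fraction` of the fleet.
    count = min(max_targets, max(1, math.floor(len(hosts) * fraction)))
    return random.Random(seed).sample(hosts, count)


hosts = [f"host-{i}" for i in range(20)]
# 25% of 20 hosts would be 5, but the hard cap limits the experiment to 3.
targets = pick_targets(hosts, fraction=0.25, max_targets=3, seed=42)
```

Using both limits is a deliberate choice here: the fraction keeps the experiment proportional to the fleet, while the cap protects large fleets from an unexpectedly wide experiment.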

7. Fix problems that have been unearthed: If your experiment points toward weak links in your systems or services, make sure they are fixed as early as possible. Doing so improves the reliability of the affected components. Once this is done, run the same chaos engineering experiment repeatedly, increasing the percentage of the system it affects with each subsequent run.
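
The repeat-and-widen loop described in this item could look something like the following sketch. The `ramp` function, the list of percentages, and the pass/fail rule are hypothetical; in practice `experiment_passes` would run the real experiment at the given percentage and report whether the system stayed within its steady state.

```python
def ramp(percentages, experiment_passes):
    """Run the experiment at each percentage in order; stop at the first failure.

    Returns a pair: (highest percentage that passed, percentage that failed).
    Either element is None if there was no such run.
    """
    last_passed = None
    for pct in percentages:
        if not experiment_passes(pct):
            return last_passed, pct
        last_passed = pct
    return last_passed, None


# Hypothetical system that tolerates faults on up to 25% of its hosts.
result = ramp([1, 5, 10, 25, 50], lambda pct: pct <= 25)
```

Stopping at the first failure keeps the impact small: the failing percentage tells you where to look for the next weak link before widening the experiment any further.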

8. Communicate clearly what you expect to achieve from your chaos engineering experiments: Unless everyone in your organization fully believes in the advantages of chaos engineering, there is bound to be confusion. You also do not want the notion to spread that you simply break systems and components at random without a legitimate reason. Remember, chaos engineering can affect traffic and real users, which can have an adverse effect on other teams in the organization. Before you conduct any such test, make sure everyone is informed about what you plan to experiment on and when. This way, if other team members notice an untoward incident on their screens involving one of the services, they will understand that it is due to the ongoing experiment and not an actual incident such as a DDoS attack. In short, communicating and collaborating with everyone involved helps you learn how resilient your systems actually are and allows you to make continuous improvements to them.

In Conclusion

With systems across organizations and verticals becoming extremely complex, the need to put relevant checks in place to prevent untoward incidents from affecting the business has never been greater. Testing software and systems through chaos engineering will show how resilient they are to real outages and failures. This can go a long way toward preventing disruptions.

Do you think chaos engineering can improve system reliability and minimize the negative impact of downtime? Tell us what you think on LinkedIn, Twitter, or Facebook. We would love to hear from you.