How To Implement Chaos Engineering Using Cloud To Improve On-Prem Systems


In this article, Tony Perez, cloud solution architect at Skytap, discusses how to apply chaos engineering practices to traditional applications by recreating production environments in the cloud and resetting them between tests using automation.

Back in 2011, Netflix introduced a tool called Chaos Monkey to inject random failures into their cloud architecture as a strategy for identifying design weaknesses. Fast forward to today, and the concept of resiliency engineering has evolved to the point where “Chaos Engineer” is an actual job title.

Many companies like Twilio, Facebook, Google, Microsoft, Amazon, Netflix, and LinkedIn use chaos as a way to understand their distributed systems and architectures. While these companies rely on cloud-native architectures, there may be a way to also apply the practice of Chaos Engineering to improve the reliability of traditional data center applications that may never be moved to the cloud.

That is significant for the many organizations whose business-critical functions still rely on J2EE, WebSphere, MQ Series, or client-server applications that have been around for many years. Even if these apps are adequately maintained, they might still suffer from outages and have quality control problems.

As organizations weigh whether to invest in updating their traditional applications, one obvious path is to “rewrite” and migrate them to the cloud using the Refactor/Re-architect strategy from Amazon’s 6Rs. But that requires a budget and a depth of talent that are out of reach for many traditional organizations. And if core applications are partially or wholly based on AIX or IBMi, the likelihood of recreating them in a complete cloud-native fashion seems remote.

Performing Chaos Engineering To Improve Traditional Applications

Let us imagine there is a magical way to perform Chaos Engineering to improve these traditional applications; what types of tests would you execute? It does not take much imagination to come up with an extensive list of possible test scenarios that can be applied to traditional application architectures using historical failures as a starting point. The obvious tests would be resource-based like:

  • Low memory
  • Not enough CPU
  • Full disk volumes
  • Low network bandwidth, high latency
  • Hardware failures like a failed disk drive, failed server, disconnected network

And not so obvious ones could be:

  • Database/server process down
  • Microservice down
  • Application code failure
  • Expired certificates

And even less obvious:

  • Is there sufficient monitoring, and have alarms been validated?
  • Understanding the repair time to correct different types of problems once identified. If a database server goes down, how long does it take to bring it back up, along with all the associated application components that talk to the database?
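As a sketch of that last point, repair time can be measured by polling a health check until the system reports healthy again. This is a minimal illustration in Python; `health_check` stands in for whatever probe fits your application (a database ping, an HTTP request to an endpoint), and the stub below that “recovers” after three polls is purely hypothetical:

```python
import time

def measure_repair_time(health_check, poll_interval=1.0, timeout=600.0):
    """Poll `health_check` until it returns True; return elapsed seconds.

    `health_check` is any zero-argument callable. In a real test it might
    ping the database or hit an application endpoint.
    """
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        if health_check():
            return time.monotonic() - start
        time.sleep(poll_interval)
    raise TimeoutError("system did not recover within the timeout")

# Hypothetical stub: reports healthy on the third poll.
state = {"polls": 0}
def stub_check():
    state["polls"] += 1
    return state["polls"] >= 3

elapsed = measure_repair_time(stub_check, poll_interval=0.01)
print(f"repaired after {state['polls']} polls, {elapsed:.2f}s")
```

Tracking this number across test runs tells you whether your recovery procedures are actually improving.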

Performing this type of aggressive testing against your traditional on-prem applications could provide significant benefits. You might extend the life of that system if you could make it more reliable. You might even be able to put off the “rewrite” decision for the foreseeable future.

Using the Cloud To Apply Chaos Engineering To Traditional Applications

Before you could even begin testing like this, however, there are a few additional obstacles. You probably do not want to inject chaos into your production application. Also, you might not have a test system that fully represents production, so the results of testing might not carry over. And finally, even if you have a test system that looks like production, and you could run “destructive” tests against it, how long will it take to repair it to run the next experiment?

It turns out, the solution to these obstacles – and the “magical” way to apply chaos engineering practices to traditional applications – is to use the cloud.

With the cloud, there is the potential to create a production-like environment with all the same application components as the original system of record. In this model, it is not necessary to convert anything to “cloud-native.” Just do a simple lift-and-shift, changing no lines of application code, and run all the same servers as the original application; they would just be virtual machines in the cloud. Using the same IP addresses, same hostnames, same network topology, the same amount of memory, disk, etc., you would recreate the original application’s “twin” in the cloud. And the twin would be where you would run various chaos engineering tests to observe the behavior of individual components as well as the overall system in general. This can even be done for applications that will stay on-premises.

Of course, infrastructure in the cloud will not exactly replicate what you have on-prem, but there are workarounds for testing purposes. For example, the model and capacity of your enterprise SAN (storage array) will not be replicated in the cloud, so you would not be able to do a test of “failing the SAN.” What you can easily do in the cloud is disconnect or manipulate a disk attached to a VM to simulate a failure. And for network components, you can disconnect the virtual network from a cloud-based VM, and it would be similar to what would happen on-prem if a physical or virtual network segment failed.

By recreating a representation of your traditional on-prem application in the cloud that works “the same as it does now,” with no redesign, you are then free to do aggressive tests on it to determine how to make it better. And by making it better, you can extend its life.

But if you run a bunch of tests that eventually destroy the application clone running in the cloud, how do you reset it for the next round of tests? Rebuilding or fixing things by hand could take days, weeks, or longer. Another benefit of the cloud for this type of testing is the ability to have effectively limitless numbers of clones.

If you are already doing “infrastructure as code,” you might already have the scripts and tooling to re-create the system from scratch. Different clouds have different approaches to this. Whichever you choose, the goal is to be able to “quickly” re-create a ready-to-use running set of infrastructure and application components that represent the entire working original application, including all the servers (VMs), storage, networks, installed software, the configuration of the OS, everything. And be able to do that in minutes or hours, not days or weeks.
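The “save and re-create” idea can be sketched with a toy in-memory model. Everything below is hypothetical and stands in for your cloud’s actual template or IaC tooling; the point is that each clone keeps the original hostnames and IPs, and deploying a clone is a single cheap operation:

```python
from copy import deepcopy
from dataclasses import dataclass

@dataclass
class VM:
    hostname: str
    ip: str
    memory_gb: int

@dataclass
class Environment:
    name: str
    vms: list

@dataclass
class EnvironmentTemplate:
    """A saved 'golden' description of the whole application environment."""
    name: str
    vms: list

    def deploy(self, clone_name):
        # Each clone keeps the same hostnames and IPs as the original;
        # cloud-side network isolation keeps the address spaces from colliding.
        return Environment(name=clone_name, vms=deepcopy(self.vms))

template = EnvironmentTemplate(
    name="PROD-TWIN",
    vms=[VM("db01", "10.0.1.10", 64),
         VM("app01", "10.0.1.20", 32)],
)
chaos1 = template.deploy("CHAOS#1")
chaos2 = template.deploy("CHAOS#2")
```

In a real cloud, `deploy` would be a template instantiation or an IaC run, but the shape of the workflow is the same: save once, clone on demand.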

Your goal is to re-create an accurate technical representation of the original on-prem system of record but have it running in the cloud. All the VMs, networks, storage, and data on the volumes should be included. I have written about this in more detail in “Resetting QA Test Data: The Cloud Way.” The collection of all technical infrastructure that makes up a representation of an application is called an “environment.”

Your goal is to have multiple environments – like PRODUCTION, PRE-PROD, QA#1, QA#2, and others called “CHAOS#1”, and maybe “CHAOS#2”, etc. – running simultaneously. Almost all clouds provide some form of “environment isolation” that provides a mechanism for a group of servers to re-use the same IP addresses and hostnames as other server groups without having the IP address space collide.

Duplicating IP address space is typically very difficult to do on-prem, so the temptation is to “Re-IP” (re-assign IP addresses and hostnames) the servers so they do not collide with the originals. The downside of this approach is that you have now fundamentally changed the representation of the original system. So your chaos testing might end up producing incorrect results due to hostnames and IP addresses not matching the original application.

To achieve maximum value from your chaos testing, you should recreate the same RFC-1918 address spaces that you are using on-prem in the cloud. What if the chaos system you have built in the cloud needs to talk to other applications back on-prem? All of the major cloud services have some form of network address translation (NAT) technique. This allows for each environment in the cloud that might be using duplicate address spaces to communicate back to other resources on-prem in a way that prevents the IP addresses from colliding.
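The address arithmetic behind this style of NAT can be sketched with Python’s standard `ipaddress` module. The specific inside and outside ranges below are hypothetical, and a real setup would do this in your cloud’s NAT gateway or routing rules rather than in application code; the sketch just shows how two environments reusing the same RFC 1918 range can each be reachable through a distinct outside range:

```python
import ipaddress

def nat_map(inside_cidr, outside_cidr, inside_ip):
    """Translate an address from an overlapping RFC 1918 range to a unique
    per-environment NAT range by preserving the host offset."""
    inside = ipaddress.ip_network(inside_cidr)
    outside = ipaddress.ip_network(outside_cidr)
    offset = int(ipaddress.ip_address(inside_ip)) - int(inside.network_address)
    return str(ipaddress.ip_address(int(outside.network_address) + offset))

# Both CHAOS#1 and CHAOS#2 reuse 10.0.1.0/24 internally, but each is
# exposed to the on-prem network through its own (hypothetical) NAT range.
print(nat_map("10.0.1.0/24", "172.16.1.0/24", "10.0.1.10"))  # 172.16.1.10
print(nat_map("10.0.1.0/24", "172.16.2.0/24", "10.0.1.10"))  # 172.16.2.10
```

On-prem systems address each clone through its outside range, while the clone itself keeps the production addresses unchanged.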

The final ingredient is to have a way to “save” or “re-create” the cloud implementation using automation or an environment-saving technique. One of the reasons to use the cloud for chaos testing of on-prem applications is so that you can do a “fast reset” of the entire system in between test runs. If you run a destructive test on your cloud-based environment, you do not want to spend days or weeks resetting everything for another test run. Your goal should be to reset or re-create the system in the cloud within minutes or hours. That way, you can run multiple chaos test scenarios as fast as possible. As mentioned before, I have covered how you might reset application test data in between test runs in more detail in “Resetting QA Test Data: The Cloud Way.”

The Chaos Workflow

And so your final chaos workflow is:

  • Import your on-prem environment into the cloud. This will be the longest part of the initial process. Different clouds have different capabilities for bringing content such as VMware or AIX images in from on-prem or even restoring systems like IBMi from a backup.
  • Once you have a working application, you need to “save it” somehow so you can re-create on-demand clones in a short time.

Then your actual test workflow is:

  1. Deploy a clone of the application from your Template (or scripts).
  2. Run your “chaos” tests and collect your results.
  3. Once your tests are done, completely delete your test environment.
  4. When you are ready for the next test, go to step #1.
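The four-step loop above can be sketched as a simple orchestration script. The deploy, test, and delete functions here are hypothetical stubs standing in for your cloud tooling and chaos test harness:

```python
def deploy_clone(template_name):
    # Stub: in practice this invokes your cloud's template or IaC tooling.
    return {"name": f"{template_name}-clone", "status": "running"}

def run_chaos_tests(env):
    # Stub: inject failures, observe behavior, collect results.
    return {"env": env["name"], "scenarios_run": 3, "failures_found": 1}

def delete_environment(env):
    # Stub: tear the whole environment down; a fresh clone replaces repair work.
    env["status"] = "deleted"

results = []
for _ in range(2):                       # step 4: loop back to step 1
    env = deploy_clone("PROD-TWIN")      # step 1: deploy a clone
    results.append(run_chaos_tests(env)) # step 2: run tests, collect results
    delete_environment(env)              # step 3: delete the environment

print(results)
```

Because step 3 is a full delete rather than a repair, each test round starts from an identical, known-good state.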

What seemed like magic is really a straightforward and practical way to improve the quality and reliability of on-prem applications that will never run in production in the cloud. Using a lift-and-shift model to make cloud-based copies that look and act like the traditional on-prem applications, you can hack on them to your heart’s content. The cloud provides an on-demand sandbox where you can create and destroy things and then quickly recover, without risking your original systems. This concept works for original application systems of record, disaster recovery systems, and even software development pipelines. While it started out as a cloud-native approach to improving system resilience, chaos engineering turns out to be great for traditional applications as well.
