Making Microservices More Resilient with Chaos Engineering

Original: https://www.nginx.com/blog/microservices-more-resilient-chaos-engineering/

Microservices have become a very popular pattern for teams that develop and deploy services. Using microservices gives developers a smaller, more focused codebase to work with, and more independence in when and how they deploy their service. These are big advantages over using a monolith.

There is no such thing as a free lunch, however. Complexity doesn’t disappear when you transition from monolith to microservices – it just shifts around a bit. Development of an individual microservice is easier because of the smaller codebase, but operating microservices in production can become exponentially more complex. There are likely many more hosts and/or containers running in a system built with microservices – more load balancers, more firewall rules, etc. You might be using NGINX for different purposes (web serving, reverse proxying, load balancing) for different microservices. As you grow from tens to hundreds or even thousands of services, it becomes harder to understand the system and predict its behavior. Moreover, the services are all communicating with each other over the network instead of with inter‑module calls within the monolith.

How can we validate that our microservices‑based system not only operates as designed under normal conditions, but can also handle unexpected failures or performance degradation in its environment? One great way is Chaos Engineering. Chaos Engineering is a practice that can help your team better manage the applications you have running in production, and make your system more resilient.

What Is Chaos Engineering?

We define Chaos Engineering as thoughtful, planned experiments designed to reveal weaknesses in our systems. A metaphor we often use is vaccination, where a potentially harmful agent is injected into the body for the purpose of preventing future infections. In a Chaos Engineering experiment we actively “inject failure” into our systems to test their resilience. We carry out these experiments using the scientific method: form a hypothesis, carry out the experiment, and see if it validates the hypothesis or not.

Some of the types of failure we can inject into a system include shutting down hosts or containers, adding CPU load or memory pressure, and adding network latency or packet loss. There are others as well, but this gives you an idea of the kinds of things we can do in an experiment.

Forming a Hypothesis

The first step in performing a Chaos Engineering experiment is forming our hypothesis. The hypothesis describes the impact we expect on the system from the failures we inject. Keep in mind that we are trying to test the resiliency of our systems. Generally our hypothesis is that the system will be resilient to the types of failures we inject. Sometimes we find that our hypothesis is not correct, though, and we can use what we learn to improve the system’s resilience.

For example, say that we have a stateless HTTP service running on NGINX that exposes a REST API to some of our other services. We’re running an instance of this service on 10 hosts in our production environment, because that’s how many are required to handle the current load without maxing out the CPU on each host. We can actively test whether we have built in enough redundancy by purposely taking down a host. In this case our hypothesis is “The system is resilient to the failure of a host – there will be no impact on other services or the people using the system.” Then we can perform the experiment to see if our hypothesis is correct or not.
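To make that concrete, the load balancing for a setup like this might look something like the NGINX configuration sketched below; the hostnames and port are hypothetical. With passive health checks, NGINX stops sending traffic to a host that fails and spreads its share across the remaining nine, which is exactly the behavior our experiment is meant to verify.

```nginx
# Minimal sketch; hostnames and ports are hypothetical.
upstream rest_api {
    # Passive health checks: after 3 failed attempts within 30s,
    # NGINX stops sending traffic to that host for 30s.
    server api01.example.com:8080 max_fails=3 fail_timeout=30s;
    server api02.example.com:8080 max_fails=3 fail_timeout=30s;
    # ... api03 through api10 ...
}

server {
    listen 80;

    location / {
        proxy_pass http://rest_api;
        # If one upstream host fails or times out, retry the request
        # on another host in the pool.
        proxy_next_upstream error timeout http_502 http_503;
    }
}
```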

Blast Radius, Magnitude, and Abort Conditions

Three important concepts to keep in mind when planning experiments are blast radius, magnitude, and abort conditions. Let’s take a closer look at what they are.

Blast radius is the proportion of hosts (or containers) we run the experiment on. This is a very important concept because we need to minimize the potential impact of the experiments on users, even in non‑production environments. The idea is that we start with a small blast radius (like one host or container), and then increase the blast radius as we learn more and get comfortable with the experiment.

Magnitude is the amount of stress or disruption we apply to the individual hosts or containers. For example, if we’re testing the effect of a CPU attack against a web server running NGINX, we might start off by adding 20% more CPU load (the magnitude) and increase that over time. We can observe the effect of increasing CPU load on service metrics like response time to determine how large an attack the system can withstand before performance becomes unacceptable.
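To give a sense of what dialing up magnitude looks like in practice, here is a rough Python sketch of a tunable CPU attack. It is a simplified stand-in for a purpose-built tool (such as Gremlin or stress-ng): it busy-loops for a configurable fraction of each interval on a single core, so the load percentage is simply a parameter you increase between runs.

```python
# Minimal sketch of a tunable CPU "attack". Loads a single core only;
# run one process per core to scale the magnitude further.
import argparse
import time

def burn_cpu(load_percent: int, duration_s: int, interval_s: float = 0.1) -> None:
    """Busy-loop for load_percent of each interval, then idle for the rest."""
    busy = interval_s * (load_percent / 100.0)
    end = time.monotonic() + duration_s
    while time.monotonic() < end:
        start = time.monotonic()
        while time.monotonic() - start < busy:
            pass                                   # burn cycles
        time.sleep(max(0.0, interval_s - busy))    # idle for the remainder

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--load", type=int, default=20)      # start small: 20%
    parser.add_argument("--duration", type=int, default=60)  # seconds
    args = parser.parse_args()
    burn_cpu(args.load, args.duration)
```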

Abort conditions are the conditions that cause us to halt the experiment. It’s good to know in advance what kind (or amount) of impact on the system would make the experiment too disruptive to continue. That might be an increase in your error rate or latency, or perhaps a certain alert generated by your monitoring software. You can define the abort conditions however you want, and those definitions may vary from experiment to experiment.
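Abort conditions can also be checked automatically. The sketch below assumes a hypothetical metrics endpoint that reports the current error rate and 99th‑percentile latency as JSON, and a hypothetical stop_experiment() hook that halts the attack; the thresholds are examples you would tune for your own service.

```python
# Minimal sketch of automated abort conditions. The metrics endpoint and
# stop_experiment() hook are hypothetical; thresholds are examples.
import time
import requests

ERROR_RATE_LIMIT = 0.05      # abort if more than 5% of requests fail
P99_LATENCY_LIMIT_MS = 500   # abort if p99 latency exceeds 500 ms

def abort_conditions_met(metrics_url: str) -> bool:
    metrics = requests.get(metrics_url, timeout=2).json()
    return (
        metrics["error_rate"] > ERROR_RATE_LIMIT
        or metrics["p99_latency_ms"] > P99_LATENCY_LIMIT_MS
    )

def watch(metrics_url: str, stop_experiment, poll_interval_s: int = 10) -> None:
    while True:
        if abort_conditions_met(metrics_url):
            stop_experiment()    # halt the attack and let the system recover
            return
        time.sleep(poll_interval_s)
```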

Blast radius, magnitude, and abort conditions allow us to perform Chaos Engineering experiments safely. It’s important to always keep the users of the system in mind as we plan Chaos Engineering experiments, so we don’t impact them negatively. They are the reason we work to make the system more resilient and deliver a better user experience.

Verifying Dependencies with Blackhole Attacks

One of the types of complexity that increases when you move from a monolith to microservices is additional dependencies. Instead of a monolithic application which contains all of the business logic for the system, you now have multiple services that depend on each other. Your microservices may also depend on other external services, like APIs from your cloud provider, or SaaS services you use as part of your infrastructure.

What happens when those external or internal dependencies fail? Do the safeguards you’ve placed in your code actually mitigate those failures? Are things like your timeout and retry logic tuned well for how your system is actually operating in production?

Blackhole attacks are a great way to test whether you can deal with failed dependencies. A blackhole attack blocks a host or container’s access to specific hostnames, IP addresses, and/or ports, to simulate what would happen if that resource was unavailable. This is a great way to simulate network‑ or firewall‑related outages, as well as network partitions.

In the case of an external dependency, let’s imagine that we’re operating a service that sends SMS messages to customers using the Twilio API. We know that our communications with Twilio might be interrupted at any time, so we have designed our microservice to read messages from a queue, and to delete them from the queue only after they are successfully sent to Twilio via the Twilio API. If the Twilio API is unavailable, the messages queue up on our end (possibly using a message bus like Kafka or ActiveMQ), and they will eventually be sent once communication with Twilio resumes. Sounds great, right?
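A minimal sketch of that consume-and-acknowledge pattern, assuming kafka-python for the queue and the official Twilio Python client, might look like this (the topic name, broker address, credentials, and phone numbers are all hypothetical):

```python
# Minimal sketch: delete a message from the queue only after Twilio
# accepts it. Topic, broker, credentials, and numbers are hypothetical.
from kafka import KafkaConsumer
from twilio.rest import Client

consumer = KafkaConsumer(
    "sms-outbound",                    # hypothetical topic of pending messages
    bootstrap_servers="kafka:9092",
    group_id="sms-sender",
    enable_auto_commit=False,          # commit (i.e. "delete") offsets manually
)
twilio = Client("ACCOUNT_SID", "AUTH_TOKEN")

for message in consumer:
    try:
        twilio.messages.create(
            to=message.key.decode(),   # destination phone number
            from_="+15550001111",
            body=message.value.decode(),
        )
    except Exception:
        # Twilio is unreachable or the send failed: don't commit, so the
        # message stays on the queue. A real service would back off and
        # retry rather than exit.
        break
    consumer.commit()                  # acknowledge only after a successful send
```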

But how do we know how our service actually behaves when its network connection to Twilio is severed, until we actually test it? How do we know that what we drew up on the whiteboard when we designed the service is how it actually operates in production?

By running a blackhole attack where we block that service’s access to the Twilio API, we can see how it actually behaves. This can help us answer a lot of questions, like: Do the messages queue properly? Is the timeout we’ve set appropriate? Does the service continue to perform well as the message queue grows? Instead of looking at the code and making educated guesses at the answers to these questions, we can actually inject that failure and see what happens.
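If you want a feel for the mechanics, a very rough approximation of a blackhole attack on a single Linux host is an iptables rule that drops outbound traffic to the dependency, as in the sketch below. It assumes root privileges; a purpose-built tool such as Gremlin does this more safely, including cleaning up if the process dies.

```python
# Minimal sketch of a blackhole attack on Linux using iptables (root required).
# Note: iptables resolves the hostname once, when the rule is inserted.
import subprocess
import time

def blackhole(target_host: str, port: int, duration_s: int) -> None:
    rule = ["OUTPUT", "-p", "tcp", "-d", target_host,
            "--dport", str(port), "-j", "DROP"]
    subprocess.run(["iptables", "-A", *rule], check=True)      # start dropping packets
    try:
        time.sleep(duration_s)
    finally:
        subprocess.run(["iptables", "-D", *rule], check=True)  # always clean up

if __name__ == "__main__":
    # e.g. simulate losing our connection to the Twilio API for two minutes
    blackhole("api.twilio.com", 443, 120)
```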

In this case our hypothesis might be, “Messages queue correctly while the network connection is down, and are delivered properly when it resumes.” We either prove or disprove that hypothesis when we run the blackhole attack. If we disprove it, we likely learn some things that will help make our system more resilient.

A blackhole attack can also be used to verify what happens when internal dependencies fail, and to discover hidden dependencies. Hidden dependencies are a common problem, and it’s great to discover them before they cause an incident. A hidden dependency happens when someone adds a new dependency on a service, but it’s not documented or communicated well within the organization. For example: Service A is updated so that it now depends on Service B, but the team operating Service B isn’t aware of that. They take Service B down for maintenance, and suddenly there’s an unexpected outage for Service A. If Service A is a critical service like your login service, or one that lets customers buy things, that could be a costly outage. This is not an uncommon issue for teams to have, as mapping and visualizing service dependencies can be difficult.

By periodically running blackhole attacks on your services, you can surface these hidden dependencies, so that the teams involved are aware of them. You can also get the other benefits we discussed with external dependencies. You can see how your service responds when a service that it depends on becomes unreachable, if your timeouts and retries are configured appropriately, etc. How does your microservices‑based distributed system deal with network partitions? Blackhole attacks are a great way to find out.

Conclusion

We’ve defined Chaos Engineering and shown how it can help you build a more resilient microservices architecture. We also discussed using blackhole attacks to see how services respond to external and internal dependency failures.

Blackhole attacks are great for seeing how resilient your service is to dependency failures, but there are also other very useful experiments you can do in a microservices environment. Shutting down hosts, adding latency or packet loss, breaking DNS resolution, and adding CPU or memory pressure are all great things you can do to test the resilience of your microservices. For more ideas of Chaos Engineering experiments you can perform on your microservices, check out our tutorials on the Gremlin Community page.

Want to try Chaos Engineering with NGINX Plus? Start your free 30-day trial today or contact us to discuss your use cases.

Retrieved by Nick Shadrin from nginx.com website.