Here’s Why Microservices Desperately Need Service Mesh Anomaly Detection


In this article, Yuval Dror, Director of Engineering and Head of DevOps at Anodot, talks about the rise of microservices, a service-oriented architecture (SOA) style of software development, and the need for AI-based anomaly detection solutions to detect incidents in real time and reliably reduce time to resolution.

The Rise of Microservices

Many companies today have adopted the new norm of rapid iteration in software development and now live by Mark Zuckerberg’s famous motto: “move fast and break things.” This mentality has led to the growth in popularity of a service-oriented architecture (SOA) approach to software design.

Microservices, one variant of the SOA approach, break up the application into smaller, specialized parts. The approach offers several advantages, such as reduced risk, faster deployment, and scalability, but it also brings its own set of unique challenges.

As software development teams are often deploying tens, hundreds, or even thousands of features each day, one of the main operational challenges with microservices is to make sure that new features do not break anything within a microservice and, more importantly, that a change to one microservice does not break other, dependent microservices.

In this article, we’ll discuss one of the technologies used to address this complexity: anomaly detection for service mesh.

What is Service Mesh?

Service-oriented architectures require dedicated tools that control service-to-service communication. In particular, as network communication between microservices grows in scale and complexity, it becomes impossible to manually manage deployments, troubleshoot issues, and maintain cluster security. Service mesh technologies give you an additional layer of insight: they improve observability, traffic management, and deployment management, and they enhance security within the mesh. Many tools and standards have been created to address this complexity, such as:

OpenTelemetry: OpenTelemetry describes itself as an open-source observability framework. In particular, it provides a single set of APIs, libraries, agents, and collector services to capture distributed traces and metrics from your application.

Envoy Proxy: Originally built at Lyft, Envoy is an open-source edge and service proxy designed specifically for cloud-native applications. It sets out to solve two of the main issues with microservices that we’ve discussed: networking and observability.

Prometheus: Prometheus is another open-source solution, used for event monitoring and alerting. It collects real-time metrics from configured targets, evaluates rule expressions, displays results, and can trigger alerts.

Drawbacks of the Service Mesh Monitoring Paradigm

One of the main issues with service mesh monitoring tools is that when you have a large number of microservices, keeping watch over all of them by hand becomes unrealistic and impractical.

In the current paradigm of service mesh monitoring, the tools have some components that are responsible for meeting the service level agreement. For example, the service mesh Istio collects the following types of measurements in order to provide overall service mesh observability:

Metrics: Istio generates these from the Envoy proxy statistics, based on what it calls the four “golden signals” of monitoring (latency, traffic, errors, and saturation).

Distributed Traces: Istio also generates distributed trace spans for each service.
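
To make the golden signals concrete, here is a minimal Python sketch that derives three of them (traffic, errors, and latency) from a batch of request records. The Request fields, the service name, and the sample values are assumptions made up for illustration and are not Istio's actual metric schema. Saturation is left out because it depends on resource data that request records alone do not carry.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class Request:
    """One proxied request; field names are illustrative, not Istio's schema."""
    service: str
    latency_ms: float
    status: int


def golden_signals(requests, window_seconds):
    """Summarize traffic, error rate, and p95 latency for one time window."""
    if not requests:
        return {"traffic_rps": 0.0, "error_rate": 0.0, "p95_latency_ms": 0.0}
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank percentile: the value at 1-indexed position ceil(0.95 * n).
    rank = ceil(95 * len(latencies) / 100)
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "p95_latency_ms": latencies[rank - 1],
    }


# Hypothetical sample: 20 requests to one service over a 10-second window.
sample = [Request("checkout", 10.0 * (i + 1), 200) for i in range(18)]
sample += [Request("checkout", 900.0, 503), Request("checkout", 950.0, 500)]
signals = golden_signals(sample, window_seconds=10)
print(signals)
```

Each number here is easy to chart for a single service. The difficulty the rest of this article describes comes from repeating this for every service and every signal.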

Open source projects like Istio are very useful at collecting metrics that allow developers to create dashboards. This process works well if you’re dealing with a smaller application and there are dedicated teams monitoring and adjusting alerts. If you’re working on a project with large-scale deployment, however, these manual processes are much less effective.

Without the ability to visually monitor multiple clusters, service mesh technologies need to go beyond “observing” and move towards automated anomaly detection.


Anomaly Detection for Service Mesh

Anomaly detection that employs machine learning has many benefits over traditional monitoring methods, such as automatically learning the behavioral patterns of each new microservice and automatically sending alerts when significant changes are detected. These features lower the time it takes to detect anomalies and help prevent further disruption.

AI-based anomaly detection integrates with the service mesh as a whole in order to track high-level KPIs as well as the most granular signals from each microservice.
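
As a rough illustration of learning a baseline per microservice, the Python sketch below keeps a rolling window of recent values for each (service, metric) pair and flags values that deviate strongly from it. A plain rolling z-score stands in here for the machine learning models a commercial solution would use, and it does not model seasonality. The class name, the parameters, and the sample series are all assumptions for illustration.

```python
from collections import deque
from statistics import mean, stdev


class RollingBaseline:
    """Learns a baseline from recent samples and flags strong deviations."""

    def __init__(self, window=30, threshold=3.0, min_samples=10):
        self.history = deque(maxlen=window)
        self.threshold = threshold
        self.min_samples = min_samples

    def observe(self, value):
        """Return True if value deviates strongly from the learned baseline."""
        anomalous = False
        if len(self.history) >= self.min_samples:
            mu = mean(self.history)
            sigma = stdev(self.history)
            if sigma > 0:
                anomalous = abs(value - mu) / sigma > self.threshold
            else:
                # Flat history: any change at all counts as a deviation.
                anomalous = value != mu
        # Only normal points update the baseline, so a spike does not
        # immediately become the "new normal".
        if not anomalous:
            self.history.append(value)
        return anomalous


# One detector per (service, metric) pair, created on first sight.
detectors = {}


def check(service, metric, value):
    key = (service, metric)
    if key not in detectors:
        detectors[key] = RollingBaseline()
    return detectors[key].observe(value)


# Hypothetical latency series in ms: steady traffic, then a spike.
series = [100, 102, 98, 101, 99, 103, 97, 100, 102, 98, 101, 99, 250]
flags = [check("checkout", "latency_ms", v) for v in series]
print(flags)
```

Because detectors are created the first time a service appears, a newly deployed microservice gets its own baseline without anyone configuring a threshold for it.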

Anomaly detection for service mesh monitoring is still an emerging field, but if you’re reviewing the available solutions, here are a few considerations to keep in mind:

Fully Autonomous: As mentioned, the service mesh of a large-scale deployment is impossible to monitor manually, so the first consideration is whether the solution can independently track and learn from data in real time.

False Positive Rate: Next, look for a solution with a low false-positive rate, because a high rate creates unnecessary noise and leads to alert fatigue.

Correlation: Finally, an AI-based anomaly detection solution should be able to automatically learn the topology of the mesh and connect the dots.

Let’s now look at a real-world example of service mesh monitoring and anomaly detection.

In the screenshot below, you can see how Istio and an anomaly detection solution work together to shorten the time to detect and help prevent disruptions caused by significant changes. In particular, we see that the anomaly detection solution takes time-series data as its input and learns the behavioral and seasonal patterns over time.

Source: Anodot

In the example above we see that there is a “latency issue detected” on one microservice, which is denoted with a (1). This latency issue then causes another microservice to queue the error message “http errors detected”, which is denoted with a (2). As soon as this anomaly is detected, the solution sends an automatic alert to the client through email, Slack, or a webhook.

With an anomaly detection solution, you not only get alerted about critical incidents but can also see a chronological list of correlated anomalies. This means you can easily trace back to the root of the anomaly to ensure it doesn’t happen again.
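
The idea of tracing related anomalies back to a root can be sketched in Python as follows. This is a simplified illustration, not how any particular product works: it groups anomalies that occur close together in time and are linked by a known call relationship, then treats the earliest one in each group as the probable starting point. The service names, the calls map, and the 120-second window are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Anomaly:
    service: str
    description: str
    timestamp: float  # seconds since some reference point


def _related(later, earlier, calls, window_seconds):
    close_in_time = later.timestamp - earlier.timestamp <= window_seconds
    same_service = later.service == earlier.service
    # A problem in a dependency tends to surface in the services that call it.
    depends_on = earlier.service in calls.get(later.service, set())
    return close_in_time and (same_service or depends_on)


def correlate(anomalies, calls, window_seconds=120):
    """Group anomalies that are close in time and linked by a call edge.

    `calls` maps a caller service to the set of services it depends on.
    Returns a list of incidents, each a chronologically ordered list.
    """
    incidents = []
    for anomaly in sorted(anomalies, key=lambda a: a.timestamp):
        for incident in incidents:
            if any(_related(anomaly, earlier, calls, window_seconds)
                   for earlier in incident):
                incident.append(anomaly)
                break
        else:
            # No existing incident matched, so this anomaly starts a new one.
            incidents.append([anomaly])
    return incidents


# Hypothetical topology and anomalies, mirroring the example above.
calls = {"frontend": {"checkout"}}
anomalies = [
    Anomaly("frontend", "http errors detected", 160.0),
    Anomaly("checkout", "latency issue detected", 100.0),
    Anomaly("search", "latency issue detected", 5000.0),
]
incidents = correlate(anomalies, calls)
for incident in incidents:
    root = incident[0]
    print(f"probable root: {root.service} ({root.description}), "
          f"{len(incident)} related anomalies")
```

Here the latency issue and the HTTP errors end up in one incident with the latency issue listed first, while the unrelated anomaly an hour later forms its own incident.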


Anomaly Detection for Service Mesh

As we’ve discussed, service mesh monitoring has become an essential part of managing microservices, as it provides insights into service-to-service communication. As the deployment of microservices starts to grow, however, manual observability becomes increasingly impractical.

Pairing service mesh technologies with an AI-based anomaly detection solution addresses this challenge by enabling you to detect incidents in real time and reliably reduce your time to resolution.
