Four Ways You Can Monitor Large Volumes of Operations Data


One of the main challenges for organizations of all sizes is monitoring the large volumes of data their infrastructure is generating. Arijit lists four best practices for monitoring large volumes of operations data.

As enterprises undergo digital transformation and rapidly update their infrastructure, it’s essential that they adopt best practices for monitoring their entire environment. The modern software stack creates volumes of operational data that exceed the capabilities of traditional tools; an approach to monitoring that uncovers important patterns in real time and at any scale is the only way to keep up in an always-on, always-connected digital world.

Over the last few years, infrastructure and software development practices have rapidly evolved. Enterprises are adopting DevOps practices as they undergo digital transformation, which enables them to roll out applications to customers faster and more cost-efficiently. One of the main challenges for organizations of all sizes is monitoring the large volumes of data their infrastructure is generating. Below are four best practices for monitoring.

1. Consider the Number of Data Points and Time Series to Monitor

A common mistake organizations make is simply multiplying the number of servers they have by a factor – such as 100 metrics per server – to estimate the volume of data their monitoring system should be able to handle. Today’s environments have far more monitored resources, and while microservices and right-sized instances promote component isolation, they also increase the number of metrics emitted.

The democratization of scale-out architectures makes it difficult to anticipate the number of instances in the environment at any given time, and running multiple containers on top of each instance multiplies the number of time series being reported. In addition, enterprises instrument metrics that often have nothing to do with individual compute instances and instead capture signals critical to the overall health of the application or business. These include per-customer metrics and business KPIs, such as orders or queries, which amplify the number of data points that monitoring systems ingest well beyond the raw instance count.

It’s also important to consider the number of time series in addition to the number of data points. Trying to render an excessively large number of time series directly affects the responsiveness and efficiency of your monitoring system. Even small organizations can have a large time series footprint, so it’s important for every company to carefully select and build a metrics system suited to its specific environment and use cases.
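As a rough illustration of how quickly these factors multiply, here is a minimal back-of-envelope sketch in Python; every count in it is a hypothetical placeholder rather than a figure from the article.

```python
# Back-of-envelope estimate of time series count and ingest rate.
# All input numbers below are hypothetical placeholders.

instances = 200               # compute instances (autoscaling makes this a moving target)
containers_per_instance = 8
metrics_per_container = 100   # the naive "100 metrics per server" factor
customers = 500               # per-customer business metrics multiply cardinality further
business_metrics_per_customer = 5

infra_series = instances * containers_per_instance * metrics_per_container
business_series = customers * business_metrics_per_customer
total_series = infra_series + business_series

resolution_seconds = 10       # one data point per series every 10 seconds
data_points_per_second = total_series / resolution_seconds

print(f"time series: {total_series:,}")
print(f"ingest rate: {data_points_per_second:,.0f} data points/second")
```

Even with these modest placeholder numbers, the business metrics and per-container series push the total far beyond what a simple "servers times 100" estimate would suggest.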

2. When Tracking Performance Over Time, Optimize Around Churn Instead of Preventing It

Organizations that track and examine historical data for trends unlock a number of valuable use cases. Doing so allows them to plan future capacity, determine whether the company is on track to meet its KPIs, and be alerted when a trend is anomalous so an issue can be resolved immediately. However, large-scale historical analysis of metrics is harder because analyzing performance over longer periods of time means querying and processing more data.
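As one example of the capacity-planning use case, here is a minimal trend-fit sketch; the historical usage values are synthetic and purely illustrative.

```python
# Minimal trend-fit sketch for capacity planning over historical metrics.
# The usage figures are synthetic daily averages, not real data.
import numpy as np

days = np.arange(90)                                          # 90 days of history
usage = 40 + 0.35 * days + np.random.normal(0, 2, size=90)    # % disk used, synthetic

slope, intercept = np.polyfit(days, usage, 1)                 # fit a linear trend
current = intercept + slope * days[-1]
days_until_full = (100 - current) / slope

print(f"growth: {slope:.2f}% per day, roughly {days_until_full:.0f} days until capacity")
```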

In addition, churn increases the number of unique time series that need to be processed and is especially common in containerized or auto-scaling infrastructure. Churn occurs when a time series is replaced by a different but equivalent one, such as when a container restarts and reports the same metric under a new identifier. Your dev team might not notice a negative impact at first, but as replaced series accumulate, the system gets bogged down. Preventing churn doesn’t work because many teams don’t care about details such as software versions, and it’s difficult to enforce churn-free data reporting practices across an entire organization.
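To make churn concrete, here is a minimal sketch assuming metrics are identified by label sets; the label names and values are hypothetical.

```python
# Minimal illustration of churn: the "same" metric becomes a new time series
# each time a container restarts under a new ID. Labels are hypothetical.

series_before = {"metric": "http_requests_total", "service": "checkout", "pod": "checkout-7f9c"}
series_after  = {"metric": "http_requests_total", "service": "checkout", "pod": "checkout-b2d1"}

# The two label sets differ, so a metrics store treats them as distinct series,
# even though they describe the same logical signal.
print(series_before != series_after)   # True: the unique-series count grows on every restart
```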

Strategies for optimizing around churn include caching queries, pre-computing service aggregates and pre-computing time rollups. Here’s how they function: query caching quickly answers repeated queries; pre-computed service aggregates reduce the number of time series queried for common use cases; and pre-computed time rollups reduce the number of data points that need to be processed.
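Here is a minimal sketch of the latter two strategies, assuming raw per-pod samples stored in a pandas DataFrame; the column names and values are hypothetical.

```python
# Minimal sketch of pre-computed service aggregates and time rollups.
# The DataFrame layout and values are hypothetical.
import pandas as pd

raw = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-01 00:00:05", "2024-01-01 00:00:05",
                                 "2024-01-01 00:00:15", "2024-01-01 00:00:15"]),
    "service": ["checkout", "checkout", "checkout", "checkout"],
    "pod": ["checkout-7f9c", "checkout-b2d1", "checkout-7f9c", "checkout-b2d1"],
    "requests_per_sec": [120, 95, 130, 101],
})

# Service aggregate: collapse the churn-prone pod label into one stable series per service.
service_level = raw.groupby(["timestamp", "service"], as_index=False)["requests_per_sec"].sum()

# Time rollup: downsample the aggregate to 1-minute resolution to cut data points.
rollup = (service_level.set_index("timestamp")
          .groupby("service")["requests_per_sec"]
          .resample("1min").mean()
          .reset_index())

print(rollup)
```

Queries for the common "requests per service" view then hit the small, stable rollup instead of every churned per-pod series.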

3. To Get a Holistic View of the Entire Infrastructure, Monitoring Requires Scalable Analytics

Different levels of an organization often have different priorities for monitoring and desire separate metrics. For example, a DevOps engineer might want to gather metrics on the performance of a specific service while a C-level executive would want to know about the overall application and business health. It would be disruptive and inefficient to use a different point solution for each of these monitoring use cases.

Scalable analytics enables companies to easily uncover insights at any organizational level. Scaling out enables a DevOps engineer to monitor 10,000 servers as easily and quickly as 10. Joining disparate data sets and chaining analytics let you look up, down and across your environment, combining partial results and building them up into the final value you want to monitor.
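To illustrate that chaining, here is a minimal sketch that rolls per-instance partial results up into service-level and overall views; the service names and numbers are hypothetical.

```python
# Minimal sketch of analytics chaining: combine per-instance partial results
# into service-level and business-level views. Names and numbers are hypothetical.
from collections import defaultdict

# Per-instance request and error counts (the partial results).
instance_stats = [
    {"service": "checkout", "instance": "i-01", "requests": 1200, "errors": 6},
    {"service": "checkout", "instance": "i-02", "requests": 1100, "errors": 4},
    {"service": "search",   "instance": "i-03", "requests": 5000, "errors": 50},
]

# Step 1: roll the partial results up to one aggregate per service (the DevOps view).
per_service = defaultdict(lambda: {"requests": 0, "errors": 0})
for s in instance_stats:
    per_service[s["service"]]["requests"] += s["requests"]
    per_service[s["service"]]["errors"] += s["errors"]

for service, agg in per_service.items():
    print(f"{service}: {agg['errors'] / agg['requests']:.2%} error rate")

# Step 2: chain a second analytic on top for the executive-level view.
overall = (sum(a["errors"] for a in per_service.values())
           / sum(a["requests"] for a in per_service.values()))
print(f"overall: {overall:.2%} error rate")
```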

4. It’s Vital That the Monitoring System Provides Results in Real Time

Every organization should be able to flag and fix issues throughout their infrastructure before they impact customers. The ephemeral nature of environments driven by cloud infrastructure and containers, along with the added complexity of microservices, has made it less straightforward to identify and address issues. The process of sampling raw metrics, analyzing them and then sending alerts to a human operator can greatly lengthen time-to-resolution.

To produce faster results, it’s critical to collect metrics at a high data resolution, where resolution is the time interval between successive data points. Metrics can arrive at intervals as short as one second or as long as five minutes or more; the primary factor in selecting the resolution that works best for your business is the incoming rate of the data itself.
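One practical way to choose a resolution is to look at how often data actually arrives; the sketch below infers the native interval from a stream’s timestamps, which are hypothetical values.

```python
# Minimal sketch: infer a sensible data resolution from observed arrival times.
# The timestamps below are hypothetical epoch seconds.
import statistics

arrival_times = [0.0, 1.1, 2.0, 3.2, 4.0, 5.1, 6.0]   # roughly one point per second
intervals = [b - a for a, b in zip(arrival_times, arrival_times[1:])]

native_resolution = statistics.median(intervals)
print(f"median inter-arrival interval: {native_resolution:.1f}s")
# Rolling up much more coarsely than this discards detail; collecting much more
# finely than the data arrives adds cost without adding signal.
```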

In addition to data resolution, analytics and alerting must also be fast. Analytics needs to be applied to metrics as soon as they’re ingested to create aggregate or composite metrics that model important KPIs, so you can alert on those trends quickly. This needs to happen across the entire infrastructure. The faster your teams are alerted to trends or issues, the faster those issues can be addressed before they negatively impact service.
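As a closing illustration, here is a minimal sketch of alerting on a composite error-rate metric as points stream in; the window size, threshold and simulated data points are hypothetical.

```python
# Minimal sketch of real-time alerting on a composite metric.
# The window size, threshold and simulated stream are hypothetical.
from collections import deque

WINDOW = 5                    # number of recent points to evaluate
ERROR_RATE_THRESHOLD = 0.02   # alert when the rolling error rate exceeds 2%

recent = deque(maxlen=WINDOW)

def ingest(errors: int, requests: int) -> None:
    """Update the rolling composite metric and alert when it breaches the threshold."""
    recent.append((errors, requests))
    total_errors = sum(e for e, _ in recent)
    total_requests = sum(r for _, r in recent)
    error_rate = total_errors / max(total_requests, 1)
    if len(recent) == WINDOW and error_rate > ERROR_RATE_THRESHOLD:
        print(f"ALERT: error rate {error_rate:.2%} over the last {WINDOW} points")

# Simulated incoming data points: (errors, requests) per interval.
for errors, requests in [(2, 1000), (3, 1000), (30, 1000), (40, 1000), (45, 1000)]:
    ingest(errors, requests)
```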