AIOps and Smart Alerting


Along the way from signal to response, there are many opportunities for the application of intelligence. Within the context of artificial intelligence for IT operations, or AIOps, those signals take the form of logs, metrics, and event records.

From signal to response

The responses are typically remedial actions, whether those actions are carried out by teams of human agents, robots, or some combination of the two. The path between signal and response typically runs through monitoring systems, databases, analytical platforms, help desks, and alerting systems.

Over the last two years, vendors that supply each of these technology components have sought to add AI functionality, and some have even defined AIOps in a way that places monitoring systems and alerting systems at its definitional center. I believe, however, that these technology-specific approaches to AIOps are fundamentally mistaken.

Instead, the effective deployment of AI in IT operations contexts requires an independent platform capable of interacting with all of these technologies that make up the path between signal and response. Why? Well, the main problem with the path as it is currently constructed is that using each of the technology components adds latency and much of this latency (although not all) is the consequence of the human intervention that its use requires.

The latency of insight and decision

Human intervention itself takes two forms. Insights based on input must be arrived at and decisions based on insight must be made. The process of going from input to insight and the process from insight to decision takes time. AI, by automating both processes, promises to significantly reduce that time.

If one were to automate the processes of insight and decision at each of the technology points along the way, some latency reduction could be achieved but, necessarily, there would be much duplication of effort and perhaps even inconsistencies introduced (which would ultimately have to be resolved). By treating the path from signal to response as an integrated whole and automating the end-to-end processes of insight and decision, a much more radical and far-reaching latency reduction can be achieved.

Warping the five dimensions of AIOps

Now, artificial intelligence itself takes many forms. One can automate the processes involved in the selection of data from a tsunami of incoming signals. One can automate the discovery of patterns in the data selected. One can automate the process of drawing inferences from those patterns. One can automate the communication of the results of those inferences. Finally, one can automate the execution of remedial responses.
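To make the five dimensions concrete, here is a minimal Python sketch that chains them into a single pass from signal to response. Everything in it is an illustrative assumption rather than any vendor's implementation: the Signal type, the PLAYBOOK lookup table, the "1.5 times the mean" outlier test, and the function names are all hypothetical, and a real platform would use far richer models at every stage.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Signal:
    source: str   # hypothetical host or service name
    metric: str
    value: float


# Hypothetical lookup table: metric -> (diagnosis, remedial action).
PLAYBOOK = {"cpu_percent": ("possible runaway process", "restart_service")}


def select(signals, wanted_metrics):
    """Dimension 1, data selection: keep only the metrics of interest."""
    return [s for s in signals if s.metric in wanted_metrics]


def discover(signals, factor=1.5):
    """Dimension 2, pattern discovery: flag values well above the per-metric mean."""
    by_metric = {}
    for s in signals:
        by_metric.setdefault(s.metric, []).append(s.value)
    baseline = {m: mean(values) for m, values in by_metric.items()}
    return [s for s in signals if s.value > factor * baseline[s.metric]]


def infer(outliers):
    """Dimension 3, inference: attach a diagnosis and a proposed action."""
    findings = []
    for s in outliers:
        diagnosis, action = PLAYBOOK.get(
            s.metric, ("unknown cause", "escalate_to_human")
        )
        findings.append(
            {"source": s.source, "diagnosis": diagnosis, "action": action}
        )
    return findings


def communicate(findings):
    """Dimension 4, communication: render findings as messages."""
    return [f"[{f['source']}] {f['diagnosis']}" for f in findings]


def remediate(findings):
    """Dimension 5, remediation: list the actions to run (dry run only)."""
    return [f"{f['action']}:{f['source']}" for f in findings]


def run_pipeline(signals, wanted_metrics):
    """Run all five dimensions end to end over one batch of signals."""
    findings = infer(discover(select(signals, wanted_metrics)))
    return communicate(findings), remediate(findings)


signals = [
    Signal("web-1", "cpu_percent", 20.0),
    Signal("web-2", "cpu_percent", 25.0),
    Signal("web-3", "cpu_percent", 95.0),
    Signal("web-1", "fan_speed", 3000.0),  # dropped at the selection stage
]
messages, actions = run_pipeline(signals, {"cpu_percent"})
```

The point of the sketch is structural: each stage consumes the output of the previous one, so weakness or absence in any one dimension limits what the stages after it can do.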

One of the major ways in which technology-specific AIOps limits value is through the emphasis on some of the dimensions at the expense of the others. Vendors of monitoring systems, for example, tend to emphasize data selection and pattern discovery at the expense of inference, communication, and remediation. Database vendors, on the other hand, focus almost exclusively on pattern discovery and inference.

Analytical system vendors tend to focus on inference and some aspects of communication, help desk vendors on remediation, and alerting system vendors largely focus on communication alone. Insight and decision making, however, involve the balanced choreography of all five dimensions of AIOps.

Another way in which technology-specific AIOps tends to limit or distort value is through its tendency to overstate the significance of the technology platform on which it is based at the expense of the other technology platforms.

Smart alerting platforms as an example

As an example, let us take a deeper look at how vendors of alerting systems tend to deploy AIOps. Historically, alerting systems have been built out of two basic components: a mechanism for actually communicating messages or signals to the appropriate recipients and a mechanism for authoring and enforcing rules that guide the delivery of messages or signals of specified types to recipients with the right characteristics. In general, the signals or messages came from outside of the platform.
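The two historical components described above, a rule mechanism and a delivery mechanism, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the Rule type, the route and deliver functions, the recipient names, and the message fields are all hypothetical, and delivery is simulated by appending to a list rather than calling any real paging or email service.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    matches: Callable[[dict], bool]  # predicate over the incoming message
    recipient: str                   # hypothetical team or person
    channel: str                     # hypothetical channel, e.g. "pager"


def route(message, rules, default=("ops-oncall", "email")):
    """Rule enforcement: the first matching rule decides recipient and channel."""
    for rule in rules:
        if rule.matches(message):
            return rule.recipient, rule.channel
    return default


def deliver(message, recipient, channel, outbox):
    """Delivery: simulated here by recording what would be sent."""
    outbox.append((channel, recipient, message["text"]))


rules = [
    Rule("critical-db", lambda m: m["type"] == "database" and m["severity"] >= 4,
         recipient="db-team", channel="pager"),
    Rule("any-network", lambda m: m["type"] == "network",
         recipient="network-team", channel="chat"),
]

outbox = []
incoming = [
    {"type": "database", "severity": 5, "text": "replica lag exceeds limit"},
    {"type": "network", "severity": 2, "text": "packet loss on uplink"},
    {"type": "storage", "severity": 3, "text": "volume nearly full"},
]
for message in incoming:
    recipient, channel = route(message, rules)
    deliver(message, recipient, channel, outbox)
```

Note that the messages arrive already formed, as the text says: nothing in this design inspects raw system data, and every routing decision is fixed by whoever authored the rules.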

Between 2013 and 2016, smart alerting technology vendors began to use AI algorithms from the communications dimension to do two things:

  1. Determine an appropriate recipient of a signal or message, along with the method and path of delivery, based purely on the properties of the data constituting the signal or message; and
  2. Improve that determination of recipient, path, and method over time.
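The source does not say which algorithms these vendors used, so the following Python sketch is only a toy stand-in for the idea of improving recipient choice over time. It assumes a hypothetical feedback signal (whether the recipient acknowledged the alert) and picks the recipient with the best smoothed acknowledgment rate for each signal type. The RecipientLearner class and its method names are invented for illustration.

```python
from collections import defaultdict


class RecipientLearner:
    """Toy learner: prefer recipients who have acknowledged similar signals."""

    def __init__(self, candidates):
        self.candidates = list(candidates)
        # counts[(signal_type, recipient)] = [acknowledged, sent]
        self.counts = defaultdict(lambda: [0, 0])

    def choose(self, signal_type):
        """Pick the recipient with the highest smoothed acknowledgment rate."""
        def score(recipient):
            acked, sent = self.counts[(signal_type, recipient)]
            # Laplace smoothing: an untried recipient starts at 0.5.
            return (acked + 1) / (sent + 2)
        # On ties, max() returns the earliest candidate in the list.
        return max(self.candidates, key=score)

    def record(self, signal_type, recipient, acknowledged):
        """Update the counts after observing whether the alert was acknowledged."""
        entry = self.counts[(signal_type, recipient)]
        entry[1] += 1
        if acknowledged:
            entry[0] += 1


learner = RecipientLearner(["network-team", "db-team"])
first_choice = learner.choose("db_timeout")          # tie, so first candidate
learner.record("db_timeout", first_choice, acknowledged=False)
second_choice = learner.choose("db_timeout")         # shifts to the other team
learner.record("db_timeout", second_choice, acknowledged=True)
third_choice = learner.choose("db_timeout")
```

Even this toy version shows why such algorithms sit squarely in the communications dimension: they learn who should hear about a signal, not what the signal means.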

As interest in the application of AI to the entire range of IT operations functionality has grown significantly since 2016, these vendors have naturally attempted to extend the scope of their own ‘smart’ capabilities. This extension has taken two forms. First, they have attempted to push their data ingestion closer to the source, in effect bypassing monitoring systems, databases, and help desks, and finding ways of getting system and network data directly into the platform.

Second, they have attempted to expand the scope of their own AI to include some modicum of data selection, pattern discovery, and inference functionality as an add-on to the already deployed communications functionality.

Each form of extension has its problems. Unless the smart alerting vendor is willing to completely recreate the domain-specific knowledge that informs application, infrastructure, network, and storage monitoring platforms, its own algorithms will be working with a highly uninformative (very noisy) data set. An AIOps platform vendor can instead exploit the work already done by the monitoring vendors.

This is not only a question of technology. Enterprises, with good reason, tend to shy away from a ‘rip and replace’ approach to new functionality deployment. Instead, they try to take advantage of existing technologies both to minimize the costs of disruption and also to avoid offsetting the advantages of a new functionality by simultaneously deploying an immature version of an older functionality.

With regard to extending into other algorithmic areas, the smart alerting vendor could, of course, recreate algorithms of the sort that it deploys (keeping in mind the many years of development and significant amount of IP protection that stand behind those algorithms) or it could seek to build out new algorithms from its existing stock.

The problem, though, is that the optimization of recipient choice and path selection is very different from the selection of significant data sets, pattern discovery, and inference. Furthermore, these platforms tend to be centralized when the emergence of modular, distributed, dynamic, and ephemeral architectures is dictating that the application of such algorithms take place close to the point where data is generated.

These considerations suggest that a more reasonable approach would be to deploy an AIOps platform to act as a bridge between monitoring systems and smart alerting platforms. This would allow the enterprise looking to deploy AIOps to take full advantage of the data coming in through the monitoring systems, with its richness of domain knowledge-based selection, and at the same time ensure that this data gets appropriately analyzed and delivered to the smart alerting platform.

Enterprises have always aspired to integrate the various tasks and technologies that together constitute IT operations. The digitalization of business, however, has made that integration a necessity. AI is indeed a means of making that integration a reality, but only if it is deployed in an even-handed way across all technology silos. Any deployment overly centered on a specific technology threatens to reinforce the fragmentation of IT operations rather than mend it.