Naive Automation and Its Failure Modes


Automation is great. Arguably, the entire field of IT, ever since its inception, has been about automating away human tasks. But even today, automation can introduce errors and breed mistrust. How can we benefit from automation, both in raw savings and through the removal of interruptions and distractions, without overloading operators with a series of alerts?

Automation is great. Arguably, the entire field of IT, ever since its inception, has been about automating away human tasks. Even the word “computer” itself once meant a human being who laboriously worked through mathematical calculations by hand. To help these human computers, and to avoid errors, there were any number of tools, from slide rules to Napier’s Bones to bound books of mathematical tables.

These helped, but there was still an enormous amount of tedious human work to be done. Mechanical computers were widespread from the late 19th century, only to be gradually displaced by electronic descendants from the 1960s.

Errors could and did occur in each of these eras, but electronic computers, with their speed and inscrutability, generated a certain amount of mistrust of the results. Even today, we still see the same sort of reaction. It is pretty rare for computers to make actual mathematical errors, although there was the famous case of the Pentium floating-point division (FDIV) bug in 1994. These days, mathematical bugs are more likely to be found in software, such as Microsoft Excel.

The Intel and Microsoft bugs are implementation issues, but most failures of automation are failures to specify the problem correctly. There is the classic story of the “Find & Replace” routine which, aiming to remove rude words from the text, ended up with “clbuttic.”
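The “clbuttic” failure is easy to reproduce. The sketch below is illustrative only: the function names are hypothetical, and the word list is assumed from the well-known anecdote rather than taken from any real filter. It contrasts a naive substring replacement with one that matches whole words.

```python
import re


def naive_censor(text: str) -> str:
    # Replaces the substring wherever it occurs, even inside other words.
    return text.replace("ass", "butt")


def word_boundary_censor(text: str) -> str:
    # \b anchors the match to word boundaries, so "classic" is left alone.
    return re.sub(r"\bass\b", "butt", text)


sample = "a classic ass"
print(naive_censor(sample))          # a clbuttic butt
print(word_boundary_censor(sample))  # a classic butt
```

The second version is not a complete specification either (it ignores capitalization and plurals, for example), which is rather the point: each refinement exposes another case that nobody wrote down.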

This is what generally worries sysadmins and anyone else responsible for operating IT systems. Yes, it would be nice to automate away more of those annoying routine tasks. But it is also a drag to specify in laborious detail exactly when and how the automation may be triggered, to anticipate unexpected events that may occur during execution, and to make sure that the automation cleans up after itself, leaving a clean environment for the next run or for human operators to decipher.
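Those three concerns (when to trigger, what to do when something unexpected happens, and how to clean up) can be made explicit in code. The following is a minimal sketch under stated assumptions: `run_job`, its `precondition` callback, and the scratch-directory convention are all hypothetical names invented for illustration, not part of any particular tool.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def run_job(precondition: Callable[[], bool],
            action: Callable[[Path], None]) -> str:
    """Run `action` in a scratch directory that is always removed afterwards."""
    # Trigger condition: do nothing unless the environment looks as expected.
    if not precondition():
        return "skipped"
    workdir = Path(tempfile.mkdtemp(prefix="job-"))
    try:
        action(workdir)
        return "ok"
    except Exception as exc:
        # Unexpected event: report it instead of leaving a half-finished run.
        return f"failed: {exc}"
    finally:
        # Cleanup runs on success and on failure alike.
        shutil.rmtree(workdir, ignore_errors=True)
```

A real job would also need to undo changes made outside the scratch directory, which is usually the harder part.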

For this reason, most NOCs still have a shelf of runbooks somewhere (even if it is no longer an actual physical shelf). When something happens, operators can refer to the standard procedure and execute it by hand. Despite progress in runbook automation, people are still much better at recognizing exceptions and pausing execution before causing too much damage.

Reluctance to Automate Reduces the Effectiveness of IT

Even when compliance jobs are automated using something like Chef InSpec, if the remediation is not also automated, at least for common violations, the result is simply to overload operators with yet another source of alerts.
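One way to picture the difference is a triage step that sits between the compliance scan and the operator. The sketch below does not use the InSpec API; `Violation`, `triage`, and the rule names are hypothetical, and the scan results are assumed to arrive as a plain list.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Violation:
    rule: str
    host: str


# A remediation returns True if it fixed the violation.
Remediation = Callable[[Violation], bool]


def triage(violations: List[Violation],
           remediations: Dict[str, Remediation]
           ) -> Tuple[List[Violation], List[Violation]]:
    """Fix what has a known remediation; alert only on the remainder."""
    fixed, alerts = [], []
    for v in violations:
        fix = remediations.get(v.rule)
        if fix is not None and fix(v):
            fixed.append(v)
        else:
            alerts.append(v)
    return fixed, alerts


found = [Violation("ssh-root-login", "web-1"),
         Violation("unknown-kernel-module", "db-2")]
fixed, alerts = triage(found, {"ssh-root-login": lambda v: True})
print(len(fixed), len(alerts))  # 1 1
```

With an empty remediation table, every violation lands in `alerts`, which is the overload scenario described above.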

Risk-averse reluctance to automate is compounded by the fact that fewer and fewer problems are covered by existing runbooks, whether automated or not, except at either the highest, least detailed level or the lowest, most granular one. This shrinking coverage lowers the return on the effort of developing and maintaining the automation itself.

Working in Layers: The Only Way to Develop Good Automation

Develop building blocks that are as small, atomic, and reusable as possible, and build up from there. When processes or the managed infrastructure change, only one or two isolated components then need to be updated.
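As a rough illustration of the layered approach, the sketch below builds a runbook out of small steps that each take and return a context dictionary. The step names (`drain`, `restart`, `restore`) and the `compose` helper are hypothetical and stand in for whatever real actions an environment needs.

```python
from typing import Callable, Dict

Context = Dict[str, str]
Step = Callable[[Context], Context]


def compose(*steps: Step) -> Step:
    """Chain small steps into a larger runbook, run in the order given."""
    def runbook(ctx: Context) -> Context:
        for step in steps:
            ctx = step(ctx)
        return ctx
    return runbook


# Each block does one thing and knows nothing about the others.
def drain(ctx: Context) -> Context:
    return {**ctx, "traffic": "drained"}


def restart(ctx: Context) -> Context:
    return {**ctx, "service": "restarted"}


def restore(ctx: Context) -> Context:
    return {**ctx, "traffic": "restored"}


rolling_restart = compose(drain, restart, restore)
print(rolling_restart({"host": "web-1"}))
```

If the way traffic is drained changes, only `drain` needs to be touched; every runbook composed from it picks up the change.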

In turn, this shift away from monolithic runbooks also makes it easier to expose automation to a wider array of users, and those “users” may themselves be pieces of software. An analytics package, possibly an AIOps tool, may be able to offer operators a menu of “known-good” automated tasks that they can execute safely.
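A menu of known-good tasks can be as simple as a registry that refuses anything not explicitly approved. This is a minimal sketch, not a description of how any AIOps product works; the `known_good` decorator, the `clear-tmp` task, and the other names are assumptions made for illustration.

```python
from typing import Callable, Dict, List

Task = Callable[[str], str]
KNOWN_GOOD: Dict[str, Task] = {}


def known_good(name: str) -> Callable[[Task], Task]:
    """Decorator that adds a vetted task to the menu under `name`."""
    def register(fn: Task) -> Task:
        KNOWN_GOOD[name] = fn
        return fn
    return register


@known_good("clear-tmp")
def clear_tmp(host: str) -> str:
    # Placeholder body; a real task would act on the host.
    return f"cleared temp files on {host}"


def menu() -> List[str]:
    # What a human operator or a calling program is allowed to choose from.
    return sorted(KNOWN_GOOD)


def execute(name: str, host: str) -> str:
    if name not in KNOWN_GOOD:
        raise KeyError(f"{name!r} is not an approved task")
    return KNOWN_GOOD[name](host)
```

The caller, whether a person or a piece of software, sees only `menu()` and `execute()`, so the set of actions it can trigger stays bounded.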

Over time, it then becomes much easier to close the loop and deliver true self-healing automation, by taking advantage of machine learning instead of trying to specify exhaustively every single starting condition that the runbook may need to deal with.

Assess Your Environment and Identify Strong Candidates for Automation

Don’t try to do too many things at once, but pick a handful of possibilities that would deliver significant time savings if they could be automated successfully. Then start thinking more widely about how that automation would work.

This is the approach that will let you benefit from automation, both in terms of raw time savings and through the removal of interruptions and distractions.

Having dealt with that first handful of your most annoying tasks, you will then have the time and leisure to consider your next steps – more automation, or perhaps something else that you had always meant to do, but never had the time for.

Isn’t that what computers were designed for?