Service outages, a harsh reality that enterprises have to deal with, can critically disrupt modern IT environments. Integrating AI and monitoring tools can help IT teams deal with outages, writes Mohan Kompella, VP of product marketing at BigPanda.
You’ve probably seen it many times over. Drew Barrymore’s Casey Becker makes popcorn in a stovetop popper while she picks up a mysterious phone call in the opening scene of “Scream.” The call takes a grim turn, as Becker soon realizes the voice on the other side of the line can see her.
As the call takes dark turn after dark turn, Becker tells the caller her boyfriend will be there any second, only to watch the patio lights flip on, revealing her boyfriend tied to a chair.
The genre-redefining flick was scary, but scary things don’t just happen in the movies. On the contrary, plenty of companies also encounter frightening service outages, the kind of thing that keeps their IT operations (IT Ops) and network operations center (NOC) managers awake, no matter what time of year.
IT Outages Cause Real-Life Horror Stories
Last year, British Airways experienced a horrifying IT outage. The outage led to one hundred canceled flights and hundreds more delays. While the team there figured out the cause of the outage, the airline giant reverted to more manual backup processes. This is the third time in the last few years the airline has experienced such a disruption.
Of course, that’s not the only high-profile company to experience a horrifying outage.
People the world over were left without their … gasp … social media for hours when Facebook suffered an outage that limited access to Instagram and WhatsApp, as well. It turned out the hosting platform encountered a bug in its programming, which ultimately impacted all of these different sites and applications. What made this outage particularly frustrating was the ripple effect it had on businesses that rely on social media for advertising and promotions. This type of prolonged (22 hours) inaccessibility caused tens of thousands of dollars in losses for some small business owners and significant declines in click and engagement rates.
Unfortunately, these companies aren’t alone in experiencing unplanned downtime. Today, for most enterprises, the question is not “if” but “when” they will experience an outage. In 2014, Gartner estimated that the average cost of downtime was $5,600 per minute, a number that’s surely since ballooned. Not only do companies face real monetary costs associated with downtime, but they also must contend with the pain these outages cause for end users and customers.
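To put that per-minute figure in perspective, here is a minimal Python sketch that scales it to a few outage lengths. Only the $5,600 rate comes from the 2014 estimate cited above; the function name and the sample durations are hypothetical, and a real organization's costs will differ.

```python
# Per-minute rate from the 2014 estimate cited in the article.
# The durations below are illustrative assumptions, not reported data.
COST_PER_MINUTE_USD = 5_600


def downtime_cost(minutes: float, cost_per_minute: float = COST_PER_MINUTE_USD) -> float:
    """Return the estimated cost in dollars of an outage lasting `minutes`."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return minutes * cost_per_minute


if __name__ == "__main__":
    for label, minutes in [("15 minutes", 15), ("1 hour", 60), ("4 hours", 240)]:
        print(f"{label}: ${downtime_cost(minutes):,.0f}")
```

At that rate, a single hour of downtime works out to $336,000.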
Enterprises that wish to limit the impact of potential outages must examine the causes behind them. Large companies are modernizing rapidly, moving to the cloud or even multiple clouds. This migration is driving constant shifts in application technology, and it’s creating an increasing amount of noise that IT Ops, NOC and DevOps teams must contend with regularly. No matter how skilled or how big these teams are, they simply cannot scale their incident-response processes fast enough.
As the duration, frequency, and impact of outages and incidents mount, these teams need a way to identify the root causes of outages and resolve them quickly in order to avoid horror stories like these, of which we’ve seen far too many.
AI Isn’t a Panacea, but It Can Help
Advances in technology, particularly in the realms of artificial intelligence (AI) and machine learning (ML), mean that there are some very viable tools and technologies out there to help teams make sense of all the changes. But leveraging AI isn’t the only way to address today’s IT environment. Here are four things you can do now to help make 2020 horror-story free.
1. Embrace your tools
You’ve likely invested plenty into monitoring tools. These platforms generate plenty of data, sure, but you still might not be getting the insights that you want. You may feel pressure to retire these tools or switch your approach. Stop, breathe, relax. Rather than panic-purchase another monitoring tool, trust the data you have and first start looking for ways you can get more from your existing tools.
2. Embrace AI and ML
AI-infused technology isn’t going to fix everything for you, but it can process far more data than humans alone can. These types of technologies are really well suited for IT Ops data and other forms of information that require real-time processing to separate the signal from the noise in predictive fashion.
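To make "separating the signal from the noise" concrete, here is a deliberately simple Python sketch. It is not machine learning and not any vendor's method: it uses a fixed rule (same host, same check, close together in time) as a stand-in for the kind of grouping that ML-based tools learn from data. The `Alert` fields, the `group_alerts` name, and the five-minute window are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert record; real monitoring tools carry many more fields."""
    timestamp: float  # seconds since some reference point
    host: str
    check: str


def group_alerts(alerts: List[Alert], window_seconds: float = 300.0) -> List[List[Alert]]:
    """Group alerts that share a host and check and arrive within
    `window_seconds` of the previous alert in the same group."""
    groups: List[List[Alert]] = []
    latest: Dict[Tuple[str, str], List[Alert]] = {}
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.host, alert.check)
        current = latest.get(key)
        if current is not None and alert.timestamp - current[-1].timestamp <= window_seconds:
            current.append(alert)  # same ongoing problem: fold into existing group
        else:
            current = [alert]  # new problem, or the old one went quiet for too long
            latest[key] = current
            groups.append(current)
    return groups


if __name__ == "__main__":
    raw = [
        Alert(0, "db1", "cpu"),
        Alert(60, "db1", "cpu"),
        Alert(90, "web1", "latency"),
        Alert(1000, "db1", "cpu"),
    ]
    grouped = group_alerts(raw)
    print(f"{len(raw)} raw alerts -> {len(grouped)} groups to review")
```

Even a crude rule like this reduces how many items a person has to look at. The case for AI and ML is that they can find such relationships across far more data, without someone hand-writing every rule.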
3. Leverage a reporting tool
You can’t improve what you can’t measure. If you’re using manual spreadsheets or general-purpose tools, you’re probably wasting time or struggling to create reports and benchmark your current technology state. IT Ops reporting tools have matured, and reporting purpose-built for IT Ops is now available. That means now is the time to find a solution and start benchmarking your IT environment’s change rates.
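As a rough illustration of what benchmarking a change rate can mean, the Python sketch below counts changes per ISO week from a list of change dates. The function name and the sample dates are hypothetical. In practice the dates would come from whatever change log or deployment record your team already keeps, and a purpose-built reporting tool would do far more than this.

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable, Tuple


def changes_per_week(change_dates: Iterable[date]) -> Dict[Tuple[int, int], int]:
    """Count changes per (ISO year, ISO week), sorted chronologically."""
    counts: Counter = Counter()
    for d in change_dates:
        iso = d.isocalendar()  # (ISO year, ISO week number, ISO weekday)
        counts[(iso[0], iso[1])] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Made-up dates for illustration only.
    sample = [date(2020, 1, 6), date(2020, 1, 7), date(2020, 1, 14)]
    for (year, week), count in changes_per_week(sample).items():
        print(f"{year}-W{week:02d}: {count} change(s)")
```

Tracking a number like this week over week gives a baseline, so a sudden jump in changes can be compared against any jump in incidents.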
4. Evaluate your team organization
No matter what tools you have in place, you’ll struggle to keep up with incidents if your IT team’s organizational model isn’t quite right. In a new environment, new structures might help. Centralizing performance management roles, for example, can help create a nexus around which the rest of the IT organization revolves, ensuring everyone is aligned in their incident-response strategy.
Embrace Change, Avoid Nightmares
In an ever-changing IT world, it’s high time your organization makes some changes, too. While you’ll never be able to avoid every single incident that might crop up, with the right people, organization, and tools in place, you’ll be able to mitigate the impacts of such outages and avoid lengthy horror stories. IT teams are crucial drivers of any digital transformation initiative, so put them in a place to succeed, and help your organization separate itself from the competition.