Hybrid Cloud Is All the Rage—What About Data?

essidsolutions

This article helps cloud architects and data engineers planning their hybrid and multi-cloud strategies build better architectures to manage vast amounts of data spread across many silos and make the most of their most useful and valuable data.

Hybrid cloud is suddenly all the rage. Enterprises and software vendors aren’t just talking about it—they’re making big bets on it. IBM recently acquired Red Hat and sees it as key to its hybrid cloud strategy. A number of other acquisitions, small and large, along with various partnerships, are accelerating the adoption of hybrid environments.

But first, what does hybrid cloud actually mean? Red Hat defines hybrid cloud as “a combination of two or more cloud environments—public or private. It’s a pool of abstracted resources that could be developed partially from hardware owned and managed by a third-party company as well as hardware owned by the enterprise using the cloud.”

Further, it says that “these resources are orchestrated by management and automation software that allow users to access the cloud on-demand through self-service portals. And everything is supported by automatic scaling and dynamic resource allocation.”

As this definition outlines, the cloud has, since it emerged, focused mostly on computing, primarily to run stateless applications—bringing flexibility, cost savings and on-demand usage to enterprises. And Kubernetes is increasingly seen as a key technology to enable and drive hybrid cloud for computing resources, thanks to its ability to orchestrate computing environments across clouds, public and private.

Data in Hybrid Environment

So now, what about data? With the data revolution, the bulk of computing today happens on data, whether it is simple reporting of enterprise performance or product usage, forecasting with models, or more complex analysis like training machine learning models. And data isn’t really a “pool of abstracted resources”. In fact, data is everywhere, and increasingly physically siloed across different racks, different regions and different clouds. What happens to data in hybrid environments? To get insights from data, it eventually needs to be close to where the computation occurs – think technologies like Apache Spark, Presto and TensorFlow.

For computation orchestration, we have a sophisticated tool in Kubernetes – it grows and shrinks clusters as needed, across public and private cloud environments. But how do enterprises deal with their most important asset, data? They keep copying the datasets they need across environments. Doesn’t that seem a little rudimentary?

Let’s take a step back and think about all the data being collected. According to IDC research, “By 2025, over 20 per cent of the data created in the global datasphere could be useful for analytics if only tagged.” If only about 20% of data is actually useful, why copy data around that may not even be relevant? Not to mention that the same set of data is likely being used over and over again.

In fact, wouldn’t the best way to determine the relevance and worth of data be to have the computational framework pull the required data closer? Not by brute force, not by an admin copying data around, but driven by the questions being asked of the data itself.

Plus, based on the frequency or recency of data access (and perhaps other dimensions), the most important data, i.e. the “active dataset”, should be kept closer to where the processing happens, because it is highly likely that this data will be accessed again.
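One simple way to model this “active dataset” idea is a capacity-bounded cache that keeps recently accessed datasets local and evicts the least recently used ones. The following Python sketch is purely illustrative—the class and paths are assumptions for this article, not any real product’s API:

```python
from collections import OrderedDict

class ActiveDatasetCache:
    """Toy 'active dataset' cache: keeps the most recently accessed
    datasets local, evicting the least recently used when over capacity."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.entries = OrderedDict()  # path -> size in bytes, in LRU order

    def access(self, path, size_bytes):
        """Record an access; return True if the dataset was already local."""
        hit = path in self.entries
        if hit:
            self.entries.move_to_end(path)  # refresh recency on a hit
        else:
            self.entries[path] = size_bytes
            self.used += size_bytes
            while self.used > self.capacity:  # evict coldest data first
                _, evicted_size = self.entries.popitem(last=False)
                self.used -= evicted_size
        return hit

cache = ActiveDatasetCache(capacity_bytes=100)
cache.access("s3://bucket/logs/2019-01.parquet", 60)   # miss, cached
cache.access("hdfs://warehouse/sales.orc", 50)         # miss, evicts the logs
print(cache.access("hdfs://warehouse/sales.orc", 50))  # True: still local
```

Real systems would weigh access frequency as well as recency, but the principle is the same: the questions being asked of the data decide what stays close to compute.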

Data Orchestration

This type of data management approach—data orchestration—is similar to the Kubernetes approach for managing computing. The objective is to make data accessible to compute no matter where the data is stored. That data could be in S3 on AWS, on-premises in HDFS, in another public cloud, or even in a third-party dataset that is only accessible over REST. Orchestrating data across silos and clouds might be the most important piece of the hybrid and multi-cloud puzzle.
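At its simplest, such an orchestration layer presents one logical namespace and dispatches each read to whichever silo holds the data. A minimal Python sketch of the idea—the scheme-based routing and the stub backends here are assumptions for illustration, not a real product’s interface; real backends would wrap clients like boto3, an HDFS library, or requests:

```python
class DataOrchestrator:
    """Toy unified namespace: routes a logical URI to the backend that
    physically stores the data, whichever cloud or silo that is."""

    def __init__(self):
        self.backends = {}  # scheme -> reader function

    def register(self, scheme, reader):
        self.backends[scheme] = reader

    def read(self, uri):
        scheme, _, path = uri.partition("://")
        if scheme not in self.backends:
            raise ValueError(f"no backend registered for scheme '{scheme}'")
        return self.backends[scheme](path)

orchestrator = DataOrchestrator()
# Stub readers standing in for real S3 / HDFS / REST clients.
orchestrator.register("s3", lambda path: f"<bytes from S3 object {path}>")
orchestrator.register("hdfs", lambda path: f"<bytes from HDFS file {path}>")
orchestrator.register("https", lambda path: f"<bytes from REST endpoint {path}>")

print(orchestrator.read("s3://bucket/events.parquet"))
# -> <bytes from S3 object bucket/events.parquet>
```

Compute frameworks then address data by logical path alone, and the orchestration layer—not an admin copying files—worries about where the bytes actually live.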

It is time to think about hybrid cloud beyond just computing. For computing to drive value, you need data, and for the active, most important data to reach compute, you need data orchestration.