The Future of Hadoop in a Cloud-Based World


Hadoop once promised economical storage at massive scale and streamlined processing of petabytes of data. As WANdisco CEO David Richards explains, though Hadoop took a big hit last year, it will stay with us for a while longer.

We’ve seen tectonic shifts in the big data industry this past year – with some $18 billion worth of acquisitions in the data and analytics space including Salesforce acquiring Tableau, Google acquiring Looker, and CommVault acquiring Hedvig.

This wave of consolidation unquestionably signals a fundamental change in the outlook for Hadoop. Yet even given the recent roller-coaster ride of Cloudera, MapR, and other Hadoop players – it’s too early to eulogize the platform. While Hadoop’s once superstar status is certainly diminished, its existence is not in question.

What is Hadoop?

Hadoop is a Java-based, open-source framework managed by the Apache Software Foundation, designed to store and process massive datasets across clusters of commodity hardware using simple programming models. Built to scale from a single server to thousands of machines, Hadoop relies on software rather than hardware for high availability: the system itself detects and handles failures at the application layer. Hadoop is composed of two primary components: the Hadoop Distributed File System (HDFS) and Yet Another Resource Negotiator (YARN).

HDFS is Hadoop's main data storage system. It employs a NameNode/DataNode architecture to deliver high-performance access to data in a distributed file system that sits on highly scalable Hadoop clusters. YARN, initially named 'MapReduce 2' (as the next generation of the wildly popular MapReduce), schedules jobs and manages resources for all applications running on the cluster. It is also widely used by Hadoop developers to create applications that work with ultra-large datasets.
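To make the programming model concrete, below is a minimal sketch of the classic word-count job written against Hadoop's MapReduce Java API. The class name and HDFS paths are illustrative placeholders, not taken from any particular deployment.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in each input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts for each word across all mappers.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // HDFS input and output paths supplied on the command line
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Assuming the class is packaged into a JAR, a job like this would typically be submitted with hadoop jar, with YARN then scheduling the map and reduce tasks across the cluster's DataNodes.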


A Brief History of Hadoop

Hadoop’s origins date back to 2002, when Apache Nutch developers Doug Cutting and Mike Cafarella started looking for a more cost-effective architecture to meet Nutch’s goal of indexing one billion web pages. Cutting moved to Yahoo! in 2006, gaining a dedicated team and the resources to turn Hadoop into a web-scale system. In 2008, Yahoo! released Hadoop to Apache, and it was successfully tested on a 4,000-node cluster.

The year after, in 2009, Hadoop was successfully tested at petabyte scale, handling billions of searches and indexing millions of pages in just 17 hours. That same year, Doug Cutting left Yahoo! to join Cloudera, making it the first dedicated Hadoop company, with the goal of spreading Hadoop to other industries. Cloudera was followed by MapR in 2009 and Hortonworks in 2011, and Hadoop quickly gained favor among Fortune 500 companies that identified big data as a rapidly developing, high-value field.

The Promise of Hadoop

The term “big data” means different things to different people. Perhaps a better way of expressing it is “more data with greater effect.” At some point, companies realized that all the data generated by their web and social media presences was either getting lost or simply accumulating in expensive storage while serving little purpose. These organizations realized that this data could be used to create an improved, personalized user experience that would drive adoption and revenue. But they lacked the tools to do so cost-effectively at scale.

Enter Hadoop. This new technology promised economical storage at massive scale and streamlined processing of petabytes of data. Thus, the idea of a company “data lake” was born, and the era of effective big data processing began.


The Death of the Promise of Hadoop

When Hadoop was launched and gained popularity, it was an idea whose time had come. Finally, there was a way to store petabytes of data at a fraction of traditional data warehousing costs.

But then enterprises realized that storing data and using it were two entirely different challenges. Data began backing up in data swamps because organizations were unable to match the performance, security, or business-tool integration of their data warehouses, which were more expensive but more manageable.

Despite the promises of companies like Cloudera, MapR and others to bring cloud-like flexibility to Hadoop, data architects began to rethink their plodding, massive data lakes. Cloudera and other Hadoop vendors responded to growing interest in cloud-based solutions with hybrid cloud and multi-cloud offerings like the Cloudera Data Platform (CDP), launched last March. Yet these were largely based on clunky “lift and shift” methodologies, whose efficacy and efficiency remain in question.

It was too little, too late. Hadoop vendors had essentially tried to create their own version of lock-in. Instead, they created a market. By trying to stymie innovation, they drove big data developers right into the open arms of specialized cloud-based big data storage, processing, and analytic services like those offered by AWS, Azure, and Google Cloud. These folks got used to the freedom, power and flexibility of cloud-based solutions. And now there’s no turning back.


The Future of Hadoop: A Long, Slow Fade

Hadoop’s freefall last year reflects the industry’s ongoing transition away from technology of a different era. We are moving from on-prem storage and billions of batch-based queries to real-time analytics over massive cloud-based datasets.

That said, Hadoop is not going to disappear anytime soon. The reality is that companies will need to plan the transition, exploring other options and rethinking their place in a post-Hadoop world.

Hadoop-based data lakes will live on for years in industries where time-sensitive and insight-rich analytics are less important, and cost trumps efficiency. Hadoop will have its rightful place in the big data ecosystem. But for dynamic and fast-moving business landscapes, data management is going to be cloud-dominated, and organizations need to be planning their transitions today.

Data lakes are a thing of the past because data is not a static, closed body. We need to look at data as a river that can’t be dammed – not a lake. It is in constant change because business doesn’t stop for migrations, upgrades or downtime. The context of data evolves minute-by-minute, and ensuring ironclad data consistency and usability – not just filling up a reservoir – is the true challenge facing data stakeholders.

Hadoop will eventually fade away because monolithic technological models inevitably wane in favor of their more dynamic offspring. People are hardwired for the freedom inherent in the cloud paradigm. Data should not drown in lakes; it needs to flow unhindered.

Let us know your thoughts on the future of Hadoop on LinkedIn, Twitter, or Facebook. We would love to hear from you!