Data lakes have not fully delivered on their promise. Thomas Hazel, founder, CTO, and chief scientist at ChaosSearch, explains why data lakes fail and how tech leaders can solve current challenges to make their data lakes more valuable.
When the term “data lake” was coined more than a decade ago, it held much promise to deliver more organized, centrally stored, and accessible data to support analytics; unfortunately, the technology and architecture have delivered mixed results.
The promise of data lakes has never been fully realized for several reasons: the volume, velocity, and variety of today’s data; outdated infrastructure that was not built for lakes; the need to constantly move and transform data in and out of the lake to support various analytics purposes; and limited resources to actually get the work done. Over time, new concepts and corresponding technologies, such as the “lake house” and “data mesh,” have been created to help bring the value of data lakes to life, yet none has truly addressed the original issues of time, cost, and complexity.
What Data Lakes Are Intended To Do
A data lake is a repository that houses data in its original, raw form (simple in). It helps eliminate data silos by acting as a single landing zone for data from multiple sources to ultimately provide operational and business insights (value out).
Data lakes promise key advantages over data warehouses, data marts, and traditional databases. While data warehouses ingest well-structured data that fit a predefined schema, data lakes ingest all data types in their source format. Data lakes make it easy and cost-effective to store large volumes of organizational data, including data without a clearly defined use case. Warehouses, marts, and databases require data to be structured and organized in particular ways, making them more complex, less flexible, and less scalable.
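The contrast between the two ingestion models can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the schema, the sample records, and the function names are assumptions made for illustration, and plain lists stand in for real storage. It is not how any particular warehouse or data lake product works; it only shows a predefined schema rejecting data at load time while a lake-style landing zone keeps every payload in its source format.

```python
import json

# Hypothetical predefined warehouse schema: column name -> required Python type.
WAREHOUSE_SCHEMA = {"user_id": int, "event": str, "amount": float}


def load_into_warehouse(table, record):
    """Warehouse-style load: reject any record that does not fit the schema."""
    if not isinstance(record, dict) or set(record) != set(WAREHOUSE_SCHEMA):
        raise ValueError("record does not have exactly the expected columns")
    for column, expected_type in WAREHOUSE_SCHEMA.items():
        if not isinstance(record[column], expected_type):
            raise ValueError(f"column {column!r} has the wrong type")
    table.append(record)


def land_in_lake(lake, raw_payload):
    """Lake-style load: keep the payload exactly as it arrived."""
    lake.append(raw_payload)


# Illustrative payloads: structured, semi-structured, and unstructured.
incoming = [
    '{"user_id": 1, "event": "purchase", "amount": 19.99}',
    '{"user_id": 2, "event": "login", "device": {"os": "linux"}}',
    "plain text log line: disk usage at 91%",
]

warehouse_table, lake, rejected = [], [], []
for raw in incoming:
    land_in_lake(lake, raw)  # always succeeds
    try:
        load_into_warehouse(warehouse_table, json.loads(raw))
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        rejected.append(raw)

print(f"lake: {len(lake)}, warehouse: {len(warehouse_table)}, rejected: {len(rejected)}")
```

In this sketch the lake accepts all three payloads, while the warehouse-style loader accepts only the one record that matches its schema.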
In short, when you have lots of structured, semi-structured, and unstructured data to manage or analyze, data lakes make it easy to do so. At least, that’s how they are supposed to work.
Why Data Lakes Aren’t Working
Right now, data lakes are only making good on one promise: storing data at scale in a flexible way. But in the end, that’s not delivering value to the business. For data lakes to actually make sense, organizations need to be able to analyze all the data at their disposal to make more informed operational/business decisions around product offerings, security, and performance.
Unfortunately, there are barriers to making this a reality, specifically:
- The growing volume and complexity of data. There’s simply more data than ever before; IDC projects that the amount of data created over the next three years will be more than the data created over the past 30 years.
- The limited and outdated infrastructure these tools have been built on. Current data lakes are built on outdated technology that requires data to be moved into siloed solutions and transformed before it can be used for analytics.
- The data engineering/science skills shortage. Data engineering and data science talent are both in drastically short supply, making it harder for companies to transform resources into analytics that fuel business decisions.
- The costs to maintain modern-day solutions. Even open-source or cloud solutions can become sneakily expensive as data volumes scale, especially without the right in-house talent to maintain these solutions.
Even if Your Data Lake Is Failing, It’s Not Too Late To Fix It
Because they were built on old-school technology, today’s data lakes cannot support analytics at scale in a timely or cost-effective manner. To solve these problems, organizations need solutions that modernize the way businesses build and manage data lakes, taking full advantage of the cloud and making cloud object storage effective for analytics at scale and for real-time needs. These solutions should remove data silos and data pipelining requirements so that data lakes cater to any and all use cases rather than being designed to fit a certain range of needs. The modern data lake needs the performance of “schema on write” analytics while still providing the simplicity of “schema on read” ingestion, automating each stage from raw data to insights.
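For readers less familiar with the term, “schema on read” means that data is stored raw and a schema is applied only when a query runs. The following Python sketch illustrates that idea under stated assumptions: the raw lines, the field names, and the `read_with_schema` helper are all hypothetical, and an in-memory list stands in for cloud object storage. It is not a description of any vendor’s implementation.

```python
import json

# Hypothetical raw objects as they might sit in a lake: inconsistent types,
# extra fields, and one line that is not JSON at all.
RAW_LAKE = [
    '{"user_id": "1", "event": "purchase", "amount": "19.99"}',
    '{"user_id": 2, "event": "login", "device": {"os": "linux"}}',
    "plain text log line: disk usage at 91%",
    '{"user_id": 3, "event": "purchase", "amount": 5}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at query time, leaving the stored data untouched.

    `schema` maps field name -> casting function. Lines that are not JSON
    objects are skipped; fields missing from a record become None.
    """
    for line in raw_lines:
        try:
            doc = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(doc, dict):
            continue
        row = {}
        for field, cast in schema.items():
            value = doc.get(field)
            row[field] = cast(value) if value is not None else None
        yield row


# The schema belongs to the query, not to the storage layer.
purchase_schema = {"user_id": int, "event": str, "amount": float}
rows = list(read_with_schema(RAW_LAKE, purchase_schema))

total = sum(
    row["amount"]
    for row in rows
    if row["event"] == "purchase" and row["amount"] is not None
)
print(f"{len(rows)} rows read, purchase total = {total:.2f}")
```

Ingestion stays simple because nothing is validated up front, but every query repeats the parsing and casting work. That repeated cost is the performance gap that “schema on write” systems avoid by doing the work once at load time.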
Additionally, technology leaders should understand that data lake solutions don’t inherently include automation or analytic features. To “fix” the data lake so that it delivers value to the business, it must be combined with intelligent cloud-based services that address “time, cost, and complexity” by automating data discovery, indexing, and transformation to ultimately provide analytics at scale with real-time or near real-time capabilities.
With the right platform and surrounding tooling in place, data lakes can:
- Centralize and simplify data analytics to enhance governance
- Improve access to more data to drive BI and ML analysis to uncover business insights
- Scale log analytics to uncover potential security threats or performance issues
- Cross-pollinate relational analytics and search on all datasets to generate new, deeper insights
By investing in solutions that modernize cloud object storage and remove data silos, organizations can realize the full benefits of the data lake in driving greater data governance, enhanced security, and better business outcomes.