Data Lake vs. Data Warehouse: Is It Time to Ditch On-Premise Data Warehouses?

As business intelligence (BI) and analytics move off-premise to the cloud, organizations are realizing that enterprise data warehouses can no longer meet operational demands. As analytics workloads grow and evolve, IT teams must respond by shortening software development and application deployment times. Is it time for IT leaders to rethink analytics budgets, move away from the warehouse and invest in data lakes?

Data Warehouse Issues

With data volumes and velocities growing exponentially, companies are transforming their data architectures and pivoting to cloud processing to meet operational demands and achieve scalability. Part of this transition involves choosing cloud service providers for a combination of database, software and analytics services.

The shift stems from the fact that the on-premise data warehouse no longer serves current needs. Data is moving to the cloud, and for performance reasons transaction and analytical processing need to be on-platform or near-platform with the data.

Indeed, Gartner reports that Oracle, SAP and Teradata have expanded their offerings in the past year, with IBM, Snowflake and Google not far behind. In Gartner’s 2020 survey of 400 marketing leaders and analytics practitioners, contributor Gloria Omale notes that “Fifty-four percent of senior marketing respondents in the survey indicate that marketing analytics has not had the influence within their organizations that they expected.”

Lizzy Foo Kune, Senior Director Analyst at Gartner, said that “… [the] inability to measure ROI tarnishes the perceived value of the analytics team.”

According to the survey, the major reasons why analytics is not used to inform decisions are:

  • Data findings conflict with the intended course of action (32%)
  • Poor data quality (32%)
  • Analysis does not present a clear recommendation (31%)

With results like these, it is no wonder that tech management is looking for alternatives to the data warehouse for its analytics.

Learn More: 3 Productivity-Killing Data Problems That Data Lakes Can Solve 

Enter Data Lakes

The data lake is a single repository that holds raw data from source systems. It can include databases, structured files, semi-structured data (such as XML and JSON) and unstructured data (such as sensor data, log files, audio and video). According to a recent industry report by Mordor Intelligence, the data lakes market was valued at $3.74 billion in 2019 and is expected to reach $17.60 billion by 2025.
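
To make the idea of a single repository for raw data concrete, here is a minimal sketch of landing heterogeneous source data in the raw zone of a hypothetical S3-backed lake. The bucket name, folder layout and file names are illustrative assumptions, not any vendor’s reference design.

    # Minimal sketch: land raw source data, as-is, in the lake's "raw" zone.
    # Bucket, prefixes and file names are illustrative assumptions.
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "acme-data-lake"  # hypothetical bucket

    # Structured: a CSV extract from an operational database
    s3.upload_file("exports/orders_2021-03-01.csv", BUCKET,
                   "raw/orders/2021/03/01/orders.csv")

    # Semi-structured: clickstream events as JSON lines
    s3.upload_file("exports/clickstream.jsonl", BUCKET,
                   "raw/clickstream/2021/03/01/events.jsonl")

    # Unstructured: call-center audio stored untouched
    s3.upload_file("media/call_0001.wav", BUCKET,
                   "raw/audio/2021/03/01/call_0001.wav")

Note that nothing is transformed on the way in; structure is applied later, at query time.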

Several vendors have complete data lake solutions. Microsoft extended its Azure cloud offering with Azure Data Lake Storage. Oracle offers Oracle Big Data Services, which include Hadoop-based data lakes and analysis through Oracle Cloud. Amazon extended AWS with its own data lake offerings. Meanwhile, Teradata Vantage works with data hosted by Amazon AWS, Microsoft Azure and Google Cloud.

Finally, IBM has partnered with Cloudera to provide a set of open source data lake solutions as integrated technologies that allow a company to build and manage multiple data lakes for use at scale. The IBM solution is particularly interesting in its embrace of open source, following this new industry trend.

Some of the advantages of a data lake include:

  • Data retrieval speed is sometimes faster than a data warehouse, owing to transaction processing and analytics being close to the data (with both the data and software services deployed to the cloud);
  • Data lakes avoid much of the extract-transform-load (ETL) processing, data cleansing and basic data exploration work that data warehouses typically demand of data scientists, according to a survey by O’Reilly (a schema-on-read sketch follows this list);
  • The proliferation of Internet of Things (IoT) devices is driving much of the growth in the data lake market, leading to an exponential growth in cloud services;
  • Being implemented in the cloud, data lakes can take advantage of low-cost data storage, leading to a lower cost of computing compared to an on-premise data warehouse.
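
The reduced ETL burden comes from schema-on-read: the lake stores data exactly as it arrived, and structure is inferred or applied only when a query runs. Here is a minimal PySpark sketch of exploring raw JSON logs in place; the path and field names are assumptions for illustration.

    # Minimal schema-on-read sketch: query raw JSON logs in place,
    # with no upfront warehouse-style load job. Paths and fields are illustrative.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("lake-exploration").getOrCreate()

    # Spark infers the schema from the raw JSON-lines files at read time
    events = spark.read.json("s3a://acme-data-lake/raw/clickstream/2021/03/")

    # Ad hoc exploration straight against the raw data
    daily_counts = (
        events
        .withColumn("day", F.to_date("event_time"))
        .groupBy("day", "event_type")
        .count()
    )
    daily_counts.show()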

Potential Data Lake Pitfalls

Of course, no solution is perfect, nor does one data lake solution fit all companies equally. In 2018, Gartner published a white paper analyzing potential data lake failure scenarios. These scenarios included the following:

  • Some companies dive right into their first data lake project without considering standard data management best practices. The initial intent of creating a single source for all analytics can run afoul of such issues as poor data governance, lack of performance tuning metrics and political challenges.
  • Some tech managers treat the data lake as their own self-service analytics platform and underestimate the data management and data modeling knowledge it requires. Part of the issue is that a data lake holds semi-structured and unstructured data, unlike the data warehouse.
  • Data growth across the enterprise can flood a data lake with old, outdated, irrelevant or unknown data. Raw data is sometimes missing or invalid (such as a RetireDate of “00/00/0000”). This rawness and the sheer data volume mean that standard warehouse transformation logic (the T of ETL) must be embedded in data lake queries, and performance suffers (a query-time transformation sketch follows this list).
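
As a hedged illustration of what pushing the “T” into the query looks like, the PySpark sketch below cleans the RetireDate example above at read time; the source path and column names are hypothetical.

    # Sketch: warehouse-style cleansing (the "T" of ETL) pushed into a lake query.
    # Column names and the invalid-date sentinel follow the RetireDate example above;
    # the source path is a hypothetical raw-zone location.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("query-time-transform").getOrCreate()

    employees = spark.read.parquet("s3a://acme-data-lake/raw/hr/employees/")

    cleaned = (
        employees
        # Treat the "00/00/0000" sentinel as a missing value, not a real date
        .withColumn(
            "retire_date",
            F.when(F.col("RetireDate") == "00/00/0000", F.lit(None))
             .otherwise(F.to_date("RetireDate", "MM/dd/yyyy")),
        )
        # Drop records whose keys never arrived from the source system
        .filter(F.col("EmployeeId").isNotNull())
    )

Because this logic runs every time the data is queried rather than once at load, large raw datasets pay the transformation cost on every scan, which is exactly the performance concern described above.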

Learn More: Top 4 Considerations for Choosing a Data Integration Tool for WFH World

Key Takeaways

Implementing a data lake requires a complete data analytics strategy coupled with proper data management and governance. Begin your journey by investing in the following areas:

  • Ensure that you have a complete and up-to-date enterprise data model that describes all of your data. This includes not only files and databases but also data sources from originating systems. These data can be semi-structured or unstructured and therefore do not fit neatly into common data models. For example, a structured data element such as ProductNumber may have a clear domain (e.g., alphanumeric), entity integrity (such as uniqueness) and a common definition across multiple databases. However, consider a video clip: how can you describe it in a data model? (A metadata sketch follows this list.) Analytics is straightforward on structured data, but writing SQL queries against unstructured data is difficult.
  • Know upfront the value proposition of your data lake, both for the first few projects and for the near future. You will need qualified data science staff for both data storage and business analytics. Further, performance tuning and backup/recovery require the appropriate technical staff (or vendor support staff if you have implemented cloud services).
  • Data growth can flood a data lake and make it useless. Consider initially limiting the amount and type of data stored in the data lake.
  • Review your current analytics tools and consider upgrading them to handle the data lake. Changes in the tools may be required depending upon changes in the types of data (unstructured, etc.), physical location (multi-cloud or even a hybrid cloud combined with on-premise) and user community (ad hoc users, data scientists, expert analysts).
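
On the video-clip question raised in the first point, one pragmatic approach is to model the metadata about the unstructured object rather than the object itself. The sketch below is one possible shape for such a record; the attribute names are assumptions, not a standard.

    # Sketch: modeling an unstructured asset (a video clip) through its metadata.
    # Attribute names are hypothetical; the clip itself stays in object storage
    # and only this descriptive record participates in the data model.
    from dataclasses import dataclass
    from datetime import datetime
    from typing import List

    @dataclass
    class VideoAsset:
        asset_id: str          # unique key, giving the record entity integrity
        storage_uri: str       # where the raw bytes live in the lake
        duration_seconds: int
        codec: str
        source_system: str
        captured_at: datetime
        tags: List[str]        # labels supplied by people or ML models

    clip = VideoAsset(
        asset_id="vid-000123",
        storage_uri="s3://acme-data-lake/raw/video/vid-000123.mp4",
        duration_seconds=95,
        codec="h264",
        source_system="field-inspections",
        captured_at=datetime(2021, 3, 1, 14, 30),
        tags=["turbine", "inspection"],
    )

Structured queries then run against these metadata records, while the clips themselves are handled by specialized tooling.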

Learn More: The Role of Cloud Data Lake Platforms in Enabling TCO Optimization 

Summing Up  

Finally, keep in mind that any major data-driven project will take time and resources. As you move towards implementing your first data lake, it is still necessary to support mission-critical operational systems, including your data warehouse. Consider cross-training your data warehouse staff and analytics team in your data lake technology. Their closeness to the data and their understanding of the enterprise data model will serve you well in the data lake environment.
