Over the last couple of years, I have noticed more and more misleading information about Data Lake solutions. I heard that they are the most recent answer to all the problems which the traditional Data Warehouse was not able to overcome. Unfortunately, people who spread these rumors, forget to mention that this approach reveals other issues that our good, old DWH resolved by default, due to its design. The truth is that both solutions are optimized for different purposes.
The goal is to understand and apply them in the areas for which they are designed for. Nonetheless, if you have tones of data, which you want to use and distribute via reports or provide more advanced analytics you probably need both Data Lakes and Data Warehouse.
From this article you will find out:
How both solutions complement each other;
What are the most important differences between them;
When to consider each approach and how to benefit from them in the most efficient way.
The history of both solutions
The name “Data Lake” was probably used for the first time around 2005. Nevertheless, it was successfully used in many projects without any specific name for several years before that. Why was this solution not so popular back then? The most important reason is the fact that this was the middle of the decade, when the vast majority of data companies wanted to report from relational data.
Non-relational data was not so big in terms of volume, as social media or IoT was not as common as it is nowadays. On the other hand, the idea of Data Warehouses was introduced in the 1970s, so its origin is strictly tied to relational data. It is simple, the solution was designed to meet the challenges of its times. We could say, that DL and DWH operate (mostly, and most efficiently) on two different types of data.
Why do EDW projects fail?
How many failed implementations of Enterprise Data Warehouse projects have you heard of recently? Or, how many complaints have you encountered about somebody requesting to add new data into an already existing DWH solution? It took over a year to complete, the requestor no longer works in the company and no one really knows what the purpose of the entire operation was. This is because everything within the DWH needs to be well structured, so it takes a lot of time and effort to analyze and model what comes from the sources and then to incorporate any novelties before the data is delivered to the users.
Nowadays, we see that IT should be focused on delivering tangible results to businesses as soon as possible. Let’s have a look at the traditional process we follow with DWHs:
Analyze & Model – How to match new data with an already existing model? What questions will the user ask and how will they want to use the requested data?
ETL (Extract, Transform, Load) – It’s important to note that, the data can be used only after the Load phase is completed. The most complex part is the “T” – Transform, which takes time to implement.
Serve – When users see the results, we come back with feedback, which sometimes makes us go back to point 2 or even point 1.
As you can imagine, the whole process is quite time consuming. When properly designed and implemented, it is more solid than other solutions, but it takes time. Furthermore, in order to keep it reliable, you need to add the overhead of data stewardship and governance which are vital activities around any DWH.
How do Data Lakes tackle these problems?
First, Data Lakes are very flexible when it comes to data ingestion. All external sources may just put a file under a given location and the Extract and Load phases of the ELT (Correct, Load before Transform) are performed.
Second, in our previous chapter, we said that users cannot explore the data unless the Load phase is completed. Since we have ELT (not ETL) the time from submitting the request to having the data inside the DL is minimized. Users can explore it almost immediately! They can play with the data, without knowing the questions they want to ask up front. This brings new opportunities, because skilled analysts & data scientists may gather and experiment with very large datasets.
Due to the fact that Data Lakes are perfect candidates for MVP or PoC environments, you can postpone many phases of software development till the very last moment, until a business decides to use it. As for the Transform phase, Data Lakes may also help with offloading DWHs. This is most efficient when speaking of large-volume processing, so instead of using limited DWH resources, you may run the transformations using a Data Lake.
Since Data Lakes seem to be the perfect solution on their own, why don’t we just follow this approach?
Now we should mention all the challenges that Data Lake solutions bring. Firstly, in order to work on raw, or just-ingested data, you need to have skilled & properly trained users, who will be able to consume the data in this form. Moreover, Data Lakes lack semantic consistency and governed metadata which require some struggle when working with 3rd party tools and serving data to the end users.
This is one of the reasons why the exploration can be more complicated than just examining a view or reading a table in Data Mart. Secondly, Data Lakes will usually be inefficient when coping with multiple queries. Especially if we expect these queries to provide results immediately. DWH is the solution you want to use for such use cases. Finally, since RDBMS are a very mature solution, the security features in the traditional DWH are usually way more advanced.
Comparison of Data Lake and Data Warehouse
|Data processing||ELT, schema on read.||ETL, schema on write.|
|Data processing||ELT: Load first, then transform and shape (schema-on-read).||ETL: First create the schema, then transform the data so that it fits it, so that it can be loaded (schema-on-write).|
|Most efficient to work with||Non-relational data.||Relational data.|
|Security||Usually, a solution is based on file permissions or folder access. Ability to connect to Active Directory.||Mature solution, many years of applying complex features like row-level security, column-level security, dynamic data masking, fine-grained control.|
|Agility||Flexible, great for PoC and MVP.||Time consuming for changes, requires more analyzing& modelling.|
|Users||Limited capacity for concurrent queries, not very efficient for multiple users querying at the same time, best suited for data scientists.||Can cope with multiple queries, efficient for multiple users, ease of exploring data for non-advanced users.|
|Data volume& performance||More efficient for processing big volumes, especially column-wise. Possibly inefficient while operating on rows or performing drill through operations.||Not as efficient as Data Lake for processing big volumes, but more efficient for operations on rows or drill through operations and very complex SQL queries.|
|Resource consumption||Designed to easily scale in/out; best suitable for the cloud.||More expensive and limited in terms of scaling (scaling up/down).|
|Support for 3rd party tools||When deployed on-premise, it can be limited; but when deployed in the cloud, it gives a wide range of opportunities.||Rich ecosystem of well-tested tools (BI, monitoring, tuning).|
|Time to users||Exploration possible almost immediately after ingestion.||Takes time to bring new data scope into its structures.|
|Data exploration||Requires more programming skills but SQL interfaces are often available.||Applicable for self-service BI or WYSISWG editors, SQL as lingua franca.|
|Analytics||More suitable for one time or ad-hoc reports, well-suited for ML model training.||Usually easy to create recurring reports, applicable for refreshes or subscriptions.|
|Data retention||Can store a lot of data, archival& historical, just-in-case data.||Advised to store only data that is being used for reporting purposes, storage is costly.|
My boss will only have five minutes for me. How do I explain all this information?
Both Data Lakes and Data Warehouses are important parts of the data processing & reporting infrastructure. They are complementary approaches, not alternatives. Data Lakes are perfectly suited for the staging and processing layers. DWHs are rather a serving and compliance environment, the way you want to expose your data to the business users. You can look at Data Lakes as a more a technical solution, and DWHs as more of a business solution.
Data Lakes will optimize lots of time and resource consumption, they can be used to offload the DWHs whenever possible. Additionally, while using a cloud, you can scale out the computation cluster for the time of the heavy lifting and scaling it down (or even turn it off) when the processing power is not needed and save money. This is usually much more efficient than scaling up/down resources for a DWH.
Last but not least, Data Lakes will allow you to run cost-effective MVPs and decide what you really need to move into the DWH structures, keeping everything smooth and clean. With the old method, your analysts must know all the questions they want to ask upfront. The new one supports situations when they won’t know the questions yet, allowing them to play around and stumble upon some ideas they didn’t even know they had, giving you the competitive edge through creativity and innovation.