Data Lake FAQ: Loading Data From a Data Warehouse


Data Lake Projects: An In-Depth Q&A Compilation by Lingaro

In theory, the advantages of a modern Data Lake over a legacy data warehouse are obvious. In practice, they are harder to see. Many businesses' data warehouses have proven that they work with existing analytics workflows, so the promise of improving those workflows with a new strategic approach is often met with strong skepticism and detailed questions.

Fortunately, we have answers. We have years of cutting-edge Data Lake experience, including global-scale data solutions based on Microsoft's Azure cloud platform.

In this blog series, we have compiled our answers to some of the most common questions we have encountered along the way. We hope you will find them useful as you explore your own Data Lake opportunities.

Question

“Should all my data warehouse’s structured, pre-calculated and pre-aggregated data be loaded to the Data Lake’s landing zone?”

Answer

Possibly. Doing so will probably mean lower setup costs and higher maintenance costs. Not doing so will probably mean higher setup costs and lower maintenance costs.

The first option is to move aggregates and pre-aggregates to the landing zone and treat them as raw data in the Data Lake. This solution is faster and cheaper to implement, but it may carry additional maintenance costs. Its disadvantages are that correct data lineage may be impossible to achieve and that the transformation logic will be maintained outside the Data Lake.

The second option is to import raw data and perform the aggregations on the Data Lake/cloud side. The pros are that the logic will be stored in one place, data lineage will be correct, and we can use the cloud's scaling power to speed up the process. The cons are that it will take more effort at the beginning and will cost more during initial implementation.
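To make the second option more concrete, here is a minimal Python sketch of aggregating raw records on the Data Lake side while keeping track of where each total came from. The record layout, the `source_file` field, and the `aggregate_by_region` function are hypothetical and chosen only for illustration; a real implementation would more likely run on a distributed engine in the cloud than in plain Python.

```python
from collections import defaultdict

# Hypothetical raw sales records as they might arrive in the landing zone.
# Field names and values are assumptions made for this illustration.
RAW_SALES = [
    {"source_file": "sales_2024_01.csv", "region": "EMEA", "amount": 120.0},
    {"source_file": "sales_2024_01.csv", "region": "APAC", "amount": 80.0},
    {"source_file": "sales_2024_02.csv", "region": "EMEA", "amount": 50.0},
]


def aggregate_by_region(rows):
    """Sum amounts per region and record which raw files fed each total."""
    totals = defaultdict(float)
    sources = defaultdict(set)
    for row in rows:
        totals[row["region"]] += row["amount"]
        sources[row["region"]].add(row["source_file"])
    # Each aggregate carries its own lineage: the raw files it was built from.
    return {
        region: {"total": totals[region], "sources": sorted(sources[region])}
        for region in totals
    }


result = aggregate_by_region(RAW_SALES)
for region, info in sorted(result.items()):
    print(region, info["total"], info["sources"])
```

Because the aggregation runs next to the raw data, each total can be traced back to the files that produced it. With the first option, the Data Lake would only ever see the finished totals, and that link would have to be reconstructed from the data warehouse's own logic.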
