How Can I Actually Avoid Siloed Data with a Data Lake?

Data Lake Projects: An In-Depth Q&A Compilation by Lingaro

In theory, the advantages of a modern Data Lake over a legacy data warehouse are quite obvious. In practice, they are not. Many businesses’ data warehouses have been proven to work with their existing analytics workflows. The promise of improving these workflows with a new strategic approach is met with strong skepticism and detailed questions.

Fortunately, we have answers. We have years of cutting-edge Data Lake experience that includes global-scale data solutions based on Microsoft’s Azure cloud platform.

In this blog series, we have compiled our answers to some of the most common questions we have encountered along the way. We hope you will find them useful as you explore your own Data Lake opportunities.

Question

“How can I actually avoid siloed data with a Data Lake?”

Answer

“A Data Lake is superimposed on your existing architecture to streamline management of any type of data, regardless of its physical location.”

In a large organization, data is typically siloed by:

Physical location, i.e. it is stored in various places or systems. Data Lakes do not require your data to be moved to a new physical location such as a centralized data warehouse. Such a move, especially when big data is involved, usually requires significant effort and financial resources. Instead, with a Data Lake, your data stays where it is, and the Data Lake layer is overlaid on top of your existing architecture.
Type, i.e. it is not easy to integrate one type of data with other types of data. Data Lakes do not require any particular data structures to be in place. They can also be designed with metadata-driven frameworks to make integrating additional data sets fast and easy.
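
To make the idea of a metadata-driven framework more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of any particular product: the names `CATALOG`, `SOURCES`, `READERS`, and `load`, as well as the sample data sets, are all hypothetical. The point is that each data set is described by a small metadata entry (where it lives, what format it has), and one generic loader uses that entry to read the data where it already sits.

```python
import csv
import io
import json

# Hypothetical catalog: one metadata entry per data set.
# In a real platform this would live in a catalog service or config store.
CATALOG = {
    "sales_eu": {"format": "csv", "source": "crm_export", "delimiter": ";"},
    "web_events": {"format": "jsonl", "source": "clickstream"},
}

# Stand-ins for physically separate systems (assumption for this sketch).
# The data is neither moved nor reshaped before it is registered.
SOURCES = {
    "crm_export": "order_id;amount\n1;10.5\n2;20.0\n",
    "clickstream": '{"user": "a", "page": "/home"}\n{"user": "b", "page": "/cart"}\n',
}


def read_csv(text, meta):
    """Parse delimited text into a list of dicts, one per row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=meta.get("delimiter", ","))
    return list(reader)


def read_jsonl(text, meta):
    """Parse newline-delimited JSON into a list of dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Format-specific readers, selected purely from metadata.
READERS = {"csv": read_csv, "jsonl": read_jsonl}


def load(name):
    """Look up a data set by name and read it using its catalog entry."""
    meta = CATALOG[name]
    raw = SOURCES[meta["source"]]
    return READERS[meta["format"]](raw, meta)


sales = load("sales_eu")
events = load("web_events")
```

In this sketch, integrating an additional data set means adding one entry to `CATALOG` (and a reader, if the format is new), rather than writing a dedicated pipeline for it.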

Data Lakes are therefore a powerful tool for promoting data democratization. With a Data Lake, data can be made available from a single place to support decision making by any authorized person.

Furthermore, Big Data technologies use the “Schema on Read” approach: you keep storing the data in its raw form and integrate only those pieces that are common to many use cases.

In a Data Hub phase, you decide how to integrate this data. Big Data technologies allow you to work directly on the files and to define the data schema that “registers” those files on the fly, when you need to read them in a specific way (we may call this “lazy integration”).
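
The “Schema on Read” idea can be sketched in a few lines of Python. This is a simplified illustration, assuming a small CSV file as the raw data; the names `RAW_FILE`, `FINANCE_SCHEMA`, `GEO_SCHEMA`, and `read_with_schema` are hypothetical and do not come from any specific Big Data tool. The raw text is stored untouched, and each consumer applies its own schema only at the moment of reading.

```python
import csv
import io

# Raw data kept exactly as it arrived; no schema was enforced at write time.
RAW_FILE = "id,amount,country,note\n1,10.50,PL,first\n2,7.25,DE,\n"

# Hypothetical schemas: each maps the columns a consumer needs to a type.
FINANCE_SCHEMA = {"id": int, "amount": float}
GEO_SCHEMA = {"id": int, "country": str}


def read_with_schema(raw_text, schema):
    """Parse the raw text and apply the schema only now, at read time."""
    rows = []
    for record in csv.DictReader(io.StringIO(raw_text)):
        # Keep only the columns named in the schema and cast each value.
        rows.append({col: cast(record[col]) for col, cast in schema.items()})
    return rows


# Two consumers read the same raw file in two different ways.
finance_rows = read_with_schema(RAW_FILE, FINANCE_SCHEMA)
geo_rows = read_with_schema(RAW_FILE, GEO_SCHEMA)
```

Neither schema changes the stored file. A third consumer could define another schema later without any reload or migration, which is the sense in which the integration is “lazy”.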