How Can I Efficiently Prevent Data Duplication?


Data Lake Projects: An In-Depth Q&A Compilation by Lingaro

In theory, the advantages of a modern Data Lake over a legacy data warehouse are obvious. In practice, they are not. Many businesses' data warehouses have proven themselves within existing analytics workflows, so the promise of improving those workflows with a new strategic approach is met with strong skepticism and detailed questions.

Fortunately, we have answers. We have years of cutting-edge Data Lake experience that includes global scale data solutions based on Microsoft’s Azure cloud platform.

In this blog series, we have compiled our answers to some of the most common questions we have encountered along the way. We hope you will find them useful as you explore your own Data Lake opportunities.

 

Question

“If we load all the data, there is a high chance of creating lots of duplicates. Should we care?”

 

Answer

Yes, clean-up logic should be implemented if the goal is to have the Data Lake serve as a single source of truth.

The purpose of the Data Lake landing zone is to store raw data. Early on, we may come across duplicates because data arrives from various source systems, some of which already contain duplicated records.
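To make this concrete, here is a minimal sketch of clean-up logic between a raw landing zone and a curated zone, using pandas. The source systems, column names, and business key are illustrative assumptions, not part of any specific implementation.

```python
import pandas as pd

# Hypothetical landing-zone extracts from two source systems; the
# column names and values are illustrative only.
crm = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "email": ["a@x.com", "b@x.com", "c@x.com"],
    "source": "crm",
})
erp = pd.DataFrame({
    "customer_id": [102, 103, 104],
    "email": ["b@x.com", "c@x.com", "d@x.com"],
    "source": "erp",
})

# Landing zone: keep everything as delivered, duplicates included.
landing = pd.concat([crm, erp], ignore_index=True)

# Curated zone: one possible clean-up rule -- drop exact duplicates
# on the assumed business key, keeping the first record seen.
curated = landing.drop_duplicates(subset=["customer_id", "email"], keep="first")

print(len(landing))   # 6 raw rows
print(len(curated))   # 4 unique customers
```

In a real pipeline the same idea would typically run at scale (for example as a Spark job on Azure), but the principle is the same: the landing zone stays raw, and deduplication happens on the way to the curated layer.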

It is important to understand that Data Lakes still require supplementary services such as a data catalog, master data management (MDM) systems, and data governance policies. Without data stewardship, a Data Lake can quickly become a data swamp: difficult to navigate and ill-suited to the meaningful analyses that deliver business value, which should be the ultimate driver of its implementation. Proper naming conventions, metadata tagging, and data organization standards are key factors in enabling quick implementation of downstream analytics solutions and in reaching the full potential of Data Lake opportunities.
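As one small illustration of a naming convention, a landing-zone path can be built deterministically from the source system, entity, and load date. The zone name and folder layout below are assumptions for the sketch, not a standard.

```python
from datetime import date

def landing_path(source_system: str, entity: str, load_date: date) -> str:
    """Build a partitioned landing-zone path, e.g. raw/crm/customers/2024/05/01.

    The 'raw/<system>/<entity>/<yyyy>/<mm>/<dd>' layout is one example
    convention; the point is that every pipeline derives paths the same way.
    """
    return (
        f"raw/{source_system.lower()}/{entity.lower()}/"
        f"{load_date:%Y/%m/%d}"
    )

print(landing_path("CRM", "Customers", date(2024, 5, 1)))
# raw/crm/customers/2024/05/01
```

A shared helper like this keeps folder names consistent across teams, which is exactly what makes downstream discovery and cataloging tractable.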

MDM solutions can help resolve multi-source and duplicate-information issues by persisting and applying harmonization rules that resolve conflicts across records originating from different source systems.
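One common harmonization rule is source-priority survivorship: when the same entity arrives from several systems, the record from the most trusted source wins. The sketch below assumes a hypothetical priority order and record shape; real MDM tools express such rules declaratively.

```python
# Lower number = more trusted source (an illustrative assumption).
SOURCE_PRIORITY = {"mdm": 0, "crm": 1, "erp": 2}

records = [
    {"customer_id": 103, "name": "C. Doe", "source": "erp"},
    {"customer_id": 103, "name": "Carol Doe", "source": "crm"},
    {"customer_id": 104, "name": "Dan Roe", "source": "erp"},
]

def harmonize(records):
    """Keep one golden record per customer_id, preferring trusted sources."""
    golden = {}
    for rec in records:
        key = rec["customer_id"]
        best = golden.get(key)
        if best is None or (
            SOURCE_PRIORITY[rec["source"]] < SOURCE_PRIORITY[best["source"]]
        ):
            golden[key] = rec
    return golden

result = harmonize(records)
print(result[103]["name"])  # Carol Doe -- the CRM record outranks ERP
```

Field-level survivorship (picking the best value per attribute rather than per record) follows the same pattern, just applied column by column.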