The co-existence of Data Lakes and traditional Data Warehouse ecosystems: should I get rid of all commodity Data Warehouse solutions when moving to the cloud and a Data Lake?


Data Warehouses are still very popular, but companies investing in their development struggle with growth management and other problems. As the volume and types of stored data grow, along with the number of users and use cases, DWs can have trouble keeping up with the constantly increasing volume of information in multiple formats.

 

Between 30% and 40% of data is siloed in legacy systems and is not being used for analytics.

39% of all data within enterprises is unused due to excessive volume and limited processing capabilities.

73% of all data within enterprises goes unused for analytics.

Therefore, we see common trends in businesses' IT departments: migrating from on-premises Hadoop, DWs, and similar systems to cloud solutions and to more agile and functional Data Lakes; or building Data Lakes in the cloud from scratch. In other words, companies overcome insufficient IT infrastructure by migrating solutions to the cloud, and address the growing amount of heterogeneous data by enriching the analytical ecosystem with Data Lake solutions.

 

Data Lakes, which can store various kinds of raw data, both structured and unstructured, were initially built on on-premises Apache Hadoop clusters but are now often based in the cloud. This way companies can take advantage of service-based flexibility, increased scalability, and better price-to-value ratios, without being forced to develop, organize, and maintain Data Lakes on their own. However, businesses usually decide not to forego their Data Warehouses, and instead create a symbiotic solution by linking the old system with the new one. Therefore, it is crucial to remember that


there is no need to get rid of your Data Warehouse – you can integrate it with the Data Lake solution.

In this article we want to address a couple of topics related to the new trend of migrating to Data Lakes, or of merging old data storage solutions with new ones, thereby gradually stepping towards high-value Data Hubs.

Expert insights
“Data lakes are gaining popularity as ways to both conceptualize and manage data access and usage across an organization. However, data lakes do nothing to ensure that the data is accurate and connected with a single trusted reference point – such as what master data management enables. Without the contextualization and governance that master data management provides, a data lake is just that – a body of free-floating, untreated data that can be accessed and used by anyone. This creates challenges with data literacy, in that while the data is available – its true usage and benefit become suspect without a centralized platform adding the quality that master data management enables.”

Comparison of Data Warehouses and Data Lakes

The most commonly known difference between Data Warehouses and Data Lakes is that DWs store structured data and consist of relational databases, while Data Lakes store structured, unstructured (images, video, BLOBs, IoT data from wearables, etc.), and semi-structured data (such as JSON files). Because of this, Data Lakes allow for broader and more flexible data exploration and are capable of analytical correlations across data points from various sources. However, this is only a small fraction of the differences between on-premises Data Warehouses and cloud Data Lakes.

Data strategies have been impacted as the cloud computing market has been swiftly evolving; it was predicted to exceed $200 billion this year. The constant influx of variously structured data is increasing in businesses, and therefore the demand for agile and flexible solutions for storing, analyzing, and reporting data grows.


Usually, when we talk about Data Lakes, we mean a collection of data or a specific approach to data processing, not an independent data platform. Data Lakes may be built with various enterprise platforms, such as relational database management systems, Hadoop, Google Cloud Platform, Azure Data Factory, Azure Analysis Services, and other services. This is where the second significant difference between on-premises Data Warehouses and cloud Data Lakes emerges:


Cloud Data Lake solutions uniquely detach the data storage functionality from the data computation functionality.


 

This may bring significant savings when it comes to storing and analyzing terabyte-size databases, since clients pay for computational power only when they use it.
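To see why decoupling storage from compute matters financially, consider a minimal back-of-the-envelope sketch. The prices below are hypothetical placeholders, not real vendor rates, and the model ignores egress, tiering, and discounts:

```python
# Illustrative cost model for decoupled storage and compute.
# Both prices are assumed placeholder values, not real vendor rates.
STORAGE_PRICE_PER_TB_MONTH = 23.0   # assumed $/TB/month for cloud object storage
COMPUTE_PRICE_PER_HOUR = 4.0        # assumed $/hour for an analytics cluster

def monthly_cost(storage_tb: float, compute_hours: float) -> float:
    """Cost when compute is billed only for the hours actually used."""
    return storage_tb * STORAGE_PRICE_PER_TB_MONTH + compute_hours * COMPUTE_PRICE_PER_HOUR

def always_on_cost(storage_tb: float) -> float:
    """Cost when compute runs 24/7 (30 days), as in a coupled, always-active setup."""
    return monthly_cost(storage_tb, 24 * 30)

# 50 TB of data, queried heavily for only 40 hours a month:
on_demand = monthly_cost(50, 40)   # 50*23 + 40*4  = 1310.0
always_on = always_on_cost(50)     # 50*23 + 720*4 = 4030.0
```

The gap grows with data volume: storage is cheap to keep, so the dominant cost is how many hours the compute layer actually runs.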

Expert insights
“Snowflake has seen and helped many organisations looking to consolidate both their on-premises analytical databases (data warehouses & data marts) and Hadoop data lakes and migrate them to the cloud. To exploit all of the potential value in their data, organisations are looking to centralise all data, regardless of structure, into a single cloud data lake capable of handling both schema-on-read and schema-on-write functionality, with flexible compute that scales to support multiple analytical applications. Snowflake’s unique cloud data platform addresses these requirements, whilst providing a familiar database API delivered as a low-management SaaS service.”

The final architecture of your DL depends on various aspects, the most crucial being the technology your company currently uses for data. For instance, some companies have been using SQL databases, with SQL-based data exploration and ELT pushdown; they may therefore require a relational Data Lake implemented with relational database tools. In turn, many companies use Hadoop as their preferred platform for Data Lakes, since it is capable of linear scaling, supports various analytic techniques, and may cost less than similar relational configurations. Yet other businesses might find Data Lakes implemented in a cloud service to be the most flexible, scalable, and agile.

The Data Lake approach is much more flexible than the traditional Data Warehouse infrastructure, since it does not require excessive preprocessing, cleaning, transformation, or other types of data preparation. Data may be stored and provided for analysis in its raw, original state, coming directly from the data source. A dataset cluster can then be activated and analyzed on demand; it does not need to be active at all times, unlike in traditional data storage strategies. This can substantially lower data maintenance costs. At the same time, you do not lose the old functionalities, since


data stored in Data Lakes may be easily loaded into Data Warehouses or Data Marts, or it can be consumed directly by analytical software, business intelligence tools, and other applications.
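The schema-on-read idea behind this flexibility can be sketched in a few lines. In this hypothetical example, raw JSON events land in the lake exactly as produced, and a schema is projected onto them only when a particular analysis needs it; the field names and events are invented for illustration:

```python
import json

# Raw events stored as-is in the lake; no schema is imposed at write time,
# so a record with an extra field ("device") is perfectly fine.
raw_events = [
    '{"user": "alice", "action": "click", "ts": 1}',
    '{"user": "bob", "action": "view", "ts": 2, "device": "mobile"}',
]

def read_with_schema(lines, fields):
    """Apply a schema only at read time: project just the fields this analysis needs."""
    for line in lines:
        record = json.loads(line)
        yield {f: record.get(f) for f in fields}

# One analysis projects only (user, action) and filters for clicks:
clicks = [r for r in read_with_schema(raw_events, ["user", "action"])
          if r["action"] == "click"]
# → [{"user": "alice", "action": "click"}]
```

A different analysis could project a different set of fields from the same raw files, which is exactly why new datasets can be added "on the fly" without upfront modeling.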


Expert insights
“It’s worth mentioning that when it comes to building analytics solutions in the cloud, “either-or” strategies don’t work best for data lakes or data warehouses. A data lake is one part of modern data architecture that allows more solutions to be built flexibly in the future.
Let’s take Azure Synapse Analytics as an example of a platform which combines a traditional SQL-like approach with a modern Spark-driven architecture. Integrated analytics runtimes offer provisioned and serverless on-demand SQL analytics with T-SQL for batch, streaming, and interactive processing. At the same time, the Spark runtime can be used for big data processing jobs with Python, Scala, R, and .NET.
What’s the key element of such an approach? Both runtimes are data lake integrated and Common Data Model aware. In case you are curious, the Common Data Model brings semantic consistency to data within the data lake. When data is stored in this form, applications and services can interoperate more easily.”

 

Data Lakes may support multiple functions and interfaces. The growing trend is to use a single Data Lake as a new, better area for data landing and staging, and also to use it in an exploratory, discovery-oriented way (so-called “analytics sandboxes”) in order to find new and interesting data correlations. As the Data Lake approach does not impose a structure on the data, it allows for adding new datasets virtually on the fly.


Lingaro's successful data lake implementation ensures

Business Benefits

  • All data brought together
  • Trusted, quality data
  • High adoption among decision makers, supporting both daily and strategic decisions
  • Democratized access to data and elimination of data silos
  • Fast access to data and insights (fast time-to-value)
  • Real-time decision analysis

Lingaro's successful data lake implementation ensures

Technology Benefits

  • Smooth real-time dataflow
  • SQL and other languages supported
  • Scalable solution
  • Secure data and processing
  • Versatile (structured and unstructured) data

 

Therefore, Data Lakes are not so much a revolution as an evolution of existing technologies. They comprise multi-type data systems and allow for the development of hybrid data ecosystems, suitable, for example, for multichannel marketing or analysis.

 

In the following table you will find a brief comparison of on-premises DWs and cloud-based Data Lakes (Fig. 1).

On-premises Data Warehouse | Cloud Data Lake
Relational data | Diverse types of data
Storage combined with computation | Storage detached from computation
Majority of refined, calculated data | Majority of detailed source data
Entities and dependencies are known and tracked over time | Entities are discovered from raw data
Data must be integrated and transformed upfront | Data preparation on demand
Usually schema on write | Usually schema on read
Limited scalability | More adjustable scalability
Integration with third-party software requires data transformations | Easy integration with third-party software

Figure 1. Comparison of on-premises Data Warehouses and cloud-based Data Lakes.

 


Migration from on-premises Data Warehouses to Cloud Data Lakes

Re-platforming is not an unusual strategy in companies with a DW deployed on premises. More and more businesses move their databases to the cloud. What should they keep in mind when performing this important step?

While upgrading or migrating a DW, you obviously need to plan for the time span, risks and costs, business disruption, and the complexity of the whole undertaking. Not only is data moved to the new platform, but also its management and users. Many DW migration projects end up reining in uncontrolled data marts, or they simplify a vast number of databases by consolidating them into fewer platforms. Therefore,


cloud-based Data Lakes seem to be perfect candidates for data consolidation, since they are globally available platforms and may be easily centralized.

The ideal approach to implementation is to start small, preferably with a Minimum Viable Product (MVP). Begin with a low-risk, high-value segment of work, dividing jobs into manageable segments, each with a technical goal and added business value. If you start with a bloated project, you will probably be quickly overwhelmed by its size and complexity. A multiphase project plan copes better with incoming challenges. Focusing on a segmented dataset that is easy to construct and in demand by the business is often the best way to begin your data migration. It will give other teams a sense of prioritization and the confidence to proceed to more complex data subsets.

Expert insights
“A data warehouse and a data lake complement each other. They do not compete directly, and one does not replace the other. Any real enterprise solution has a bit of both to some extent.
A number of ETL processes need to be revisited and maybe become ELT to leverage the performance of the data lake for processing. Aggregates from the data lake are fed into the data warehouse with analytics cutting across the entire data flow.”
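The ELT pattern described in the quote, transforming where the raw data already lives and feeding only aggregates into the warehouse, can be illustrated with a toy sketch. The order events and day/amount fields below are invented for illustration:

```python
from collections import defaultdict

# A toy "lake" of raw order events, stored exactly as they arrived.
lake_orders = [
    {"day": "2024-01-01", "amount": 120.0},
    {"day": "2024-01-01", "amount": 30.0},
    {"day": "2024-01-02", "amount": 75.0},
]

def aggregate_in_lake(orders):
    """The 'T' of ELT: the transformation runs against raw data in the lake,
    and only the compact result moves downstream."""
    totals = defaultdict(float)
    for order in orders:
        totals[order["day"]] += order["amount"]
    return dict(totals)

# Only the daily aggregates are fed into the warehouse table.
warehouse_daily_sales = aggregate_in_lake(lake_orders)
# → {"2024-01-01": 150.0, "2024-01-02": 75.0}
```

In a real pipeline the aggregation would run on the lake's own engine (Spark, a serverless SQL pool, etc.), but the data flow, raw events in the lake, compact aggregates in the warehouse, is the same.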

During the migration you may stumble upon failures. Therefore, you should plan contingencies for risky milestones, and preferably develop automated testing and scripting to increase quality and avoid migration problems. For some time, your DW and DL, or at least some of their crucial parts, will function simultaneously. The duration of this period depends on the complexity and size of your databases, user groups, and processes.
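One simple form the automated testing can take is reconciliation: after each load, compare the old and new copies of a table. The sketch below assumes rows exposed as dictionaries with a hypothetical `id` key; real checks would also compare checksums or column-level aggregates:

```python
def reconcile(source_rows, target_rows, key):
    """Compare row counts and key sets between the old DW table and its DL copy."""
    src_keys = {row[key] for row in source_rows}
    tgt_keys = {row[key] for row in target_rows}
    return {
        "count_match": len(source_rows) == len(target_rows),
        "missing_in_target": sorted(src_keys - tgt_keys),
        "unexpected_in_target": sorted(tgt_keys - src_keys),
    }

# Toy example: one row failed to arrive in the new platform.
source = [{"id": 1}, {"id": 2}, {"id": 3}]
target = [{"id": 1}, {"id": 3}]
report = reconcile(source, target, "id")
# report["missing_in_target"] → [2]
```

Running a check like this automatically after every load turns silent migration gaps into explicit, actionable reports.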

Migration does not mean only moving and consolidating your system elements. It may require development, especially if you have many uncontrolled processes in your databases. The “lift-and-shift” strategy is sometimes possible, but at other times you may be forced to develop data models and interfaces to maximize performance on the cloud platform. A lack of backward compatibility may force your team to develop specific components and routines, such as stored procedures and user-defined functions. Additionally, the quality of your data and previous models may influence the new platform; cloud solutions are not a magic wand, so try not to migrate your old problems together with your old platform.

Data migration requires more than architecture and data modelling experts; it is important not to forget data maintenance staff, such as database administrators and system analysts. Moreover, the migration will affect many elements and departments of your company: data modelling and analysis, reporting, dashboards, metrics, BI, etc. Each of these elements may be produced or supported by a different business branch. Use this to your advantage and remember that


migration to the cloud may be an opportunity to improve the quality of your data that was previously generated in various departments of your company.


 

Therefore, consider whether only IT and managers should be involved in planning the modernization of your data warehouse, as other end users and departments might also provide significant insights.

 

    Want to get great insights on Data Lake implementations?

    Download Lingaro’s complete Q&A compilation

