The co-existence of Data Lakes and traditional Data Warehouse ecosystems. Should I get rid of all commodity Data Warehouse solutions when moving to the cloud and a Data Lake?
Data Warehouses (DWs) are still very popular, but companies investing in their development struggle with managing growth, among other problems. As the volume and variety of stored data grow, along with the number of users and use cases, DWs can have trouble keeping up with the constantly increasing volume of information in multiple formats.
Between 30%-40% of data is siloed in legacy systems and is not being used for analytics.
39% of all data within enterprises is unused due to too high volume and restricted processing capabilities.
73% of all data within enterprises goes unused for analytics.
Therefore, we see common trends in corporate IT departments: migrating from on-premises Hadoop, DWs, and similar systems to cloud solutions and more agile and functional Data Lakes, or building Data Lakes in the cloud from scratch. In other words, companies overcome problems with insufficient IT infrastructure by migrating solutions to the cloud, and address problems with the growing amount of heterogeneous data by enriching the analytical ecosystem with Data Lake solutions.
Data Lakes, which can store various kinds of raw data, both structured and unstructured, were initially built on premises with Apache Hadoop clusters but are now often based in the cloud. This way companies may take advantage of the cloud's service-based flexibility, increased scalability, and better price-to-value ratio, without being forced to develop, organize, and maintain Data Lakes on their own. However, businesses usually decide not to forgo their Data Warehouses and instead create a symbiotic solution that links the old system with the new one. Therefore, it is crucial to remember that
In this article we want to address two topics related to the trend of migrating to Data Lakes, or of merging old data storage solutions with new ones and thereby gradually moving towards high-value Data Hubs:
- Comparison of Data Warehouses and Data Lakes
- Migration from on-premises Data Warehouses to Cloud Data Lakes
Comparison of Data Warehouses and Data Lakes
The best-known difference between Data Warehouses and Data Lakes is that DWs store structured data and consist of relational databases, while Data Lakes store structured data, semi-structured data (like some JSON files), and unstructured data (like images, video, BLOBs, or IoT data from wearables). Because of this, Data Lakes allow for broader and more flexible data exploration and make it possible to correlate data points from various sources. However, this is only a small fraction of the differences between on-premises Data Warehouses and cloud Data Lakes.
The swift evolution of the cloud computing market, which was predicted to exceed $200 billion this year, has affected data strategies. The influx of variously structured data into businesses keeps increasing, and with it the demand for agile and flexible solutions for storing, analyzing, and reporting on that data.
Usually, when we talk about Data Lakes, we mean a collection of data or a specific approach to data processing, not an independent data platform. Data Lakes may be built with various enterprise platforms, like relational database management systems, Hadoop, Google Cloud Platform, Azure Data Factory, Azure Analysis Services, and other services. This is where the second significant difference between on-premises Data Warehouses and cloud Data Lakes emerges:
Cloud Data Lake solutions uniquely detach the data storage functionality from the data computation functionality.
This may bring significant savings when it comes to storing and analyzing terabyte-size databases, since clients pay for computational power only when they use it.
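The arithmetic behind this saving can be sketched in a few lines. The prices, data volume, and usage hours below are hypothetical values chosen only to keep the numbers easy to follow; they are not quotes from any cloud provider, and `monthly_cost` is an illustrative helper rather than part of any real billing API.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month

# Hypothetical prices, for illustration only.
STORAGE_PRICE_PER_TB = 20.0    # currency units per TB per month
COMPUTE_PRICE_PER_HOUR = 4.0   # currency units per cluster hour


def monthly_cost(storage_tb, compute_hours):
    """Return the monthly bill as storage cost plus compute cost."""
    storage_cost = storage_tb * STORAGE_PRICE_PER_TB
    compute_cost = compute_hours * COMPUTE_PRICE_PER_HOUR
    return storage_cost + compute_cost


# Coupled model: compute stays up around the clock because it hosts the data.
always_on = monthly_cost(storage_tb=10, compute_hours=HOURS_PER_MONTH)

# Decoupled model: the same 10 TB stays in storage, while compute is
# started only for an assumed 60 hours of analysis per month.
on_demand = monthly_cost(storage_tb=10, compute_hours=60)

print(f"always-on: {always_on:.2f}")
print(f"on-demand: {on_demand:.2f}")
print(f"saving:    {always_on - on_demand:.2f}")
```

Under these assumptions the storage bill is identical in both cases, so the whole difference comes from the compute hours that are not billed while nobody is running queries.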
The final architecture of your Data Lake (DL) depends on various aspects, and the most crucial one is the data technology your company currently uses. For instance, some users have been working with SQL databases, SQL-based data exploration, and ELT pushdown. They may therefore require a relational Data Lake implemented with relational database tools. In turn, most companies use Hadoop as their preferred data platform for Data Lakes, since it is capable of linear scaling, supports various analytic techniques, and may cost less than similar relational configurations. Still other businesses might find Data Lakes implemented in a cloud service to be the most flexible, scalable, and agile.
The Data Lake approach is much more flexible than the traditional Data Warehouse infrastructure, since it does not require extensive preprocessing, cleaning, transformation, or other types of data preparation. Data may be stored and provided for analysis in its raw, original state, coming directly from the data source. The dataset cluster can then be activated and analyzed on demand; unlike traditional data storage solutions, it does not need to be active at all times. This can substantially lower data maintenance costs. At the same time, you do not lose the old functionalities, since
data stored in Data Lakes may be easily loaded into Data Warehouses or Data marts, or it can be directly consumed by analytical software, business intelligence, and others.
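The first of these paths, loading raw lake data into a warehouse table, could look roughly like the sketch below. It assumes the lake holds line-delimited JSON and uses an in-memory SQLite database as a stand-in for a real Data Warehouse; the `orders` table, the sample records, and the `load_into_warehouse` helper are all hypothetical.

```python
import json
import sqlite3

# Raw records as they might sit in the lake: one JSON document per line.
RAW_ORDERS = [
    '{"order_id": 1, "customer": "acme", "amount": "19.90", "channel": "web"}',
    '{"order_id": 2, "customer": "globex", "amount": "5.00"}',
]


def load_into_warehouse(conn, raw_lines):
    """Parse raw JSON lines, cast them to the table schema, and insert them."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, "
        "customer TEXT NOT NULL, "
        "amount REAL NOT NULL)"
    )
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        # Only the columns the warehouse model needs are kept; extra raw
        # fields such as "channel" stay behind in the lake.
        rows.append(
            (int(record["order_id"]), record["customer"], float(record["amount"]))
        )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse connection
loaded = load_into_warehouse(conn, RAW_ORDERS)
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(loaded, total)
```

The raw files remain untouched in the lake, so the same data can later be loaded into a different model or consumed directly by another tool.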
Data Lakes may support multiple functions and interfaces. The growing trend is to use a single Data Lake as a new, better area for data landing and staging, and also to use it in an exploratory, discovery-oriented way (the so-called “analytics sandboxes”) in order to find new and interesting data correlations. As the Data Lake approach does not impose a structure on the data, it allows new datasets to be added virtually on the fly.
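The idea of not imposing a structure at write time is often called schema on read. The sketch below illustrates it in plain Python under simple assumptions: the raw records, their field names, and the `read_with_schema` helper are hypothetical, and a real Data Lake would use a query engine rather than a hand-written loop.

```python
import json

# Raw records as they might land in the lake. No schema was enforced at
# write time, so the set of fields differs from record to record.
RAW_LINES = [
    '{"device": "watch-1", "heart_rate": 72, "ts": "2020-01-01T10:00:00"}',
    '{"device": "watch-2", "steps": 4100, "ts": "2020-01-01T10:05:00"}',
    '{"device": "watch-1", "heart_rate": "75", "firmware": "2.3"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time.

    `schema` maps field names to cast functions. Fields outside the schema
    are ignored, and fields missing from a record become None.
    """
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows


# Two consumers read the same raw data through two different schemas.
heart_rate_view = read_with_schema(RAW_LINES, {"device": str, "heart_rate": int})
steps_view = read_with_schema(RAW_LINES, {"device": str, "steps": int})

print(heart_rate_view)
print(steps_view)
```

Because the schema lives with the reader, a new kind of record (here, the one carrying `steps`) can be added to the lake without changing anything that was stored before.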
Lingaro’s Data Lake implementation ensures:
- All data brought together
- Trusted, quality data
- High adoption among decision makers to support both daily and strategic decisions
- Democratized access to data and elimination of data silos
- Speed of getting data and insights (fast time-to-value)
- Real-time decision analysis
- Smooth real-time dataflow
- SQL and other languages supported
- Scalable solution
- Secure data and processing
- Versatile (structured and unstructured) data
Therefore, Data Lakes are not a revolution as much as they are an evolution of existing technologies. They comprise multi-type data systems and allow for the development of hybrid data ecosystems, suitable, for example, for multichannel marketing or analysis.
The following table gives a brief comparison of on-premises DWs and cloud-based Data Lakes (Fig. 1).
| On-premises Data Warehouse | Cloud Data Lake |
|---|---|
| Relational data | Diverse types of data |
| Storage combined with computation | Storage detached from computation |
| Majority of refined calculated data | Majority of detailed source data |
| Entities and dependencies are known and tracked over time | Entities are discovered from raw data |
| Data must be integrated and transformed upfront | Data preparation on demand |
| Usually schema on write | Usually schema on read |
| Limited scalability | Scalability more adjustable |
| Integration with third-party software requires data transformations | Easy integration with third-party software |
Figure 1. Comparison of on-premises Data Warehouses and cloud-based Data Lakes.
Migration from on-premises Data Warehouses to Cloud Data Lakes
Re-platforming is not an unusual strategy in companies with DWs deployed on premises. More and more businesses move their databases to the cloud. What should they keep in mind when taking this important step?
When upgrading or migrating a DW, you obviously need to plan for the time span, risks and costs, business disruption, and the complexity of the whole undertaking. It is not only the data that moves to the new platform, but also its management and its users. Many DW migration projects end up dealing with uncontrolled data marts, or they reduce a vast number of databases by consolidating them onto fewer platforms. Therefore,
The ideal approach to implementation is to start small, preferably with a Minimum Viable Product (MVP). Start with a low-risk, high-value segment of work, and divide jobs into manageable segments, each with a technical goal and added business value. If you start with a bloated project, you will probably be quickly overwhelmed by its size and complexity. A multiphase project plan will deal better with the incoming challenges. Focusing on a segmented dataset that is easy to build and in demand by the business is often the best way to begin your data migration process. It will give others a sense of prioritization and the confidence to proceed to more complex data subsets.
During the migration you may run into some failures. Therefore, you should plan contingencies for risky milestones, and preferably develop automated testing and scripting for your systems to increase quality and avoid migration problems. For some time, your DW and DL, or at least some of their crucial parts, will function simultaneously. The duration of this period depends on the complexity and size of your databases, user groups, and processes.
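One simple form of such automated testing is a reconciliation check that compares a migrated table with its source while both systems run side by side. The sketch below is a minimal illustration, assuming each table can be read as a list of rows with a unique key column; `table_fingerprint`, `migration_matches`, and the sample rows are hypothetical names and data, not part of any migration tool.

```python
import hashlib


def table_fingerprint(rows, key):
    """Return (row_count, checksum) for a list of dict rows.

    Rows are sorted by `key` first, so the result does not depend on the
    order in which each system returns them.
    """
    digest = hashlib.sha256()
    for row in sorted(rows, key=lambda r: r[key]):
        # Serialize columns in a fixed order so equal rows hash equally.
        canonical = "|".join(f"{col}={row[col]}" for col in sorted(row))
        digest.update(canonical.encode("utf-8"))
        digest.update(b"\n")
    return len(rows), digest.hexdigest()


def migration_matches(source_rows, target_rows, key):
    """True if both tables have the same row count and the same content."""
    return table_fingerprint(source_rows, key) == table_fingerprint(target_rows, key)


warehouse_rows = [
    {"id": 1, "region": "EU", "revenue": 100},
    {"id": 2, "region": "US", "revenue": 250},
    {"id": 3, "region": "APAC", "revenue": 80},
]
# Same content, returned in a different order by the new platform.
lake_rows = [warehouse_rows[2], warehouse_rows[0], warehouse_rows[1]]
# A faulty migration that silently dropped one row.
truncated_rows = warehouse_rows[:2]

print(migration_matches(warehouse_rows, lake_rows, "id"))
print(migration_matches(warehouse_rows, truncated_rows, "id"))
```

A check like this can run after every migration phase. In practice it would usually be applied per table or per partition, and value formatting (dates, decimals) would have to be normalized before hashing, since the two platforms may represent the same value differently.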
Migration does not only mean moving and consolidating your system elements. It may require development, especially if you have lots of uncontrolled processes in your databases. The “Lift-and-shift” strategy is sometimes possible, but at other times you may be forced to develop data models and interfaces to maximize performance on the cloud platform. The lack of backward compatibility may force your team to develop some specific components and routines, like stored procedures and user-defined functions. Additionally, the quality of your data and your previous models may influence your new platform; cloud solutions are not a magic wand, so try not to migrate your old problems together with your old platform.
Data migration requires more than architecture and data modelling experts. It is important not to forget about the people who maintain the data, such as database administrators and system analysts. Moreover, the migration will affect many elements and departments of your company, including data modelling and analysis, reporting, dashboards, metrics, and BI. Each of these elements may be generated or supported by a different business branch. Use it to your advantage and remember that
migration to the cloud may be an opportunity to improve the quality of your data that was previously generated in various departments of your company.
Therefore, consider whether IT and managers should be the only ones involved in planning the modernization of your data warehouse, as other end users and departments might also provide significant insights.