A Unique Data Lake Implementation Approach: Your Journey with Lingaro to the Cloud(s)
You have probably read dozens of theoretical articles on Data Lakes: introductions, best practices, reference architectures, and so on. At Lingaro, we believe you learn best by acquiring knowledge from practitioners. We have been developing Data Lake infrastructures and migration procedures for many years, and have some advice on the process. In this article, we want to share our experience and discuss practical ways of implementing Data Lakes in your company.
We will focus on:
- Data Lakes’ optimal architecture
- Lingaro’s architecture of Data Lake systems
- Client benefits
Data Lakes’ optimal architecture
A Data Lake (DL) is a repository for very large amounts of structured, semi-structured, and unstructured data: BLOBs, images, sound, exports from SQL and other databases. There is no fixed limit on size or file type in a Data Lake, and data can be stored in its native format. The data structure does not have to be predefined before the data is queried or analyzed (an approach often called schema-on-read). Therefore, the design of the system handling the Data Lake is crucial, as it directly influences speed and performance.
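To make the schema-on-read idea concrete, here is a minimal, self-contained Python sketch (all paths, payloads, and field names are hypothetical): raw objects sit in the lake in their native formats, and a schema is applied only at the moment a query runs.

```python
import csv
import io
import json

# A toy "lake": raw objects kept in their native formats, with no upfront schema.
# All paths and payloads below are invented for illustration only.
lake = {
    "sales/2024/orders.json": json.dumps([{"id": 1, "amount": 120.5}]),
    "sales/2024/orders.csv": "id,amount\n2,80.0\n",
    "media/logo.png": b"\x89PNG...",  # binary BLOBs live alongside tabular data
}

def read_orders(path, raw):
    """Schema-on-read: interpret the bytes only when the query runs."""
    if path.endswith(".json"):
        return json.loads(raw)
    if path.endswith(".csv"):
        return [{"id": int(r["id"]), "amount": float(r["amount"])}
                for r in csv.DictReader(io.StringIO(raw))]
    return []  # non-tabular objects are simply skipped by this query

# The "query": scan only the sales prefix, impose the schema on the fly.
orders = [row for path, raw in lake.items()
          if path.startswith("sales/") and isinstance(raw, str)
          for row in read_orders(path, raw)]
total = sum(row["amount"] for row in orders)
```

The point of the sketch is that the JSON file, the CSV file, and the image coexist in one store; the structure is decided by the reader, not at ingestion time.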
There are quite a few Data Lake architectures on the market. We believe that the best way to implement a cloud-based Data Lake project is to start with an adoption analysis, develop a general data-migration strategy (if, for example, the client plans to move from a Data Warehouse to a Data Lake), then focus on developing a Minimum Viable Product (MVP), and only then proceed to a more complex project, e.g. one embracing the whole database (Fig. 1).
Lingaro’s Unique Data Lake Implementation Model
Business view of Data Lake transformation
The Minimum Viable Product satisfies critical business needs with sufficient functionalities. Its iterative building process allows for prompt feedback for further product evolution. Fast time to value is one of the key objectives and expectations.
In the subsequent phase, new data types are added, and more focus is put on a common understanding, consistency, and accuracy of the data.
Based on these learning experiences, new enhancements and features are proposed and implemented. Work focuses on expanding Data Lake use cases and driving further adoption, while making sure that established users are not impacted by the changes.
The sunset phase provides the opportunity to phase out the legacy systems that were part of the transformation journey.
Figure 1. Lingaro’s roadmap to Data Lake-based development/migration system.
This step is crucial: one of the gravest mistakes a C-level decision-maker can commit is to underestimate the size of a Data Lake project and try to develop or migrate the whole system at once. We therefore recommend first creating an MVP, testing it with your data, and only then adapting the solution to all of the organization’s needs. If you rework the whole system at once, you risk disrupting your business’s reporting and other systems. Most often, the DW is not moved to the new system in its entirety but migrated partially and gradually; it serves as a backup during development of the new system. Such a DW may even remain a source of data for the Data Lake, until the developer decides to slowly retire the most cost-generating and least effective DW functionalities and repositories. In this way, Data Lakes broaden DWs’ analytical capabilities. Some datasets now present in DWs are better suited for Data Lakes, but were originally placed in Data Warehouses because there was no alternative at the time. Traditional Data Warehouses (or Enterprise Data Warehouses), together with Data Lakes and other sources such as streaming, comprise Logical Data Warehouses.
This is why Lingaro first focuses on consultations and aims to understand our clients’ needs. Fast time-to-value of the initial MVP, followed by continuous work on the Data Lake solution and its services and ongoing assessment of cost-effectiveness, finally leads to the implementation of the ultimate project (depending on the client’s needs). Very often the optimal goal is a (temporary) coexistence of the Data Warehouse and Data Lake solutions, not a rapid “lift-and-shift” migration to Data Lakes.
Lingaro’s architecture of Data Lake systems
Data analytics in companies may require accessing data across multiple systems or platforms. This is not only time-consuming, slowing down strategic decision processes, but can also be quite costly. To derive more value from the data, faster, we propose building a new, advanced infrastructure based on cloud Data Lakes.
Data-Warehouse-based solutions most often deliver siloed data, which limits the performance of analytical applications. This may result in the development of insufficient, trimmed-down shadow BI solutions without a true high-level overview of the company’s business performance. With these underperforming shadow BIs, different company departments cannot integrate their data; ultimately a chaotic system emerges with multiple versions of the truth. Even though DWs are still mainstream, they do not allow for fast and efficient use of all new and old data sources: it may be hard to connect them to existing IT ecosystems and aggregate them with other data. Data Lake ecosystems, on the other hand, may alleviate these problems, since they combine multiple data sources in one place, in a harmonized way, and make them available for building solutions and functionalities in one unified toolset. This allows for fast and efficient decision-making based on up-to-date, editable data of various types. Lingaro has experience in creating scalable, cloud-based, platform-agnostic solutions, including such technically challenging modules as ETL frameworks and Master Data Management tools.
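As a toy illustration of such harmonization (the source names, field names, and exchange rate below are all hypothetical), consider two departmental extracts with inconsistent schemas being merged into one unified view:

```python
# Two departmental sources with inconsistent field names and currencies,
# harmonized into a single view. All names and values are invented.
crm_rows = [{"cust_id": "A1", "revenue_usd": 100.0}]
erp_rows = [{"customer": "A1", "rev_eur": 50.0}]

EUR_TO_USD = 1.1  # assumed fixed rate, for illustration only

def harmonize(rows, id_key, amount_key, rate=1.0):
    """Map a source-specific schema onto the shared, harmonized schema."""
    return [{"customer_id": r[id_key], "revenue_usd": r[amount_key] * rate}
            for r in rows]

unified = (harmonize(crm_rows, "cust_id", "revenue_usd")
           + harmonize(erp_rows, "customer", "rev_eur", rate=EUR_TO_USD))

# One consolidated total per customer: a single version of the truth.
totals = {}
for row in unified:
    totals[row["customer_id"]] = totals.get(row["customer_id"], 0.0) + row["revenue_usd"]
```

The design choice worth noting is that each source keeps its native schema; only the thin harmonization layer knows the mappings, which is what lets new sources be added without touching existing consumers.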
Lingaro recommends starting with design workshops in order to establish common ground, understand our clients’ business needs, and determine the data patterns and solutions to be implemented in the new system. We opt for cloud-based Data Lakes, since they provide advanced data analysis and clustering, and efficient in-memory processing. Cloud-based functionalities allow for flexible scalability and performance that is difficult to attain with on-premises solutions. Additionally, we understand that an independent and efficient system requires detailed documentation, and therefore we always produce it.
Documentation is necessary if our client’s IT team is to be independent and have a clear overview of the company’s data processing logic, major data flows, and the business logic behind them.
We have delivered various cloud-based data solutions, including a global-scale Azure data deployment. Our solutions usually cover the following features and functionalities:
- A central repository for all the company’s data assets, including crucial ones regarding POS, shipment analytics, and store execution measurement.
- Improved Master Data Management quality, with tiers that allow local master data control alongside comprehensive global reporting.
- Uploading and modifying local reference data for key business users via a web interface.
- Reference data harmonization, expanding data analysis capabilities across internally and externally produced data sets.
- Accelerated creation of data marts and BI applications.
- Scalability, flexibility and great performance with a technology-agnostic platform.
Figure 2. Elements of a cloud-based Data Lake infrastructure offered by Lingaro.
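The tiered master-data idea from the list above can be sketched in a few lines of Python; the markets, codes, and product names are invented for illustration. Local teams control their own reference codes, and a mapping layer resolves them to global master codes for consolidated reporting:

```python
# Tiered master data: local reference codes map to a global master code,
# so local teams keep control while global reporting stays consistent.
# All codes, markets, and product names below are hypothetical.
global_master = {"G-001": "Sparkling Water 500ml"}  # global product catalog
local_maps = {
    "DE": {"DE-77": "G-001"},  # each market maintains its own mapping tier
    "PL": {"PL-12": "G-001"},
}

def to_global(market, local_code):
    """Resolve a market-specific code to its global master code."""
    return local_maps[market][local_code]

# Local sales records, keyed by local codes.
sales = [("DE", "DE-77", 10), ("PL", "PL-12", 4)]

global_report = {}
for market, code, units in sales:
    g = to_global(market, code)
    global_report[g] = global_report.get(g, 0) + units

# Attach catalog names for the business-facing view.
named_report = {global_master[g]: n for g, n in global_report.items()}
```

In this arrangement, correcting a local code requires editing only that market’s mapping tier; the global catalog and the reporting logic remain untouched.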
A well-designed cloud-based Data Lake system is easily accessible not only to IT specialists, but also to other company employees and end users. This requires developing a toolset and interface that make crucial data and KPIs from different sources easily accessible in one view. This, in turn, supports faster and more advanced analysis and trend investigation, better high-level decisions, and quicker reactions to changing market trends.
Data Lakes not only consolidate different data types into one repository, but also – if built properly – allow for fast creation of new BI applications.
This functionality is the first step in developing an advanced-analytics, global-scale data enterprise. The final system is able to combine filtered and secured data into regional data hubs and downstream applications. The data can be kept up to date, with reporting, extraction, and analysis available to business users in one toolset. The updated data may be accessed with BI tools from any device: laptops, smartphones, and tablets. This makes the decision process fast and smooth, without technological bottlenecks.
Data Lake-based solutions comprise a data ecosystem that is not only simple and scalable, but also leads to long-term savings on infrastructure, hardware, and their support.
We have experience handling Big Data of more than 25 TB for 200+ users. This kind of data ecosystem would be costly to maintain on an on-premises DW; activating specific clusters on demand for analysis and reporting is more cost-effective. After we set up a system handling the data, it became a central hub, easily operated by its business users without the need for our further support.
Gartner has recently written that companies will soon be drowning in data if they do not devise a plan to deal with it. This means that IT professionals need to develop novel ways to prevent issues caused by the sheer volume of data. Novel ways, like Data Lakes.