A data lakehouse is a modern data architecture that combines the best aspects of a data lake and a data warehouse, harnessing the strengths of both paradigms while alleviating their inherent limitations. While numerous online articles and white papers explain in detail what a data lakehouse is and what its virtues are, they often overlook the challenges involved in introducing and migrating to these platforms. Those challenges, and how to tackle them, are what this blog post is about.
Among our clients, the primary motivations for adopting lakehouse platforms include:
Simplified data platform architecture. A data lakehouse reduces complexity by serving as a unified platform that encompasses all essential data management components.
Cost reduction. By streamlining their data architectures, enterprises can reduce operational expenses.
Unified access layer and data governance. The semantic layer makes data access and analysis easier while also making it more feasible to enforce data quality, access controls, and data lineage.
Open-source foundation. Using open-source technologies allows for compatibility with existing solutions, which facilitates a smoother transition.
Simplicity and a rich platform offering. The simplicity of a data lakehouse often garners favor among end users, which fosters higher adoption rates, while the platform itself offers a broad range of capabilities.
However, the charm of these drivers can sometimes prove deceptive. It is crucial to recognize that venturing into the realm of data lakehouses without thoroughly understanding it can lead to unforeseen challenges. This becomes particularly evident when transitioning from well-established data solutions to new ones while users are only accustomed to existing workflows. Evaluating the true cost of an extensive platform development project, migrating to the new platform, and adapting the integration methods of other applications already sounds complex enough to begin with.
Looking back at the data lakehouse projects we have delivered, we find that clients commonly assumed their project would be much simpler than it actually was. When we revealed the complexities involved, they appreciated how we prevented them from making mistakes that could have delayed the project and would have been expensive to fix.
Relying on a single platform backed by a single engine and storing data in a uniform manner is seldom the optimal choice for every use case, regardless of scale. If your organization’s data workloads primarily consist of high-throughput downstream queries and row-oriented, flat-list reports, there's a risk that you won't be satisfied with the performance or the final costs. Let's dive into the technical details and facts:
Data is stored in a columnar format within Parquet files. Note that Delta Lake is built on top of Parquet.
Accessing this data requires a running cluster that either reads the files containing the records from storage outside the cluster or retrieves them from a cache already loaded on the cluster, which might require a refresh.
Each file should typically range from 128 MB to 1 GB in size, which means retrieving a single row takes more time than it would with, for example, SQL Server's 8 KB pages.
While a single Delta Lake table can be optimized in various ways to facilitate faster data access (a short sketch follows this list), there's no one-size-fits-all solution.
It's important to understand that we store data in diverse formats and layouts and maintain multiple indexes for specific reasons. Some of these formats are designed for high performance, which is crucial for end-user applications, such as those working with relational databases. It's worth emphasizing that Spark clusters are not a direct replacement for relational databases.
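As a minimal sketch of what such layout tuning can look like, assuming a Spark environment with Delta Lake available (the table and column names below are hypothetical), compaction and data clustering might be applied like this:

```python
# Minimal sketch: Delta Lake table maintenance, assuming Spark with the delta-spark
# package installed. The table and column names below are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-maintenance-sketch")
    # Enable Delta Lake extensions (not needed on platforms where Delta is preconfigured).
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Compact many small Parquet files into larger ones (closer to the 128 MB - 1 GB range)
# and co-locate rows with similar customer_id values so selective queries scan fewer files.
spark.sql("OPTIMIZE sales_events ZORDER BY (customer_id)")

# Partitioning is another layout choice; it helps queries that filter on the partition
# column but can hurt others, which is exactly the trade-off discussed above.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_events_by_day
    USING DELTA
    PARTITIONED BY (event_date)
    AS SELECT * FROM sales_events
""")
```

Whichever layout you pick favors some consumers over others, which is why a single table rarely serves every workload equally well.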
Takeaways:
A "one-source-of-truth" approach will not necessarily result in a single table that all consumers will use.
Don't assume that you can eliminate other serving layers. They might still be essential to handle high-throughput queries effectively.
Optimization techniques differ from what you might be accustomed to in a relational database management system (RDBMS), so careful planning is necessary.
Understand that this is not an RDBMS. Mechanisms that work there might not be directly applicable here, and performance expectations for single-row operations will differ.
It's often easier to propose that data should be stored in a single location and accessed in a uniform manner than it is to devise an effective strategy for managing clusters and controlling costs.
When navigating this terrain, you'll need to make critical decisions, including selecting the cluster size, determining the number of instances, and defining scaling parameters for both dedicated pools and serverless configurations. An ill-conceived plan can lead to performance issues or budget overruns.
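To make those decisions concrete, here is an illustrative sketch of the kind of knobs involved, expressed as a Python dictionary shaped like a Databricks cluster definition; the values and node type are assumptions for illustration rather than recommendations, and other platforms expose similar settings under different names:

```python
# Illustrative sketch only: the kind of cluster settings that have to be decided up front.
# Field names follow the Databricks Clusters API; the values are assumptions, not
# recommendations, and other platforms expose similar knobs under other names.
shared_bi_cluster = {
    "cluster_name": "shared-bi-serving",      # hypothetical name
    "spark_version": "13.3.x-scala2.12",      # an LTS runtime, chosen for illustration
    "node_type_id": "Standard_DS4_v2",        # instance size drives both performance and cost
    "autoscale": {
        "min_workers": 2,                     # baseline capacity for expected throughput
        "max_workers": 8,                     # cap that bounds the worst-case bill
    },
    "autotermination_minutes": 30,            # trade-off between warm-up time and idle cost
    "spark_conf": {
        # Consumers with similar workloads share this cluster instead of each
        # bringing their own, so tuning here benefits all of them.
        "spark.sql.shuffle.partitions": "64",
    },
}
```

Whether such a shared cluster, a serverless endpoint, or per-team clusters is the right call depends on the throughput, warm-up, and downstream expectations discussed next.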
Several factors must be considered to align your strategy with end-user needs, including expected throughput, acceptable warm-up time, and downstream performance expectations.
You can achieve significant cost savings by judiciously employing caching capabilities and routing similar queries to the same clusters; a brief sketch follows below. This approach counters the trend toward extreme self-service and “bring your own cluster” setups.
Cost separation and chargeback mechanisms are crucial elements that require the entire organization's awareness and participation. They are not without associated costs.
In one of our upcoming blog posts, we will delve further into the last two points.
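As a small illustration of the caching point above, assuming a running Spark session and a hypothetical Delta table named daily_kpis that many similar dashboard queries hit, a frequently reused dataset can be pinned on the cluster so queries routed to it avoid re-reading Parquet files from object storage:

```python
# Minimal sketch, assuming a running SparkSession named `spark` and a hypothetical
# Delta table `daily_kpis` that many dashboard queries hit with similar filters.

# Cache the hot table on the cluster so repeated queries read from memory or local disk
# instead of re-fetching Parquet files from object storage on every run.
spark.catalog.cacheTable("daily_kpis")

# Queries routed to this cluster now benefit from the warm cache.
top_regions = spark.sql("""
    SELECT region, SUM(revenue) AS revenue
    FROM daily_kpis
    WHERE kpi_date >= date_sub(current_date(), 7)
    GROUP BY region
    ORDER BY revenue DESC
""")
top_regions.show()

# Release the memory when the workload pattern changes.
spark.catalog.uncacheTable("daily_kpis")
```

The cache only pays off if similar queries actually land on the same cluster, which is why routing and cluster strategy belong together.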
Takeaways:
Make a concerted effort to assess and map out dependencies between data publishers and consumers. Identify critical areas for the business and prioritize requirements accordingly.
Begin by planning your cluster strategy to avoid incurring unnecessary expenses.
Regrettably, there is no universally accepted standard definition of atomicity, consistency, isolation, and durability (ACID). Each database engine defines its own specifics and corner cases, making it challenging to ascertain whether ACID transactions are truly present or not.
The primary challenge arises from a lakehouse's ability to accommodate various applications simultaneously without affecting one another, provided everything is well planned. This distributed environment often requires one consumer to access multiple datasets owned by different publishers. Consequently, if your definition of “consistency” implies that all source tables must be “consistent,” it becomes exceedingly challenging, especially without additional system support, to determine the right moment at which to orchestrate all parties.
Consider the scenario of a single publisher needing multiple transactions to fully prepare their dataset for consumption. A lakehouse does not allow multiple transactions to be combined into a single atomic operation. Consequently, a publisher may encounter errors during processing, leaving the table in a partially ready state. Communicating this to downstream systems so they do not consume incomplete data becomes a must. The question is, how should this be done?
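One possible pattern, sketched below under the assumption of a Spark plus Delta Lake environment (the control table, dataset names, and version bookkeeping are all hypothetical conventions, not built-in features), is for the publisher to record a “ready” marker only after all of its writes succeed, and for consumers to time-travel to the last published version instead of reading whatever happens to be current:

```python
# Sketch of a hand-rolled readiness protocol, assuming Spark with Delta Lake.
# The tables `orders_gold`, `orders_silver`, and `publication_control` and the
# control-table schema are hypothetical; this is one possible convention.
from delta.tables import DeltaTable


def publish_orders(spark):
    # Several independent Delta transactions prepare the dataset. If any of them
    # fails, no "ready" marker is written, so consumers keep using the previous version.
    spark.sql("DELETE FROM orders_gold WHERE order_date = current_date()")
    spark.sql("INSERT INTO orders_gold SELECT * FROM orders_silver WHERE order_date = current_date()")

    # Final step: record the Delta version that consumers are allowed to read.
    ready_version = (
        DeltaTable.forName(spark, "orders_gold")
        .history(1)
        .select("version")
        .first()["version"]
    )
    spark.sql(f"""
        INSERT INTO publication_control (dataset, ready_version, published_at)
        VALUES ('orders_gold', {ready_version}, current_timestamp())
    """)


def read_published_orders(spark):
    # Consumers look up the last version marked as ready and time-travel to it,
    # ignoring any half-finished writes that happened afterwards.
    ready_version = spark.sql("""
        SELECT MAX(ready_version) AS v
        FROM publication_control
        WHERE dataset = 'orders_gold'
    """).first()["v"]
    # Time-travel read by table name; requires a Delta Lake version that supports it.
    return spark.read.option("versionAsOf", ready_version).table("orders_gold")
```

The point is not this particular convention but that some agreement between publishers and consumers has to exist, because the platform will not enforce it for you.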
Takeaways:
It's imperative to delve deeper and verify whether the ACID properties provided by the lakehouse platform align with your specific requirements. Avoid making direct comparisons with ACID in your existing RDBMS.
If you opt for full self-service and independence among various applications, remember that it comes at a price.
In upcoming articles, we will provide insights on crucial considerations for planning a lakehouse within your organization’s existing environment and delve deeper into why lakehouses pose organizational and technical challenges. We will also present a case study that illustrates how making the right decisions at the right time can ensure the success of your project.