Lakehouses, a convergence of traditional data warehouses and agile data lakes, are emerging as a promising approach to data management. Yet adopting this versatile framework often means strategically navigating uncharted migration paths.
What if you could have the best of both worlds — the structured, meticulous detail of a data warehouse and the flexible data storage of a data lake? That's the concept of a lakehouse — a solution that offers unprecedented versatility in managing your company’s data while not compromising on efficiency.
A lakehouse’s allure is undeniable, but the story of the journey there often goes untold. Transitioning to a data lakehouse from an existing ecosystem is much like venturing into uncharted waters: full of potential, yet rife with challenges.
A successful transition requires careful planning, a willingness to adapt, and a deep understanding of both the existing and the target architectures. In this blog post, we delve into two primary strategies to help illuminate the way, drawing on our hands-on experience at Lingaro.
Key migration approaches
Transitions of this nature primarily encompass two methods:
- Lift and shift: Migrating your existing setup as-is can be appealing, particularly if you already operate an efficient Databricks system. This approach simplifies postmigration validation because it leaves the current framework largely unchanged.
- Redesign and improve: More often, the introduction of a new platform highlights necessary improvements and modifications. Although this method carries a higher risk to migration success, it enables enhancements and optimizations.
Exploring the ‘lift and shift’ approach
Many clients initially assume that transitioning from Databricks Workspaces to Unity Catalog should pose no issues, given that both live on the same platform. However, while Databricks is primarily a platform for running applications, Unity Catalog is just one service within that platform. Applications have considerable flexibility, whereas Unity Catalog only reveals its full potential when those applications comply with specific rules.
We have encountered several challenges across a variety of clients when considering the lift and shift approach. The scenarios below may reveal whether any of them mirror your circumstances. The optimal solution strongly depends on multiple factors, such as the size of the current data platform, the number of applications, and the number of independent teams.
Scenario 1: "Our application stores Parquet datasets in storage (without Databricks tables)."
In this case, you have two potential options: either convert the data to Delta Lake or register external tables on the Parquet datasets in Unity Catalog. However, transitioning to Delta Lake might not be straightforward due to data type inconsistencies. If your upstream application has permission to write Parquet datasets directly to storage, you are likely to face this problem.
Furthermore, this might also be the case if applications have strong dependencies on the Parquet format. Even though Delta Lake stores data in Parquet files, it implements a different API, and some legacy technologies won’t be able to write or read it.
Retaining the data in Parquet format and setting external tables on top of it doesn’t entirely address the issue either. Unity Catalog itself doesn’t provide atomicity, consistency, isolation, and durability (ACID) transactions; that is a feature of Delta Lake. Choosing this route therefore means forgoing ACID transaction support, and without it consumers cannot be truly independent of changes to the source datasets.
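To make the two options more concrete, here is a minimal sketch of how each could look in a Databricks notebook. The storage path and the catalog, schema, and table names are hypothetical placeholders, and the exact syntax may vary with your Databricks runtime and Unity Catalog setup.

```python
# Option 1: convert the existing Parquet dataset in place to Delta Lake.
# Partitioned datasets would also need a PARTITIONED BY clause.
spark.sql("""
    CONVERT TO DELTA
    parquet.`abfss://landing@examplestorage.dfs.core.windows.net/sales/orders`
""")

# Option 2: keep the data in Parquet and register it as an external table
# in Unity Catalog. This preserves compatibility with Parquet-only readers
# and writers, but gives up the ACID guarantees that Delta Lake provides.
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.sales.orders_parquet
    USING PARQUET
    LOCATION 'abfss://landing@examplestorage.dfs.core.windows.net/sales/orders'
""")
```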
Scenario 2: "We're using All-Purpose Clusters; we’d like to keep it that way."
Migrating all-purpose clusters can be tricky, as there are several access modes available. Each access mode has unique properties — some facilitate access to Unity Catalog but may restrict other desired features.
The recommended setting for most cases is “shared,” but you must consider potential restrictions, for instance on third-party libraries and shell commands. This transition might disrupt your application’s functionality, because the legacy access modes granted more flexibility. Consequently, a thorough understanding of your current application landscape is vital for identifying where redesign is required and for quantifying the necessary effort.
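As an illustration, here is a minimal sketch of how a shared-access-mode cluster could be defined with the Databricks SDK for Python (assuming the databricks-sdk package and standard workspace authentication). The cluster name, node type, and runtime version are placeholder assumptions; in the Clusters API, the shared access mode corresponds to the USER_ISOLATION data security mode.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute

w = WorkspaceClient()  # picks up standard Databricks authentication

# Hypothetical cluster definition: the "shared" access mode in the UI maps to
# the USER_ISOLATION data security mode, which enables Unity Catalog access
# for multiple users but restricts some legacy capabilities.
w.clusters.create(
    cluster_name="uc-shared-etl",          # placeholder name
    spark_version="14.3.x-scala2.12",      # example runtime version
    node_type_id="Standard_DS3_v2",        # example Azure node type
    num_workers=2,
    data_security_mode=compute.DataSecurityMode.USER_ISOLATION,
    autotermination_minutes=60,
)
```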
Key takeaways:
- Understand your current application landscape and Unity Catalog requirements.
- Be ready for necessary changes in your applications.
- Stay tuned to Databricks’ latest features and improvement updates.
Exploring the ‘redesign and improve’ approach
In our previous article, we explored how clients often adopt the lakehouse platform to refine their architecture by consolidating various data storage and processing layers. While this consolidation offers advantages, it's important to understand that data lakes and data warehouses typically don't operate in isolation. They interact with various upstream and downstream applications, adding complexity to the transition.
Redesigning and improving can seem like low-hanging fruit; however, it can incur significant costs when changes extend to upstream and downstream applications. A lakehouse can indeed support many features, but the upstream and downstream applications that use it may face challenges in adopting them. Hence, while migrating your current environment and implementing these changes, remember that these applications may be affected.
Redesigning calls for a comprehensive approach; without one, you might end up supporting multiple solutions concurrently. Here are a few scenarios to illustrate this point:
- If you are contemplating eliminating your existing data warehouse, remember that these solutions offer specific SQL syntax and additional features that Databricks does not support. Migrating SQL code from one solution to another may not be a straightforward task, particularly when the capacity or plans of the team overseeing the data warehouse are unclear.
- If you are considering migrating your Parquet datasets to Delta Lake while retaining your data warehouse for compatibility reasons, note that not all data warehouses can read Delta Lake. You may find yourself having to maintain the old data processing pipelines that generate Parquet datasets, as illustrated in the sketch after this list.
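To illustrate the second scenario, here is a minimal, hypothetical PySpark sketch of the kind of dual output you may end up maintaining: the same processed data written once as a Delta table for lakehouse consumers and once as plain Parquet for a downstream warehouse that cannot read Delta Lake. The table names and storage path are assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical transformation shared by both outputs.
orders = (
    spark.read.table("main.sales.raw_orders")  # assumed Unity Catalog source table
    .withColumn("order_date", F.to_date("order_timestamp"))
)

# Output 1: a Delta table for lakehouse consumers (ACID, time travel, etc.).
orders.write.format("delta").mode("overwrite").saveAsTable("main.sales.orders")

# Output 2: plain Parquet for a legacy warehouse that cannot read Delta Lake.
# Keeping this second path alive is exactly the extra maintenance cost
# discussed above.
orders.write.mode("overwrite").parquet(
    "abfss://export@examplestorage.dfs.core.windows.net/sales/orders"
)
```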
Key takeaways:
- Data lakes and data warehouses usually don't operate in isolation. Their interactions with various applications add complexity to the transition.
- Redesigning systems might seem easy, but it can lead to high costs and challenges.
- A comprehensive approach is needed to avoid compatibility issues, such as when migrating or maintaining different data formats.
Conclusion
Keep these four points in mind when planning to construct a lakehouse over your existing data platform:
- Treat the lakehouse implementation as a typical migration project within your existing environment. Approaching it in an unrestrained, agile manner might result in adverse outcomes.
- Prepare for necessary redesigns and ensure that the affected applications are well informed so they can plan accordingly.
- Keep documentation up to date and clearly delineate application dependencies. If this is not yet in place, start now.
- Understand that each challenge demands context-specific solutions; a single ideal solution seldom exists.