Want to Build a Good Data Lake? Don’t Break These Four Rules!
For modern businesses, data lakes offer a practical way to evolve their data and analytics environments.
Data lakes also offer the dream of revolutionizing those data and analytics environments. Usually this dream involves big promises that a data lake will:
- meet all the current and future analytics needs of the entire organization, and
- unite, all by itself, all of an organization’s stakeholders around a shared data analytics vision.
Chasing this dream in your data lake project is an excellent way to ensure its failure.
To help ensure its success, keep your feet on the ground with these practical guidelines.
First, remember that a data lake is one unique piece of your larger data analytics puzzle
Data lakes are often marketed as standalone solutions, but in fact you will need to use them with other services and tools – like data warehouses – to achieve your business goals. Fundamentally, a data lake offers a powerful way to store your data. It won’t solve all your data analytics challenges.
Do not expect to build a data lake like you would a data warehouse, with star schemas, relations, and transaction-ready layers. A data lake is not a better version of a data warehouse. It’s a new addition to your solution stack. To get it right you’ll have to factor in a new set of variables.
Second, set a manageable scope for the project
Typically this means building a data lake for one business unit with narrowly defined users, use cases, and data sets. With this approach, you can focus your efforts on:
- building your data lake to accommodate the most relevant sources of truth within the business unit, and
- optimizing the data ingestion pipeline for a narrow range of metadata, data quality, and schema management.
Do not attempt to build a data lake that is everything to everyone in the organization. If you do, you can expect significant political pushback, as well as technical challenges in building semantic consistency across tremendous amounts of multi-type data from a wide variety of sources.
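To make the narrow-scope idea concrete, here is a minimal sketch of an ingestion check for a single data set. The schema, the field names, and the `sales_crm` source label are all hypothetical, and a real pipeline would likely rely on a dedicated schema or data quality tool rather than hand-written checks.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical schema for one data set owned by one business unit.
EXPECTED_SCHEMA = {"order_id": int, "customer_id": int, "amount": float}


@dataclass
class IngestResult:
    accepted: list = field(default_factory=list)
    rejected: list = field(default_factory=list)


def validate_record(record, schema=EXPECTED_SCHEMA):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for name, expected_type in schema.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(
                f"wrong type for {name}: expected {expected_type.__name__}"
            )
    return problems


def ingest(records, source, schema=EXPECTED_SCHEMA):
    """Split records into accepted and rejected, tagging accepted ones with metadata."""
    result = IngestResult()
    ingested_at = datetime.now(timezone.utc).isoformat()
    for record in records:
        problems = validate_record(record, schema)
        if problems:
            # Keep the reasons so data quality issues can be reported back.
            result.rejected.append({"record": record, "problems": problems})
        else:
            result.accepted.append(
                {
                    "data": record,
                    "metadata": {"source": source, "ingested_at": ingested_at},
                }
            )
    return result
```

With only one schema and one source to support, checks like these stay small enough to reason about. Each additional business unit multiplies the schemas, metadata conventions, and quality rules the pipeline has to reconcile.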
Third, clearly define the project’s success metrics upfront
Start by assessing your current state and setting realistic goals for information value, business value, and stakeholder value. Then map the milestones you will reach between here and there. This framework will help stakeholders understand where you have been, where you are going, and the business case for supporting the project and others like it in the future.
Do not follow a “build it and they will come” approach to data lake strategy. Stakeholders will not be blown away by your results – they will be gradually convinced one proof point at a time.
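One simple way to turn "current state, goal, milestones" into something reportable is to track how much of the gap between baseline and target has been closed. The sketch below is only an illustration. The metric name and the numbers are invented, and your own success metrics will differ.

```python
from dataclasses import dataclass


@dataclass
class Metric:
    name: str
    baseline: float  # measured current state at project start
    target: float    # realistic goal agreed with stakeholders

    def progress(self, current: float) -> float:
        """Fraction of the baseline-to-target gap closed so far."""
        span = self.target - self.baseline
        if span == 0:
            raise ValueError("target must differ from baseline")
        return (current - self.baseline) / span


# Hypothetical metric: hours needed to produce a weekly sales report.
report_latency = Metric(
    name="hours to produce weekly sales report",
    baseline=40.0,
    target=10.0,
)

# Halfway from 40 hours to 10 hours is 25 hours.
halfway = report_latency.progress(25.0)  # 0.5
```

Because the calculation works whether the target is above or below the baseline, the same structure can cover metrics you want to raise (adoption, data coverage) and metrics you want to lower (latency, cost).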
Fourth, keep your data lake compliant with your existing data management program
Follow your existing data policies to ensure that your data lake remains manageable moving forward. That means, among other things, automatically archiving data to keep the data lake’s size reasonable and enforcing existing data access rules.
Do not make exceptions to pre-set data guidelines. Your data lake can flexibly ingest data of multiple types and qualities, but doing so indefinitely introduces significant performance, regulatory, and privacy risks.
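As a rough illustration of those two policies, the sketch below selects objects that have aged past a retention window and checks read access against a role table. The 365-day retention period, the role names, the data set names, and the object layout are all assumptions made for the example. In practice these rules would come from your existing data management program and be enforced by your storage platform's own lifecycle and access controls.

```python
from datetime import datetime, timedelta, timezone

# Assumed retention window; use whatever your existing policy specifies.
RETENTION = timedelta(days=365)

# Hypothetical access rules: data set name -> roles allowed to read it.
ACCESS_RULES = {
    "sales_orders": {"sales_analyst", "data_engineer"},
    "customer_pii": {"data_engineer"},
}


def select_for_archive(objects, now, retention=RETENTION):
    """Return the paths of objects last modified before the retention cutoff."""
    cutoff = now - retention
    return [obj["path"] for obj in objects if obj["last_modified"] < cutoff]


def can_read(role, dataset, rules=ACCESS_RULES):
    """Deny by default: unknown data sets and unlisted roles get no access."""
    return role in rules.get(dataset, set())
```

The deny-by-default choice in `can_read` means that a data set ingested without an access rule stays unreadable until someone assigns one, which keeps new data under the same policy as the rest of the data lake.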
To recap:
- Your data lake will be one component of your data analytics ecosystem.
- Set the scope of your project from the bottom up.
- Stay compliant with existing policies.
- Be ready to demonstrate how your data lake fits in with, and advances, a larger data and analytics strategy. Build momentum with measurable successes.