Jose Hernandez
1 Year Ago
I often hear (and read about) discussions around the implementation of a data lake to replace a data warehouse, or deciding on a data lake or a data warehouse? It’s not an either-or question. Data lakes and data warehouses serve similar purposes but for different situations and must also work together to support modern BI and analytics needs.
Before I tackle “how data lakes and data warehouses work together”, I’d like to set a baseline understanding. I’ll start with the data warehouse.
Understanding the Data Warehouse
The data warehouse is, the data repository you go to, for highly processed data (structured) that serves to measure business processes (think metrics and KPIs). While it does provide atomic data to support analysis of business metrics and KPIs, the scope of the data warehouse is measurement of those known and well understood business processes. That said, the data warehouse supports descriptive analytics, think dashboards and reports used to run the business; to ask questions such as; how did we do? How does this period compare to last period? At Dunn Solutions we have implemented hundreds of data warehouses that do just that. I’m not implying that the data warehouse is limited to descriptive analytics, but that is where the strength of the data warehouse lies.
Understanding the Data Lake
Now the data lake. The data lake is also a repository of information, but unlike the data warehouse, the data lake is a repository for any data, in any form (structured, unstructured or semi-structured) that has not been processed in any way. The data sits in waiting until someone comes up with a reason to use it! The point is, lots of diverse data, in its raw form, is available for many uses. Think about the data scientist who needs to find patterns to explain outcomes (using ML). Think analyst who needs to determine root cause.
Bottom line, the data warehouse supports one version of the truth, for KPIs and metrics, used to measure business processes while data lakes support data exploration, for various audience, of unprocessed data for insightful discoveries.
Differences between Data Warehouse and Data Lake
Let’s look at and review some of the differences between data warehouses and data lakes:
|
Data Warehouse
|
Data Lake
|
Data
|
Highly processed, structured data
|
Structured, semi-structured and unstructured
|
Processing
|
Schema on write
|
Schema on read
|
Data Store
|
RDBMS or columnar databases
|
Hadoop, S3, Blob Storage
|
Versatility
|
Less agile, requires lots of work to incorporate new data and new uses
|
Very agile, easily scales to support quantity
|
Audience
|
Business users, business analysts
|
Data scientists, analysts, researchers
|
How They Work Together
Now that I’ve differentiated the two, let’s visit the question, “how do they work together?”. Data warehouses and data lakes support different uses cases, but it’s not hard to imagine that there are use cases that need both; highly processed data and raw data. For example:
- IOT data stored in a data lake could be summarized in fact tables
- Enrich metrics and KPIs with supporting raw data that further describes the transaction.
- Forecast using more attributes than are available in the data warehouse
- Combine sentiment analysis with descriptive dashboards
Challenges of Combining Structured and Semi-Structured Data
There’s a reason I have not written this blog until now. Combining structured data and semi-structured data to perform analytics is challenging. It’s challenging from a user’s point of view, and from IT’s point of view. I’d like to borrow two paragraphs from a previous Dunn Solutions' analytics blog posting: Move Your Enterprise Data to a Cloud Data Lake and Cloud Data Warehouse Right Now! By Ashish Trivedi.
The challenge comes from both fronts. From your (the analytics user) point of view getting to the data and integrating it is difficult and time consuming. After you get access to the data, you are doing lots of manual ETL or ELT to get structured data and semi-structured data to a state that you can use for analytics and usually that means the creation of rouge databases and spread marts. Even if you’re very clever, much of this data preparation and integration must be done each time you need fresh data. That’s no way to be a data driven organization.
From IT’s point of view, the effort and cost of maintaining a scalable environment that can support data warehouses and data lakes in house is daunting. It’s also challenging to expose these various sources in a user consumable way. IT’s time is split between supporting users and business requirements and supporting infrastructure. To be clear, supporting users and the business requirements is far more valuable.
A few years ago I wrote about the benefits of cloud computing: You Can Start Your Big Data Project Tomorrow. Basically, I said, “build your repositories and analytics solutions in the cloud so you don’t have to worry about infrastructure!” But there’s more to it! What I failed to mention is that the benefit goes well beyond gaining scalable infrastructure, at least it does now!
Cloud Vendors: More Than Just Infrastructure
Dunn Solutions works with several cloud vendors (AWS, Azure and Snowflake) and they provide more than just infrastructure, they provide amazing features that allow the querying between data warehouses and data lakes!
- AWS offers Amazon Athena for in-place SQL querying of S3 data and Amazon Redshift Spectrum lets you do the same across S3 data sets and Redshift.
- Azure offers Polybase which allows you to query data in Blob storage, Hadoop and SQL Server (it also works with other RDBMSs too i.e. Oracle, Teradata and MongoDB).
- Snowflake allows you to store semi-structured data with structured data and query both with ANSI SQL.
So, what’s the benefit? The data in your data warehouse and the data in your data lake can work together to bring more insight (while the data stays where it’s at). Data scientist always want more data than exists in the data warehouse; problem solved. Dashboards can be enriched with social media (think sentiment analysis overlaid with descriptive analytics).
Contact Dunn Solutions for Help
Dunn Solutions' data warehouse experts, data lake experts and analytics experts are ready to work with you to determine how you can take advantage of the synergy between data lakes and data warehouses. Whether you want to start with a POC and set up a data lake or move your data warehouse to the cloud, or your already in the cloud with your data warehouse and data lake, Dunn Solutions' analytics experts can help you make the most of it with Snowflake, AWS or Azure.
Contact Dunn Solutions and let's discuss your data warehouse and data lake project ideas!