Ashish Trivedi
1 Year Ago
Combining structured data, semi-structured data to perform analytics is hard. If you’re like most, the data that you need and use every day are spread out throughout your organization. Some is found in your enterprise data warehouse, some in spreadsheets, and some in ODSs. This only accounts for structured data. Semi-structured data will be found in XML and JSON files, and some in text files like CSV. The problem is obvious, data is everywhere and integrating it for analytics (descriptive analytics, predictive analytics and prescriptive analytics) is difficult.
The challenge comes from both fronts. From your (the analytics user) point of view getting to the data and integrating it is difficult and time consuming. After you get access to the data, you are doing lots of manual ETL or ELT to get structured data and semi-structured data to a state that you can use for analytics and usually that means the creation of rouge databases and spread marts. Even if you’re very clever, much of this data preparation and integration must be done each time you need fresh data. That’s no way to be a data driven organization!
From IT’s point of view, the effort and cost of maintaining a scalable environment that can support data warehouses and data lakes in house is daunting. It’s also challenging to expose these various sources in a user consumable way. IT’s time is split between supporting users and business requirements and supporting infrastructure. To be clear, supporting users and the business requirements is far more valuable.
Cloud-based Data Warehouses and Data Lakes
Cloud-based computing and SaaS have really changed the way we look at infrastructure. Take a look at the blog Jose Hernandez, Director of Analytics at Dunn Solutions Group, wrote a couple of years ago to check out all the benefits of analytics in the cloud. Cloud-based data lakes and cloud-based data warehouses help overcome these challenges and several offerings from Snowflake, AWS and Azure let us set up these environments with relative ease and at a pay-by-the-drink cost structure. With all the benefits of these cloud-based data infrastructures, you should move your data there right now!
Let’s explore some of these challenges and how cloud-based solutions like Snowflake, AWS and Azure provide solutions.
Data Warehouses and Data Lakes
If the data were in the same place, meaning the same environment or from the same interface, that would make accessing much easier. Snowflake, AWS and Azure all support data lakes and data warehouses in the same environment. Data warehouses support one version of the truth; KPIs and metrics we use to measure business processes while data lakes support the flexibility to do insightful discoveries from various data sources. This means that users would have fewer places to go and look for data and can more easily combine data from data warehouses and data lakes. From an IT perspective, corporate data would be stored, managed and controlled in one place.
One of the most challenging aspects of doing effective descriptive, predictive and prescriptive analytics is integrating data from various systems in the enterprise. I’m not talking about the ETL used in data warehousing, I’m referring to integrating data on consumption. The most common tool business users use to integrate data is spreadsheets. While this is a resourceful use of spreadsheets, it is not optimal or without challenges.
Snowflake helps you handle this challenge by allowing you to use SQL to access structured data in relational tables along with semi-structured data (JSON and XML for example) using dot notation or by creating views on those semi-structured file types. Basically, with ANSI SQL you can combine structured and semi-structured data. Azure and AWS offer their own flavors of combining structured and semi-structured (and unstructured) data. Azure has Polybase and AWS has Athena.
The examples I gave here focus on the query side but Snowflake, Azure and AWS all have data movement and data integration tools that help with the orchestration and process of ETL or ELT. Snowflake has Snowpipe and works with many ETL tools. AWS has Kinesis (for real-time data streaming) and Data Pipeline for data workflow orchestration. Azure has Data Factory and of course SSIS. There are many options and not enough space on this blog to cover them, but Dunn Solutions' data warehousing and data lake consultants can help you flush through those details.
This is a game changer. One huge blocker for supporting analytics, especially predictive and prescriptive analytics is the time it takes to spin up the environments needed to support the analysis. Typically, data lakes are needed to support all types of data and in huge quantities. On-premise environments can take weeks or months to requisition the needed resources, set them up (configure the operating systems and needed software) and get them into the hands of the analysts. With cloud-based data lakes and data warehouses, you can have an environment of any size in minutes. Snowflake, Azure and AWS all support the spinning up and spinning down of environments on demand.
Scaling to the appropriate size is critical to your data warehouse and your data lake. On-premise environments are sized, purchased and implemented to support a projected need (both in storage capability and processing power). Typically this is done looking out 3 to 5 years. There is a large initial investment in the required hardware and then maintenance and support. Your ability to scale is limited to your overall environment limits. Cloud-based data lakes and data warehouses from Snowflake, AWS and Azure can scale up or down as needed. If you need a data lake to support a short-term need, say 3 months, spin it up and then shut it down when done. If your data warehouse has grown and needs more storage and more processing power to meet growing demand, add storage and processing power and have it ready in a matter of hours.
Pay for what you use
Bottom line, no huge capital expense is needed to get started. This is another area where cloud-based solutions such as Snowflake, AWS and Azure shine. There are nuances between these vendors on how they charge for using their data lake and data warehouse solutions, but typically you pay for what you use. Snowflake’s offering separates storage and compute providing the most flexibility and efficiency. All three also provide discounts if you agree to use a minimum amount of storage and compute.
Infrastructure Maintenance and Support
From an IT perspective, Snowflake, AWS and Azure relieves them of dealing with hardware and the required infrastructure to support the data lakes and data warehouse. It can then focus on supporting the user requirements, data integration, data integrity and data governance. It’s really a win-win for IT and the data consumers.
Given all these benefits, it makes perfect sense to move your data warehouse and data lake to the cloud. It does not have to be a big-bang approach, you can pace your migration to the cloud to meet your needs. I, and the rest of the Dunn Solutions' data warehouse and data lake experts can work with you to determine what works best for you and we can lay out a migration plan. Whether you want to start with a POC and set up a data lake or move your data warehouse to the cloud, Dunn Solutions can help you get there with Snowflake, AWS or Azure.
We would love to discuss with you your plans for moving your data lake or data warehouse (or both) to the cloud. Contact us today to start the conversation!