If you have been following the big data and analytics discussion for some time, you have no doubt come across Hadoop, the granddaddy of infinitely scalable, collect-everything data platforms. Hadoop, still available as open source but now mostly used by businesses under a supported model, was one of the first widely used systems that allowed very large datasets to be collected and queried without relying on older relational database technology (think: Oracle, Microsoft SQL Server). Hadoop data was stored in an unstructured manner, which meant that any and all kinds of information, like web page logs, full text, images, and even video and audio, could be loaded into its linearly scalable computing and storage environment.
Great! But Hadoop originally could only be queried through custom programming. There are now many front ends that mimic SQL queries for Hadoop, but, as I will discuss later, why even bother? Hadoop is infinitely flexible, but that flexibility comes at a cost: specialized resources to manage it and limits on the ease of querying, which I argue is what 99% of my business customers want and need.
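To see what "custom programming" meant in practice, here is a minimal sketch of the classic MapReduce word count, written in plain Python rather than Hadoop's actual Java API. The record data is invented for illustration; the point is that even a simple aggregation had to be expressed as explicit map and reduce phases, where SQL needs one GROUP BY.

```python
from collections import defaultdict

# Toy log lines standing in for records spread across a cluster.
records = ["error timeout", "ok", "error disk", "ok", "error timeout"]

# Map phase: emit a (key, 1) pair for every word in every record.
mapped = [(word, 1) for line in records for word in line.split()]

# Shuffle/reduce phase: group pairs by key and sum the counts.
counts = defaultdict(int)
for key, value in mapped:
    counts[key] += value

print(dict(counts))  # {'error': 3, 'timeout': 2, 'ok': 2, 'disk': 1}
```

The SQL equivalent is a single statement, roughly `SELECT word, COUNT(*) FROM words GROUP BY word`, which is exactly the kind of query the tools discussed below run natively.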
It may be interesting and profitable for an IT professional to learn the ins and outs of the Hadoop infrastructure and all its companion products (Spark, Hive, Flume, MapReduce, Impala, and others), but most business analysts and data scientists want the database to be an appliance. It's like having to learn how to build a transmission just because you want to drive a car down the road. Furthermore, all the tools they know how to use natively speak SQL and connect easily to SQL data sources.
It is also important to take a step back and look at the majority of data being collected and used by most businesses: it is structured and semi-structured. Sure, there are specialized instances where certain industries may need to analyze video streams or photos, but those are a very small subset of what our clients really want to do with their data. Customers, orders, products, web clicks, manufacturing defects, marketing spend, and clinical health records are all examples of structured and semi-structured data that would be easier to import, query, and manage with today's modern cloud solutions like Snowflake, AWS Redshift, or Azure Synapse.
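Querying semi-structured data with plain SQL looks roughly like the sketch below. It uses Python's stdlib sqlite3 (assuming a SQLite build with the JSON1 functions, standard in recent versions) as a stand-in; the table, the JSON order events, and the column names are all invented for illustration. The cloud warehouses expose similar JSON-path functions for their semi-structured column types.

```python
import sqlite3

# In-memory table holding semi-structured order events as raw JSON strings.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (event TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?)",
    [
        ('{"customer": "acme", "total": 120.5}',),
        ('{"customer": "acme", "total": 80.0}',),
        ('{"customer": "globex", "total": 42.0}',),
    ],
)

# Plain SQL reaching into the JSON payload: aggregate spend per customer.
rows = conn.execute(
    """
    SELECT json_extract(event, '$.customer') AS customer,
           SUM(json_extract(event, '$.total')) AS spend
    FROM orders
    GROUP BY customer
    ORDER BY spend DESC
    """
).fetchall()

print(rows)  # [('acme', 200.5), ('globex', 42.0)]
```

No schema migration was needed to add a new field to the events; the JSON path in the query is the only thing that would change, which is much of the appeal for semi-structured workloads.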
Each of the above-mentioned solutions handles close-to-infinite volumes (or close enough for most businesses) of structured data. A few even handle both structured and semi-structured data. The big difference is that these tools act more like appliances.
Our clients don’t need to worry about clusters, hardware, scalability, backups, and so on; all of that is taken care of by the vendor. In addition, these solutions natively support SQL and deliver sub-second response times even for very large datasets. Fewer toys for IT to play with and more return on data for the business.
On a recent project, Dunn Solutions was able to go from whiteboard to a complete data lake in 30 days with these products, and that includes data ingestion. What it didn’t require was learning six different arcane languages and systems that needed to be spliced together and managed.
The bottom line is that with today’s super scalable cloud databases most businesses should bypass Hadoop and spend their energies creating value with their data.
Dunn Solutions loves big data, both structured and unstructured, but even more we love delivering business value from that data through machine learning, predictive models, and reporting. Please contact us if you want to partner on your big data initiative.