By Robert Whelan, 2nd Watch
With the typical enterprise using hundreds of SaaS solutions, each with its own database, it’s no wonder business leaders complain their data is siloed. Imagine now that a CEO wants to understand the relationship between data in these disparate systems. Suddenly, she’s looking at the world’s most confusing dashboard, all the while wondering: Can I trust this information? The CEO placates herself with the knowledge that at least she has data to look at, but in the end, it creates more questions than answers.
If you’re in a competitive industry (which we all are), it’s high time that CEOs take their data analysis to the next level. How to do it? Three data complexities are at the core of every leader’s challenge to gain business advantages from their data:
Do you have trouble seeing your data at all? Are you mentally scanning your systems and realizing just how many databases you have? An enterprise organization I know was recently collecting reams of data from their industrial operations but couldn’t derive the data’s value due to the siloed nature of their datacenter database. The data couldn’t reach any dashboard in a meaningful way. It is a common problem. With enterprise data doubling every few years, it takes modern tools and strategies to keep up.
The company referenced above began solving this problem by defining the business purpose of their industrial data – to predict demand in the coming months so they didn’t have a shortfall. That business purpose, which had team buy-in at multiple corporate levels, drove the entire engagement. It allowed them to keep the technology simple and focus on the outcome.
One month into the project, they had clean, trustworthy, valuable data in a dashboard. Their data was unlocked from the database and published.
Siloed data takes some elbow grease to access, but it becomes a lot easier if you have a goal in mind for the data. It cuts through the noise and helps you make decisions more easily if you know where you are going.
Do you have trouble trusting your data? You have a dashboard, yet you’re pretty sure the data is wrong, or lots of it is missing. You can’t act on it because you hesitate to trust it. Data trustworthiness is a prerequisite for making your data action-oriented. But most data has problems – missing values, invalid dates, duplicate values, and meaningless entries. If you don’t trust the numbers, you’re better off without the data.
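To make those problems concrete, here is a minimal sketch of how a few of them might be counted in a small batch of records. The field names (`order_id`, `order_date`, `amount`), the date format, and the sample rows are all hypothetical, chosen only for illustration; a real pipeline would define its own rules for what counts as invalid.

```python
from datetime import datetime


def quality_report(rows, date_field="order_date", key_field="order_id"):
    """Count a few common data problems in a list of dict records."""
    report = {"missing_values": 0, "invalid_dates": 0, "duplicates": 0}
    seen_keys = set()
    for row in rows:
        # Missing values: any field that is None or blank.
        if any(v is None or str(v).strip() == "" for v in row.values()):
            report["missing_values"] += 1

        # Invalid dates: assumes ISO-style YYYY-MM-DD strings (an assumption).
        raw_date = row.get(date_field)
        if raw_date:
            try:
                datetime.strptime(raw_date, "%Y-%m-%d")
            except ValueError:
                report["invalid_dates"] += 1

        # Duplicates: the same key seen more than once.
        key = row.get(key_field)
        if key in seen_keys:
            report["duplicates"] += 1
        else:
            seen_keys.add(key)
    return report


# Hypothetical sample data with one problem of each kind.
sample = [
    {"order_id": "1", "order_date": "2021-03-01", "amount": "19.99"},
    {"order_id": "2", "order_date": "2021-02-30", "amount": "5.00"},  # no such date
    {"order_id": "2", "order_date": "2021-03-02", "amount": "5.00"},  # repeated order_id
    {"order_id": "3", "order_date": "2021-03-03", "amount": ""},      # missing amount
]

report = quality_report(sample)
print(report)
```

A report like this does not fix anything on its own, but it turns "I'm pretty sure the data is wrong" into a number you can track.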
Data is there for you to act on, so you should be able to trust it. One key strategy is to not bog down your team with maintaining systems, but rather use simple, maintainable cloud-based systems that use modern tools to make your dashboard real.
Often you don’t even have the data you need to decide. “No data” comes in many forms:
- You don’t track it. For example, you’re an eCommerce company that wants to understand how email campaigns can help your sales, but you don’t have a customer email list.
- You track it but you can’t access it. For example, you start collecting emails from customers, but your email SaaS system doesn’t let you export your emails. Your data is so siloed that it effectively doesn’t exist for analysis.
- You track it but need to do some calculations before you can use it. For example, you have a full customer email list, a list of product purchases, and you just need to join the two together. This is a great place to be and is where we see most companies.
Getting past “no data” means finding patterns and insights not just within datasets, but across datasets. This is only possible with a modern, cloud-native data lake.
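The third case above, joining a customer email list to a list of product purchases, can be sketched in a few lines. The record layout and the `customer_id` key are assumptions made for this example; your own systems may link customers and purchases differently.

```python
from collections import defaultdict

# Hypothetical customer email list.
customers = [
    {"customer_id": 1, "email": "ana@example.com"},
    {"customer_id": 2, "email": "ben@example.com"},
]

# Hypothetical purchase list from a separate system.
purchases = [
    {"customer_id": 1, "product": "kettle"},
    {"customer_id": 1, "product": "teapot"},
    {"customer_id": 3, "product": "mug"},  # no matching customer record
]

# Index the customers by the shared key so each lookup is cheap.
email_by_id = {c["customer_id"]: c["email"] for c in customers}

products_by_email = defaultdict(list)
unmatched = []
for purchase in purchases:
    email = email_by_id.get(purchase["customer_id"])
    if email is None:
        # Keep purchases that cannot be joined; they point to a data gap.
        unmatched.append(purchase)
    else:
        products_by_email[email].append(purchase["product"])

print(dict(products_by_email))
print(unmatched)
```

Keeping the unmatched rows visible, rather than silently dropping them, shows how much of the data still cannot be connected.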
The Data Lake
Step one for any data project – today, tomorrow, and forever – is to define your business need.
Do you need to understand your customer better? Whether it is click behavior, email campaign engagement, order history, or customer service, your customer generates more data today than ever before, and the data can give you clues as to what she cares about.
Do you need to understand your costs better? Most enterprises have hundreds of SaaS applications generating data from internal operations. Whether it is manufacturing, purchasing, supply chain, finance, engineering, or customer service, your organization is generating data at a rapid pace.
Don’t be overwhelmed. You can cut through the noise by defining your business case.
The second step in your data project is to take that business case and make it real in a cloud-native data lake. Yes, a data lake. I know the term has been abused over the years, but a data lake is very simple; it’s a way to centrally store all (all!) of your organization’s data, cheaply, in open formats to make it easy to access from any direction.
Data lakes used to be expensive, difficult to manage, and bulky. Now, all major cloud providers (AWS, Azure, GCP) have established best practices to keep storage dirt-cheap and data accessible and very flexible to work with. But data lakes are still hard to implement and require specialized, focused knowledge of data architecture.
How Does A Data Lake Solve The Above Problems?
- Data lakes de-silo your data. Since the data stored in your data lake is all in the same spot, in open formats like JSON and CSV, there aren’t any technological walls to overcome. You can query everything in your data lake from a single SQL client. If you can’t, then that data is not in your data lake and you should bring it in.
- Data lakes give you visibility into data quality. Modern data lakes and expert consultants build in a variety of checks for data validation, completeness, lineage, and schema drift. These are all important concepts that together tell you if your data is valuable or garbage. These sorts of patterns work together nicely in a modern, cloud-native data lake.
- Data lakes welcome data from anywhere and allow for flexible analysis across your entire data catalog. If you can format your data into CSV, JSON, or XML, then you can put it in your data lake. This solves the problem of “no data.” It is very easy to create the relevant data, either by finding it in your organization or engineering it by analyzing across your data sets. An example would be joining data from Sales (your CRM) and Customer Service (Zendesk) to find out which product category has the best or worst customer satisfaction scores.
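As a rough illustration of querying across sources from one SQL client, the sketch below loads a CSV export and a JSON export into a single engine and computes average satisfaction per product category. An in-memory SQLite database stands in for the data lake's query engine here; a real cloud data lake would use the provider's own services. The table names, column names, sample rows, and the idea that support tickets carry an `order_id` are all assumptions for the example.

```python
import csv
import io
import json
import sqlite3

# Hypothetical CRM export (CSV) and customer service export (JSON).
sales_csv = """order_id,product_category
1,Laptops
2,Laptops
3,Phones
"""
tickets_json = """[
  {"order_id": 1, "satisfaction": 5},
  {"order_id": 2, "satisfaction": 3},
  {"order_id": 3, "satisfaction": 2}
]"""

# SQLite is only a stand-in for the lake's SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (order_id INTEGER, product_category TEXT)")
conn.execute("CREATE TABLE tickets (order_id INTEGER, satisfaction INTEGER)")

# Both formats end up as ordinary tables that can be queried together.
conn.executemany(
    "INSERT INTO sales VALUES (:order_id, :product_category)",
    csv.DictReader(io.StringIO(sales_csv)),
)
conn.executemany(
    "INSERT INTO tickets VALUES (:order_id, :satisfaction)",
    json.loads(tickets_json),
)

# One query across both datasets: average satisfaction per category.
rows = conn.execute(
    """
    SELECT s.product_category, AVG(t.satisfaction) AS avg_score
    FROM sales AS s
    JOIN tickets AS t ON t.order_id = s.order_id
    GROUP BY s.product_category
    ORDER BY avg_score DESC
    """
).fetchall()

for category, avg_score in rows:
    print(category, avg_score)
```

The point is less the specific engine than the shape of the work: once both exports sit in one place in simple formats, the cross-dataset question becomes a single query.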
If you’re struggling with one of these three core data issues, the solution is to start with a crisp definition of your business need, and then build a data lake to execute on that need. A data lake is just a central repository for flexible and cheap data storage. If you focus on keeping your data lake simple and geared toward the analysis you need for your business, these three core data problems will be a thing of the past.
About The Author
Robert Whelan is a Data Engineering & Analytics Practice Manager at 2nd Watch.