Using Data Catalogs to Deliver Actionable Insights for Businesses

Data catalogs were built for governance. To serve up analytics, they need a few enhancements, including semantic search, a virtualized management layer and pre-analytics visualization. 

Without data catalogs, we’d have a governance nightmare, particularly with the advent of data lakes, which collect vast amounts of highly differentiated data types without requiring a schema. But before we rest on our laurels, we should consider the fundamental reasons we’re collecting this data in the first place. One purpose is, of course, compliance with ever-increasing regulatory measures. But the more fundamental goal is that companies hope to derive intelligence from their data to generate revenue.

Data catalogs are great at what they do: organizing and identifying data for governance. Their ability to provide overarching structure to an architecture that might include data lakes, databases, and data warehouses is particularly beneficial. But they’re not designed to answer questions.

If you already know the names of the tables you’re looking for, or if a previous user has provided a detailed description of what a dataset contains, you might get lucky and find it quickly. Without that information, it is difficult for an analyst to determine which tables can yield answers to the questions they’re tasked with addressing. The process is so labor-intensive that it can take up to 50 percent of an analyst’s time, according to 451 Group’s Voice of the Enterprise report.

Data cataloging is a means to an end, but it isn’t the end. In this case, the ‘end’ comes in the form of actionable answers to real-world questions like “At what point should we consider a customer in danger of ‘churning’ and instruct our sales team to reach out?” or “Why did our Latin American sales slump in Q4?”

Data catalogs can provide the foundation for a data-driven organization if they are supported by a few additional capabilities. To understand these capabilities, let’s look at the process from the perspective of an analyst, who regularly faces the following obstacles:

Sifting Through Similar Data Sets

A fundamental objective of data warehousing was to provide a ‘single version of the truth’. Yet most data architectures still struggle with version control: the same data may be organized under different schemas in different tables, formatting may be inconsistent, or one table may contain a less complete version of the dataset.

Determining which is best suited for analysis is typically a manual process that lengthens data preparation. For example, an analyst who needs to know how average sales compare across departments in Germany might need to join tables for “sales” and “departments” on their “region” attribute. If multiple candidate tables exist, deciding which to join involves a time-consuming review of which requires the least reformatting, contains the fewest NULL values, and so on.

This time sink could be avoided by supplementing catalogs with the ability to automatically render information about schemas, data completeness, and so on, alongside descriptions tagged by previous users of the data, allowing analysts to make quick comparisons before committing to specific tables.
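
To make this concrete, here is a minimal sketch, in Python with pandas, of the kind of side-by-side profile a catalog could render automatically. The file names and tables are hypothetical; the point is only the comparison an analyst would want to see before committing to one ‘version of the truth’.

```python
import pandas as pd

def profile_table(df: pd.DataFrame, name: str) -> pd.Series:
    """Summarize the properties an analyst checks before committing to a table."""
    return pd.Series({
        "table": name,
        "rows": len(df),
        "columns": df.shape[1],
        "null_cells_pct": round(df.isna().mean().mean() * 100, 1),
        "duplicate_rows": int(df.duplicated().sum()),
        "dtypes": ", ".join(sorted({str(t) for t in df.dtypes})),
    })

# Hypothetical candidate tables that both claim to hold department-level sales.
sales_v1 = pd.read_csv("sales_by_department.csv")
sales_v2 = pd.read_csv("sales_by_department_full.csv")

comparison = pd.DataFrame([
    profile_table(sales_v1, "sales_by_department"),
    profile_table(sales_v2, "sales_by_department_full"),
])
print(comparison.to_string(index=False))
```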

Dealing With Today’s Heavily Distributed and Siloed Data Architectures

Of chief concern to analysts is that the tables required to model many business decisions are not collected in the same system. When a table that exists in Oracle needs to be joined with one in a data lake, assembling them is far more complicated than assembling tables from the same database, or even from databases that reside in the same system. Data catalogs tell you where the data is, but they don’t move it for you, and moving it can be slow and can create security risks.

But you don’t need to move data to analyze it. The Trino project (until recently known as PrestoSQL) makes it possible to execute SQL queries across distributed environments. With a single query, a dataset can be assembled from tables in entirely different systems (Hadoop and MySQL, for example) through a virtual layer that appears to the user as if the data resides in the same system. Combined with a data catalog, this provides powerful capabilities for locating and retrieving data for analysis.
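
As a sketch of what such a federated query looks like in practice, the snippet below uses Trino’s Python client to join a table in a Hive-backed data lake with one in MySQL in a single statement. The host, catalog, schema, and table names are placeholders; the catalogs themselves must already be configured on the Trino cluster.

```python
import trino  # Trino's Python client (pip install trino)

# Connection details are placeholders for an existing Trino deployment.
conn = trino.dbapi.connect(
    host="trino.example.com",
    port=8080,
    user="analyst",
)
cur = conn.cursor()

# One SQL statement spanning two systems: sales facts in the data lake ("hive"
# catalog) and department reference data in an operational MySQL database.
cur.execute("""
    SELECT d.department_name,
           AVG(s.amount) AS avg_sale
    FROM hive.analytics.sales AS s
    JOIN mysql.erp.departments AS d
      ON s.department_id = d.department_id
    WHERE d.region = 'Germany'
    GROUP BY d.department_name
    ORDER BY avg_sale DESC
""")

for department_name, avg_sale in cur.fetchall():
    print(department_name, avg_sale)
```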

Moving From Keyword to Semantic Search

Analysts frequently run into the limitations of traditional keyword search, which is the basis of most data retrieval in data catalogs. The functionality is much like searching the files on your Mac or PC: you get everything that contains those keywords, including plenty of results you don’t need.

Natural language processing, however, can be used to add ‘intent’ to search, so that the system retrieves not just data that happens to contain the same words, but data that answers the question being asked. Ideally, it can also suggest relationships that might not be apparent from the kind of tagged information data catalogs currently provide.
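
One way to approximate this today, outside the catalog itself, is to embed table descriptions and the analyst’s question in the same vector space and rank by similarity. The sketch below uses the sentence-transformers library; the model choice and the catalog descriptions are purely illustrative.

```python
from sentence_transformers import SentenceTransformer, util

# Table descriptions as they might appear in a catalog; purely illustrative.
table_descriptions = {
    "crm.accounts": "Customer accounts with signup date, plan tier, and status",
    "crm.support_tickets": "Support tickets with open/close timestamps and severity",
    "sales.orders": "Individual orders with amount, currency, and sales region",
    "sales.quota_attainment": "Quarterly quota attainment per sales rep",
}

model = SentenceTransformer("all-MiniLM-L6-v2")  # a small general-purpose model
question = "Which customers look like they are about to churn?"

question_emb = model.encode(question, convert_to_tensor=True)
description_embs = model.encode(list(table_descriptions.values()), convert_to_tensor=True)

# Rank tables by semantic similarity to the question rather than keyword overlap.
scores = util.cos_sim(question_emb, description_embs)[0]
ranked = sorted(zip(table_descriptions, scores.tolist()), key=lambda pair: -pair[1])
for table, score in ranked:
    print(f"{score:.2f}  {table}")
```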

Learn From the Past: Creating a History of Analyst Queries

Often, rather than retracing steps previously taken to answer a similar question, data analysts are forced to recreate the entire process from scratch. As we discussed, most data catalogs lack intent-based search, so there’s no way to chronicle a history of questions and answers.

The answer is to automatically chronicle questions and their associated SQL queries and map them back to the datasets that were previously deemed useful. But what about questions that are similar to, but not the same as, ones asked in the past? Such a system needs to let analysts fine-tune queries to their exact specifications, including filtering out unnecessary data (such as nationwide data when one is looking for patterns only in a specific region, such as the “Southwest”).
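
As a rough sketch of the kind of history such a system might keep, the snippet below logs each question alongside its SQL and source tables, then surfaces similar past questions as starting points to fine-tune. Plain string matching via difflib stands in for the intent-based matching described above; the questions, queries, and table names are invented for the example.

```python
import difflib
from dataclasses import dataclass, field

@dataclass
class AnsweredQuestion:
    question: str
    sql: str
    tables: list[str]

@dataclass
class QueryHistory:
    entries: list[AnsweredQuestion] = field(default_factory=list)

    def record(self, question: str, sql: str, tables: list[str]) -> None:
        self.entries.append(AnsweredQuestion(question, sql, tables))

    def find_similar(self, question: str, cutoff: float = 0.4) -> list[AnsweredQuestion]:
        """Return past questions that roughly match, as starting points to fine-tune."""
        matches = difflib.get_close_matches(
            question, [e.question for e in self.entries], n=3, cutoff=cutoff
        )
        return [e for e in self.entries if e.question in matches]

history = QueryHistory()
history.record(
    question="Why did our Latin American sales slump in Q4?",
    sql="SELECT region, SUM(amount) FROM sales.orders WHERE quarter = 'Q4' GROUP BY region",
    tables=["sales.orders"],
)

for past in history.find_similar("Why did Southwest sales slump in Q3?"):
    print(past.question, "->", past.tables)
```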

Is the Data Worth It?

Before an analyst or data scientist does anything with a dataset, they take an initial look to determine its basic shape and characteristics. In some cases, this helps determine whether the data will work with a specific model, or whether the data is viable for analysis at all. For instance, does the data conform to a distribution that is close to ‘normal’, or is it heavily skewed? Is it parametric (conforming to a known distribution) and appropriate for linear modeling, or will it require a non-parametric model such as random forests?
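
As an illustration of that first look, here is a minimal sketch using SciPy to check skewness and normality on a numeric column. The synthetic data and the decision thresholds are assumptions for the example, not prescriptions.

```python
import numpy as np
from scipy import stats

# Stand-in for a numeric column pulled from a candidate dataset.
rng = np.random.default_rng(0)
order_amounts = rng.lognormal(mean=3.0, sigma=0.8, size=5_000)  # deliberately skewed

skewness = stats.skew(order_amounts)
_, normality_p = stats.normaltest(order_amounts)  # D'Agostino-Pearson normality test

print(f"skewness: {skewness:.2f}")
print(f"p-value for normality: {normality_p:.4f}")

# A heavily skewed, clearly non-normal column suggests that models built on
# normality assumptions may struggle, and that a transformation (such as a log
# scale) or a non-parametric model is worth considering.
if abs(skewness) > 1 or normality_p < 0.05:
    print("Consider a transformation or a non-parametric model.")
```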

Basic visualizations such as a boxplot can help an analyst quickly spot gaps in the data, a large number of outliers, or a dataset that may contain inaccuracies. Currently, the analyst must upload the data to an analysis tool like Looker or Google Colab to obtain such fundamental insights, but if they can do this from a data catalog or other data management platform, they can skip several time-consuming steps. In particular, they can quickly compare similar datasets and make sound judgments about the best ‘version of the truth’.
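
Here is a small matplotlib sketch of that kind of pre-analytics quick look: a boxplot comparing two hypothetical versions of the same table, one of which contains a suspicious cluster of outliers. The data is synthetic and only stands in for what a catalog could render next to each dataset.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for two candidate versions of the same sales table.
rng = np.random.default_rng(1)
sales_v1 = rng.normal(loc=500, scale=80, size=1_000)
sales_v2 = np.concatenate([
    rng.normal(loc=500, scale=80, size=950),
    rng.normal(loc=5_000, scale=300, size=50),  # a suspicious cluster of outliers
])

fig, ax = plt.subplots()
ax.boxplot([sales_v1, sales_v2])
ax.set_xticks([1, 2], ["sales_v1", "sales_v2"])
ax.set_ylabel("order amount")
ax.set_title("Pre-analytics quick look at two candidate tables")
plt.show()
```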

Do you think it is possible to use data catalogs for analytics? Let us know your thoughts on LinkedIn, Twitter, or Facebook. We would love to hear from you!