A data catalog is defined as the inventory of all data assets in an organization. It helps data professionals find the most relevant data for any analytical or business purpose. A data catalog uses metadata to create an informative and searchable inventory of all data assets in an organization. This article discusses the definition of a data catalog, the process of building one, and the top 10 best practices for data cataloging in 2021.
A data catalog is the inventory of all data assets in an organization that helps data professionals find the most relevant data for any analytical or business purpose. It provides the information necessary to evaluate the fitness of data for intended uses and helps analysts and other data users find the target data they need for specific purposes.
Let’s consider the analogy of a library.
When you want to find out whether a particular book is available in a library, you generally use the library catalog. Along with its availability, the catalog also tells you about the book’s edition and location. In short, the catalog gives you various details of the book to decide whether you want it. And in case you do, it tells you how to find it. This is a basic offering of many object stores, databases, and data warehouses today.
Let us now expand the power of that library catalog to cover every library in the country. Imagine for a moment that you have a single user interface (UI) on which you can find every library in the country that has a copy of the book you're seeking. You can also find all the details you would ever want about each of those books on that same UI.
This is exactly what a data catalog does for all of your organizational data. It gives you a single and comprehensive view with visibility into all your data, rather than just a single data store at a time.
Recent research conducted by Aberdeen Strategy & Research demonstrates that data cataloging empowers users with analytical ability, which, in turn, drives business performance. Users with a data catalog not only report an increase in the total customer base but also an improvement in satisfaction among existing customers.
Data catalog metadata subjects
In today’s age of big data and self-service analytics, data catalogs have become pivotal for metadata management. The metadata of the modern age is much more expansive than metadata of the business intelligence (BI) era.
Data Catalog Users Drive Enhanced Business Execution
Source: Aberdeen Strategy & Research
As per Aberdeen's research, today's companies deal with data environments that are growing in excess of 30% year over year, some much higher than that. Data catalog tools enable data teams to locate, understand, and utilize data more efficiently by organizing data from multiple sources on a centralized platform.
A data catalog primarily focuses on datasets (i.e., the inventory of available data) and then connects those datasets with rich information to keep the people who manage and use the data informed. A data catalog has the following metadata subjects at its core:
Let’s look at each metadata subject in detail:
1. Dataset metadata
Datasets are the files and tables that are accessed by organization personnel. These may reside in a data lake, warehouse, master data repository, or any other shared data resource.
2. People metadata
This describes the people who work with data, including consumers, curators, stewards, subject matter experts, etc.
3. Search metadata
This metadata supports tagging and keywords to help people find data.
4. Processing metadata
This category elaborates the various transformations and derivations that are applied as data is managed throughout its lifecycle.
5. Supplier metadata
Supplier metadata covers data acquired from external sources, describing those sources and any subscription or licensing constraints associated with the data.
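These five subjects can be pictured as fields on a single catalog record. The sketch below is illustrative only; the field names and the `CatalogEntry` class are invented for this example, not a standard schema:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """One illustrative catalog record covering the five metadata subjects."""
    dataset: str                                      # 1. dataset: table or file name
    people: dict = field(default_factory=dict)        # 2. people: role -> contact
    search_tags: list = field(default_factory=list)   # 3. search: tags and keywords
    processing: list = field(default_factory=list)    # 4. processing: transformations applied
    supplier: str = ""                                # 5. supplier: external source / licence info

entry = CatalogEntry(
    dataset="warehouse.orders",
    people={"steward": "jane@example.com", "technical_owner": "ops@example.com"},
    search_tags=["sales", "orders"],
    processing=["deduplicated", "currency normalized to USD"],
    supplier="",  # internally produced, so no external licensing constraints
)
print(entry.dataset)
```

Keeping all five subjects on one record is what lets a catalog answer "what is this data, who owns it, and where did it come from" in a single lookup.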
There are five steps to building an effective data catalog. Let’s look at each step in detail:
1. Capture data
Building a data catalog calls for capturing all your data. To ensure the collection of the right data, two questions need to be answered: which metadata to capture and how to capture it?
Let’s address them one at a time.
Which metadata to capture?
Populating the data catalog with the shape, structure, and semantics of your data is the first step in building a data catalog. Most data users such as data scientists, data engineers, business analysts, and others refer to data in terms of the schema or table where data resides. Consider the following questions and answers as examples:
- Where can I find customers who have purchased at least one item?
Check the “cust_purchases” table
- How are invoices generated?
An invoice has one or more orders in it. Check the “invoices” and “orders” tables for data. In case an invoice has been paid, you can find the payment in the “payments” table.
Today, streaming data and non-tabular data (e.g., JSON, Parquet structs) are seen everywhere, and their volume is visibly growing at an increasing rate. Even if you do not use these technologies today, look for a data catalog that supports nested data structures and allows you to integrate streaming technologies in the future.
Finally, an effective data catalog must be able to capture data lineage. Data lineage enables users to see where the data came from and the trajectory of the data. This is critical to providing context that users often need when using data.
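Lineage can be modeled as a directed graph from source assets to derived assets. A minimal sketch of upstream traversal, using made-up dataset names:

```python
# Upstream lineage: each asset maps to the assets it is derived from.
lineage = {
    "report.monthly_revenue": ["warehouse.invoices"],
    "warehouse.invoices": ["raw.orders", "raw.payments"],
}

def upstream(asset, graph):
    """Return every asset that feeds into `asset`, directly or transitively."""
    seen, stack = set(), list(graph.get(asset, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(graph.get(parent, []))
    return seen

print(upstream("report.monthly_revenue", lineage))
# raw.orders and raw.payments surface even though the report only lists invoices
```

The transitive walk is the point: a user looking at the revenue report can trace it all the way back to the raw order and payment feeds.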
How to capture the metadata?
Once the data catalog is built, you will want a tool that can easily populate the catalog on your behalf. This saves considerable time as it avoids manual updating of every database, table, and field in your data ecosystem. All major databases and data stores (e.g., AWS S3) have APIs available that allow you to extract the metadata that represents the shape and semantics of your data. Hence, you should consider the ability to automatically populate your metadata when building your data catalog.
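The exact calls differ per system (information_schema views in many SQL databases, ListObjects-style APIs for object stores), but the idea is the same everywhere: pull the schema from the system itself rather than typing it in. As an illustration only, here is the idea against an in-memory SQLite database using Python's standard library:

```python
import sqlite3

# Stand-in for a real production database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cust_purchases (customer_id INTEGER, item TEXT, price REAL)")

def extract_table_metadata(conn, table):
    """Pull column names and types from the database itself, not by hand."""
    # PRAGMA table_info rows: (cid, name, type, notnull, default, pk)
    cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return {name: col_type for (_, name, col_type, *_) in cols}

schema = extract_table_metadata(conn, "cust_purchases")
print(schema)
```

Automating this step means a new column shows up in the catalog the next time the extractor runs, with no human in the loop.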
There are scenarios where you may not be able to connect directly to your database. Consider, for example, that you do not want to expose sensitive data or you are using a managed database that is not publicly available. In such scenarios, you should be able to use sample files and extracts from your data store as an alternative to a direct connection to your database.
In the worst case, when everything else fails, you should be able to quickly capture metadata on your own, without automation. Given how frequently the client libraries of disparate databases change, no process or tool can be guaranteed to work perfectly. Therefore, having the option to fix problems manually is critical for building a robust data catalog.
2. Assign points of contact
After building a data catalog, it is important to identify the key people for each data asset. Assigning owners to your data assets allows users with additional questions to reach out to the right individual.
The questions of various data users typically fall into two categories:
- The business context of the data asset (e.g., “What does Null mean for this field?”)
- The technical attributes of the data asset (e.g., “Who can add a new field to this schema?”)
A data catalog may have many types of owners (e.g., data steward, technical owner, business owner, executive owner, etc.). However, the data steward and the technical owner play an important role. The data steward enables your users to know who to go to for all business-related information. Meanwhile, the technical owner has answers to tech-oriented questions that data users may have.
As you create a data catalog, you may assign tasks to your owners. These tasks are intended to ensure that your data catalog is well documented and useful to other teammates.
3. Document every interaction
As you begin documenting your data in a data catalog, the amount of information you wish to capture may seem overwhelming at first. Suppose you have two databases, and each database has a few dozen tables. Each table further has a handful of fields. At this moment, it appears that you are already looking at a few thousand data assets.
Hence, you can start by picking a single methodology and slowly adding documentation over time. This will ensure that you achieve a certain coverage percentage, perhaps 90% or less, within a few months. Some common methodologies include:
- Whenever you learn about it, document it
Everyone should take responsibility for updating the data catalog when they learn something new that has not been documented yet.
- As and when code changes occur, change the documentation
As teams release new features, concerned team members should update the data documentation.
- Set aside time for team members
Ask each of your team members to spend one hour a week, or perhaps 15 minutes each morning on the data catalog. This will allow them to add new documentation for the data assets they know well or research the ones they do not know.
All data assets should have rich-text documentation within the data catalog to give users the ability to highlight key points. Data catalogs should also provide users the ability to group assets in common sets. This can happen via tagging the data. For example, if you want to be able to see a report on all of your personally identifiable information (PII), you could tag all of your tables and fields that contain such data with “PII”.
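Tag-based grouping can be as simple as a mapping from asset to tags, with a reverse lookup for reports. The asset names below are hypothetical:

```python
# Asset -> set of tags, as a user-maintained mapping.
tags = {
    "warehouse.customers.email": {"PII"},
    "warehouse.customers.signup_date": set(),
    "warehouse.payments.card_last4": {"PII", "financial"},
}

def assets_with_tag(tags, wanted):
    """Return every cataloged asset carrying the given tag."""
    return sorted(asset for asset, asset_tags in tags.items() if wanted in asset_tags)

pii_report = assets_with_tag(tags, "PII")
print(pii_report)
```

A compliance report on all PII-bearing fields then becomes a one-line query rather than a manual audit.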
Moreover, when your data catalog allows your users to have conversations about your data, you unlock the power of documentation. When a user has a question about data and that question is eventually answered, the question, the answer, and the conversation that led to the answer should all be documented within the catalog.
This permits the next data user with a similar question to be able to view the previous conversation and understand the context around the answer. This saves time as countless conversations that repeat the same set of questions and answers would be documented. For example:
- Person A: How do I connect to the database from my PC?
- Person B: You just need to log in to the VPN, and you can point directly to the database host. (documented)
In this example, Person A can refer to Person B’s documented answer for the required solution.
4. Ensure that the data catalog is up-to-date
One of the major challenges faced by organizations is to keep the data catalog fresh. Developers generally change the structure of databases once in a while and often create new pipelines. Data scientists and business analysts generally create data cubes or move data between analytical environments to create new dashboards just as frequently. Citing these patterns, your data catalog should automatically identify these changes where possible and update itself accordingly.
To ensure that the data catalog is fresh, some user interaction to double-check the quality and staleness of the information is important. Your data catalog can use governance actions to push your users to take action when they think that the underlying documentation may be old or obsolete.
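One way to surface stale entries is to diff the catalog's recorded schema against a fresh snapshot from the live system. Both dictionaries below are invented examples:

```python
def schema_drift(cataloged, live):
    """Compare the catalog's recorded schema with a live snapshot."""
    return {
        "missing_from_catalog": sorted(set(live) - set(cataloged)),
        "removed_from_source": sorted(set(cataloged) - set(live)),
    }

# What the catalog thinks the table looks like vs. what the database reports.
cataloged = {"customer_id": "INTEGER", "item": "TEXT"}
live      = {"customer_id": "INTEGER", "item": "TEXT", "discount": "REAL"}

drift = schema_drift(cataloged, live)
print(drift)  # flags the undocumented "discount" column
```

Running a check like this on a schedule turns "is the catalog fresh?" from a vague worry into a concrete list of assets needing attention.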
5. Optimize according to the need
Every company uses a data catalog according to their requirements and needs. So, you need to set standards and norms for the way you want your organization to utilize the data catalog. It is important to note here that the way your team plans to use the data catalog will highly influence how you capture documentation. Therefore, if you do not know how your team will use the data catalog, it is highly likely that the time you spend documenting your data will lead to inadequate results.
Some common practices that your team can do to optimize your interactions with a data catalog:
- Set standardized documentation formats and use them across databases, schemas, fields, and data lineage.
- Determine key learning modules and tag the assets included in each learning module with a common theme.
- Emphasize team norms on the usage of the data catalog. This will deeply embed the data culture amongst the team members.
Data catalogs can be powerful platforms for data management. However, without a proper data cataloging methodology, the power and features of data catalogs can go in vain. With that in mind, here are the top 10 best practices for data cataloging in 2021.
Best Practices for Data Cataloging in 2021
1. Add everything to your inventory
Data is everywhere: text files, spreadsheets, and many more. Although the data may be scattered, you can’t even begin to address the data issue until you’ve inventoried everything. Everyone in the team should be trained to think about all the places where their data may be nestled. Then ensure that every piece of that discrete data is cataloged.
2. Manage data flows
Data lineage and provenance tools are good, but most of them map out data flow only within a known domain or set of domains. A good data catalog, one that’s backed by data flow discovery, will often identify flows between disparate datasets. Such an arrangement helps you discover data movement within your organization that may not be well-known. These flows can then be checked for validity. Hence, managing data flows is a good practice for building an effective data catalog.
3. Prioritize sensitive data
One of the main purposes of an effective data catalog is to help identify the location of sensitive data. In scenarios where the same sensitive data is found in multiple places, it can help identify redundant data. Thus, managing sensitive and redundant data allows you to minimize the surface area for breaches and establish robust data protection against any external attack.
4. Consider unstructured data
Unstructured data (documents, web pages, email, social media content, mobile data, images, audio, and video) is data that does not conform to a data model and has no easily identifiable structure, making it a poor fit for a mainstream relational database. That said, your data catalog can help make implicit data structures explicit. This can be achieved by re-designing the overall data structure based on team or organizational requirements. Hence, considering ‘unstructured’ data is vital for any data catalog.
5. Assign discoverable names and descriptions
A good name and a verbose description will make your data more discoverable by concerned team members. A description can indicate alternate names for the same object and help build out a comprehensive data ontology.
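Good names and descriptions pay off directly in search. A toy keyword search over catalog entries, matching both names and descriptions (the entries themselves are illustrative):

```python
# Asset name -> free-text description, as it might appear in a catalog.
catalog = {
    "warehouse.cust_purchases": "Customer purchases; also known as sales line items.",
    "warehouse.invoices": "Invoices grouping one or more orders.",
}

def search(catalog, query):
    """Match the query against both asset names and their descriptions."""
    q = query.lower()
    return sorted(
        name for name, description in catalog.items()
        if q in name.lower() or q in description.lower()
    )

print(search(catalog, "sales"))  # found via the alternate name in the description
```

Note that "sales" appears nowhere in the table name; the verbose description is what makes the asset findable.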
6. Treat data lake tables differently
In relational databases, data may be spread across multiple tables. However, data lakes tend to crowd lots of data into individual files. In the business intelligence area, a single dataset may store measures and dimensions together rather than separately. This is true even for systems that represent data as tables in a database. This can make the data less discoverable, but data catalogs address this problem head-on.
7. Provide transparent ratings
Crowd-sourced ratings, endorsements, and negative ratings in your data catalog can help users get relevant and reliable information in a faster way. But this calls for stringent standards. Data shouldn’t get a five-star rating unless it meets a very high-standard benchmark. Likewise, good data shouldn’t be rated poorly. Users need confidence in the ratings, or they won’t trust them. Hence, an organization should ensure that the standards are uniform and precise.
8. Make it a lake, not a swamp
Cataloging everything in your data lake allows you to organize it and make it usable. Once your lake is cataloged, you can establish zones within it and make it a go-to place for business users to get data, not just a place for them to dump it.
9. Employ rules for data validation
English descriptions in a data catalog are important as they help record and circulate knowledge that would otherwise remain undocumented among business users. This also requires the involvement of technologists, as strict data validation rules can help verify whether data matches catalog definitions. Such a process assures data quality and acts as a check against more qualitative star ratings. Therefore, employing streamlined validation rules in the data catalog instills trust among the data users.
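Validation rules can be attached to catalog field definitions and run against sample rows. A minimal sketch, with made-up rules and field names:

```python
# Catalog definitions: field -> a predicate the data must satisfy.
rules = {
    "customer_id": lambda v: isinstance(v, int) and v > 0,
    "email": lambda v: isinstance(v, str) and "@" in v,
}

def validate_row(row, rules):
    """Return the fields whose values violate their catalog definition."""
    return sorted(f for f, check in rules.items() if f in row and not check(row[f]))

bad = validate_row({"customer_id": -3, "email": "jane@example.com"}, rules)
print(bad)  # customer_id fails its rule
```

Because the rules live next to the field definitions, a failed check points directly at a mismatch between the data and what the catalog claims about it.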
10. Leverage machine learning techniques
Manual cataloging is practically impossible today owing to ever-increasing data volumes: cataloging will simply never finish, or even keep pace, as new data arrives. However, machine learning (ML) is a promising tool for getting the volume problem under control.
ML models can identify data types and relationships. This helps build out your catalog across more datasets. It also propagates data tags across more objects more quickly than a manual catalog. Hence, if your data catalog doesn’t leverage ML on the actual data, you may face enormous headwinds in your data-driven journey.
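Real catalogs use trained models for this. As a rule-based stand-in that conveys the idea, a classifier can guess a field's semantic type from sample values; the patterns below are deliberately simplistic and not production-grade:

```python
import re

def infer_semantic_type(samples):
    """Guess a field's semantic type from sample values (toy heuristic, not ML)."""
    if samples and all(re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", s) for s in samples):
        return "email"
    if samples and all(re.fullmatch(r"\d{4}-\d{2}-\d{2}", s) for s in samples):
        return "date"
    return "unknown"

print(infer_semantic_type(["a@x.com", "b@y.org"]))        # email
print(infer_semantic_type(["2021-01-05", "2021-02-11"]))  # date
```

A trained model replaces the hand-written patterns with learned ones, but the catalog-side workflow is the same: sample the data, infer a type, and propagate the resulting tag.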
In summary, a data catalog is a guidebook to your data that is organized in a manner that makes sense to you, your team, and your business. With a streamlined approach, you’ll be in a position to manage, govern, and utilize your data to its fullest potential. The above top practices should give you a good head start on the data catalog path.
Data catalogs play a critical role in an organization’s journey to achieving data intelligence. They are an important factor in driving revenue, optimizing operational efficiency, and promoting innovation and growth. Now that you’re aware of the significance of a data catalog, we hope you deploy one that best suits your business needs.