What if, instead of opening files to figure out if they contain privileged or sensitive information, you could simply look at your filesystem and tell at a glance what files include what kinds of information? Or, better yet, what if we could automate this system so that we can create a comprehensive information governance strategy that marks and categorizes data in real time as it’s created?
An effective tool for supporting information governance strategies at scale, and with a high degree of automation, is unsupervised machine learning (UML) technology that adds valuable context to inform how data is used and safeguarded across the information lifecycle. For instance, metadata enrichment can assign relevant tags that eliminate the guesswork about whether a document is responsive in the context of a legal matter. It can add business context, such as which department the document deals with (accounting, HR, etc.), and it can determine whether specific regulations such as the CCPA or the NY SHIELD Act apply because the data contains personally identifiable information (PII) or protected health information (PHI), and much more.
As more enterprises realize the extent to which both consumer expectations and proliferating data privacy regulations will impact how they govern their data, deriving insight from metadata can become key to data-driven business strategies. Ideally, these insights derived from metadata are in a format that can be easily consumed by existing enterprise data management tools, data catalogs and enforcement tools like data masking to reinforce their value (and the investments already made in the tools).
One of the areas where metadata enrichment can play a particularly pivotal role is in providing a layer of structure to unstructured data. Compared to structured data, where existing tools can perform passive metadata inventory on predefined structures, such as columns, tables, and fields, unstructured data - as the name implies - lacks any predefined structure.
However, enabling enterprises to scale and automate governance for unstructured data is especially pressing given the unprecedented data volumes that we’re now generating. It’s true that there’s no data like more data, but without a comprehensive plan for realizing value from that raw data, these digital treasure troves remain untapped - or become a significant risk if not appropriately governed.
In this article, we’re going to present an innovative technology that we’ve developed to help organizations ensure that their data is an asset and not a liability. We’ll begin with a short introduction to what metadata is, and then we’ll give a brief overview of how we enrich it. We’ll conclude by demonstrating how enterprises can use metadata enrichment for use cases such as compliance, litigation, and analytics.
What is Metadata?
Looking just at the word’s etymology, it’s clear that metadata relates to data in the same way that metaphysics relates to physics or metacognition to cognition. Simply put, metadata is data about data.
According to the Research Data Management program at UC Berkeley, “Metadata helps you and others understand data so it can be accessed, found, understood, and preserved over time. For many kinds of data, standard schemas exist that facilitate data description and sharing.”
Common metadata fields include the data’s type (video, text, audio, etc.), when and by whom the data was created, what software created it, how big it is, and how it’s encoded. Moreover, we generally break metadata down into three categories: technical metadata, related to the data’s structure, location, and format; operational metadata, which includes details like when the data was last updated or modified; and business metadata, such as the data owner’s identity.
Check it out for yourself. If you’re on a Windows PC, open the File Explorer, right click on any file, and click “Properties” in the drop-down menu to see some of the file’s metadata. On Mac or Linux, it’s as simple as opening the terminal and running the command “ls -l”.
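If you’d rather read those same fields from a script, here is a minimal sketch using only the Python standard library. The helper name `basic_metadata` and the choice of fields are our own illustrative assumptions; the sketch writes a throwaway file so that it runs anywhere.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def basic_metadata(path):
    """Return a few technical and operational metadata fields for one file.

    Hypothetical helper for illustration; the field names are assumptions.
    """
    p = Path(path)
    st = p.stat()
    return {
        "name": p.name,
        "extension": p.suffix.lower(),
        "size_bytes": st.st_size,
        # Last-modified time, converted to an ISO 8601 string in UTC.
        "modified_utc": datetime.fromtimestamp(st.st_mtime, tz=timezone.utc).isoformat(),
        # Permission bits only (most meaningful on Mac and Linux).
        "permissions": oct(st.st_mode & 0o777),
    }


# Demonstrate on a temporary file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "report.txt"
    sample.write_text("quarterly numbers", encoding="utf-8")
    meta = basic_metadata(sample)
    print(meta)
```

Notice that nothing in this output says what the file is about, which is the gap that enrichment is meant to fill.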
While metadata is already used for system administration tasks, such as deleting expired data to comply with regulations, these basic types of metadata still don’t provide enough context about what the data is actually about or whether it fits into some of the higher-level categories that we discussed above.
Enriching the Metadata: The Basics
The process of enriching metadata isn’t so different from enriching flour. Just as we add nutrients such as thiamine, folic acid, and iron to make our bread more nutritious without altering the bread itself, metadata enrichment likewise makes our data more useful without changing anything about it. The process of deriving insights from metadata - in contrast to extracting and inventorying metadata for reference as has been done until now - results in what is known as ‘active’ metadata.
Here’s a basic rundown of how it works for unstructured data. We can start by defining labels, such as confidential, public, or internal, and then run our machine learning algorithms to group large volumes of our data into these classes. In this way, we can sort our data into practically any number of categories with the intent of creating more organization to focus governance and risk mitigation efforts. At a point when many enterprises have simply thrown their hands up in response to the growth of their unstructured data, this seemingly straightforward step can quickly have a positive impact on governance.
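To make the label-first approach concrete, here is a toy sketch rather than a description of our production system. It assumes a tiny Naive Bayes text classifier trained on a handful of hand-labeled example snippets, which stands in for whatever model is actually used. The class name `TinyNaiveBayes` and the training snippets are invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter(labels)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        total_docs = sum(self.label_counts.values())
        scores = {}
        for label, n_docs in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(doc):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Hypothetical hand-labeled examples for the three labels named above.
training = [
    ("salary review and social security number for employee", "confidential"),
    ("attorney client privileged legal memo", "confidential"),
    ("press release announcing new product launch", "public"),
    ("public blog post about company event", "public"),
    ("team meeting agenda and lunch schedule", "internal"),
    ("internal wiki page on office printer setup", "internal"),
]
model = TinyNaiveBayes().fit([t for t, _ in training], [l for _, l in training])

# The predicted label becomes an enriched metadata tag; the file is untouched.
tag = {"file": "memo_0412.txt",
       "sensitivity": model.predict("privileged memo from attorney about salary")}
print(tag)
```

A real deployment would need far more training data and a proper evaluation step, but the shape is the same: the content drives a label, and the label is stored alongside the file as metadata.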
The other way to go about it is to run the algorithm without first defining the labels. The AI will use clustering to automatically and independently detect patterns within the data. It then sorts the data into groups, which a human can then look at and define. In this way, we’re able to detect potential categories that we never would have discovered on our own - or that would require so much manual input that finding them would take years.
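The label-free route can be sketched just as briefly. The example below is an assumption-laden toy, not our actual algorithm: it uses a greedy single-pass ("leader") clustering over word-set Jaccard similarity, chosen only because it fits in a few lines of standard-library Python. The function names, the `threshold` value, and the sample snippets are all hypothetical.

```python
import re


def tokens(text):
    """Represent a document as the set of its lowercase words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Share of words two documents have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def leader_cluster(docs, threshold=0.2):
    """Greedy single-pass clustering.

    Each document joins the first cluster whose first member (the leader)
    is similar enough; otherwise it starts a new cluster.
    """
    clusters = []  # list of (leader_tokens, [document indices])
    for i, doc in enumerate(docs):
        t = tokens(doc)
        for leader, members in clusters:
            if jaccard(t, leader) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((t, [i]))
    return [members for _, members in clusters]


docs = [
    "invoice payment due for vendor account",
    "patient diagnosis and treatment record",
    "vendor invoice overdue payment reminder",
    "treatment plan for patient after diagnosis",
]
groups = leader_cluster(docs)
print(groups)  # lists of document indices, one list per discovered group
```

The algorithm only returns unnamed groups. Deciding that one group looks like accounting records and the other like health information is the step that a person performs.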
A key takeaway of this process is that people play a crucial role in leveraging metadata enrichment for governance outcomes. While the algorithm can automate the rote work of going through each and every file, the real value comes from the operator’s critical thinking and ability to assign correct and meaningful labels to the different categories. The machine learning process can perform the task of identifying content and context in a fraction of the time that humans could, and then automatically propagate the label with a business, policy or regulatory meaning to thousands or even millions of files. The outcome is visibility into large volumes of unstructured data for ongoing governance.
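The propagation step is simple enough to sketch in a few lines. Assume, purely for illustration, that the machine has already grouped files into numbered clusters and that a reviewer has named some of them. The function `propagate_labels`, the file paths, and the `"unreviewed"` fallback are hypothetical.

```python
def propagate_labels(cluster_members, cluster_labels, default="unreviewed"):
    """Copy each cluster's human-assigned label to every file in that cluster."""
    tags = {}
    for cluster_id, files in cluster_members.items():
        label = cluster_labels.get(cluster_id, default)
        for path in files:
            tags[path] = label
    return tags


# Machine output: which files ended up in which group (hypothetical data).
cluster_members = {
    0: ["ap/invoice_001.pdf", "ap/invoice_002.pdf", "ap/reminder_17.pdf"],
    1: ["clinic/intake_form.docx", "clinic/treatment_plan.docx"],
    2: ["misc/notes.txt"],
}

# Human input: a reviewer looked at samples from clusters 0 and 1 only.
cluster_labels = {0: "accounting", 1: "PHI"}

tags = propagate_labels(cluster_members, cluster_labels)
for path, label in tags.items():
    print(f"{path}: {label}")
```

Two human decisions here label five files; at enterprise scale the same two decisions could label millions, while anything not yet reviewed stays clearly marked as such.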
It’s this symbiotic relationship between the user’s crucial insight and the machine’s unparalleled ability to churn through huge datasets that’s at the crux of metadata enrichment.
Metadata Enrichment for Proactive Information Governance
Metadata enrichment has many use cases, but the one where we’re seeing our enterprise clients gain the most value is with proactive information governance. Instead of constantly reacting to situations as they arise—or to fires as they need to be put out—companies can use metadata enrichment as a tool to prevent those fires in the first place.
Ultimately, this technology forms a layer of intelligence that sits on top of unstructured data. We can then tap into it by running data analytics, conducting searches, or writing software to interact with it. In this way, metadata enrichment sets the stage for predictive insights based on active metadata and preventative data maintenance because it gives us a simple way to immediately answer many of the most pertinent questions that we could ask of our data.
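As a rough illustration of what tapping into that layer can look like, the sketch below treats enriched metadata as a list of plain records and runs a search, a simple count, and a JSON export against it. The record fields (`department`, `contains_pii`, `sensitivity`), the file paths, and the `search` helper are assumptions made for this example, not a real schema or API.

```python
import json
from collections import Counter

# Hypothetical enriched-metadata records; field names are illustrative only.
records = [
    {"path": "hr/offer_letter.docx", "department": "HR",
     "contains_pii": True, "sensitivity": "confidential"},
    {"path": "finance/q3_forecast.xlsx", "department": "accounting",
     "contains_pii": False, "sensitivity": "internal"},
    {"path": "marketing/press_release.pdf", "department": "marketing",
     "contains_pii": False, "sensitivity": "public"},
    {"path": "hr/benefits_enrollment.csv", "department": "HR",
     "contains_pii": True, "sensitivity": "confidential"},
]


def search(records, **criteria):
    """Return the records whose tags match every given key=value pair."""
    return [r for r in records
            if all(r.get(key) == value for key, value in criteria.items())]


# Search: which files contain PII and may fall under privacy regulations?
pii_files = [r["path"] for r in search(records, contains_pii=True)]

# Analytics: how many files does each department own?
by_department = Counter(r["department"] for r in records)

# Export: JSON is one format that downstream tools can commonly ingest.
exported = json.dumps(records, indent=2)

print(pii_files)
print(dict(by_department))
```

None of these questions required opening a single file, because the answers were already recorded as tags.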
That’s how it speeds up the process of preparing for litigation, reduces the risk of failed compliance, and enables big data analytics for informing better decision-making.
Text IQ views metadata enrichment as just one part of a holistic enterprise AI strategy. Not only do we enrich the metadata, but our AI brain also learns the unique attributes of your organization’s data, making our technology faster, smarter, and more accurate over time.