Is some data more personal and private? Are there some personal details that we would prefer to share only selectively? The answer is clearly yes. With data-centric models becoming central to how enterprises operate and compete, can they also hold themselves accountable for respecting these granular personal choices?
This new accountability for sensitive personal information results from two forces: regulatory requirements that cover more types of data in greater detail and extend consumer data rights, and growing consumer awareness of data privacy concerns.
Although there has been a proliferation of privacy management vendors, most of these tools focus on how to facilitate compliance reporting through workflows and management dashboards. Most stop short of providing the context needed to understand privacy risk based on data sensitivity, even if they perform some degree of data discovery.
Achieving consistent accuracy, contextual understanding, and ongoing insights in order to make the distinction between highly personal, sensitive data and PII has emerged as a key challenge to balancing data-driven enterprise strategies and emerging data privacy considerations.
This is the challenge that Text IQ is focused on tackling: using composite AI for accuracy in the identification of PI at scale in unstructured data, natural language processing for contextual understanding, and continuous learning through human feedback to enhance precision for data sensitivity.
What has changed? The second wave of privacy regulations
The California Consumer Privacy Act (CCPA), signed in 2018 and in effect since 2020, was the US’s first state-level data privacy regulation. Since then, 16 states have followed suit with privacy laws on the docket.
But it’s more than a numbers game: the laws, in general, expand the scope of what data is covered and introduce a new set of reporting, accountability, and privacy protection obligations for sensitive personal information, like financial account information, login credentials, health data, or precise geolocation, that can be tied to a person.
It’s no coincidence that the California Privacy Rights Act (CPRA), which will come into effect in early 2023, is often referred to as CCPA 2.0.
CCPA introduced to the US the definition of personal information as data or information that could be linked, associated, or related to an individual (along the lines of the EU GDPR) - in contrast to a unique PII identifier like a Social Security number.
Now, CPRA will include a broader definition of sensitive data than even the EU GDPR to include government-issued identifiers, account log-in credentials, financial account information, precise geolocation, contents of certain types of messages, genetic data, racial or ethnic origin, religious beliefs, biometrics, health data, and data concerning sex life or sexual orientation.
CPRA will mandate that consumers are able to limit the use and disclosure of their sensitive personal information that an enterprise collects and shares (often referred to as an ‘opt out right’) - a significantly more complex undertaking than reporting on data categories for data subject access rights (DSAR) reports. And if an enterprise wants to use that information for purposes other than those for which it was first collected, it will need the consumer’s authorization.
What else has changed? More emphasis on data as a business asset, and more of it
Even as the privacy component of data privacy is subject to more regulatory complexity, the data component has become increasingly central to enterprise strategies - as the meteoric rise of data platforms like Snowflake and Databricks illustrates.
Do data privacy concerns run headlong into the investments that companies have made into being data-driven?
Yes - in two senses. First, enterprises should be able to enforce data privacy policies that govern who among their internal teams, partners, and providers can access what data, and how. Second, the new wave of regulations could prove a barrier to innovation and insights if enterprises cannot effectively determine what data can be used, even with the right set of protection tools and processes in place.
Equally, with the ability to assess data privacy sensitivity at scale, enterprises can in turn effectively implement protection tools like privacy-preserving analytics or de-identification. By contrast, manual data governance processes that rely on techniques like manual labeling will never be able to scale and provide the accuracy needed to facilitate data-driven strategies.
What has not changed? Current tech to determine how sensitive and personal data is
When legacy vendors or platform providers talk about finding PII, they are talking about technology used to identify four or five common identifiers, like Social Security numbers, dates of birth, account numbers, driver’s license numbers, and potentially health insurance numbers. Even for these commonly used identifiers, however, enterprises struggle with false positives - instances where the technology incorrectly identifies a string of numbers or text as PI.
If looking for a string of characters is inefficient and inaccurate for a small number of data elements that follow a consistent pattern, then this traditional approach will prove wholly ineffective with sensitive data that can take multiple forms.
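To make the limitation concrete, here is a minimal Python sketch of the string-matching approach. The pattern and the sample sentences are assumptions made up for illustration; they are not taken from any particular vendor's product.

```python
import re

# Hypothetical naive detector: any nine digits in a 3-2-4 grouping,
# with optional dashes, is treated as a Social Security number.
SSN_PATTERN = re.compile(r"\b\d{3}-?\d{2}-?\d{4}\b")

# Invented sample text for illustration only.
samples = [
    "Employee SSN: 123-45-6789",                            # real identifier
    "Order reference 987-65-4321 shipped Monday",           # not an SSN
    "Tracking number 123456789 was delivered",              # not an SSN
    "Patient was diagnosed with type 2 diabetes in March",  # sensitive, no pattern
]


def find_ssn_like(text):
    """Return every substring that matches the naive SSN pattern."""
    return SSN_PATTERN.findall(text)


for sample in samples:
    print(f"{find_ssn_like(sample)!r:<18} {sample}")
```

The second and third sentences are flagged even though they contain an order reference and a tracking number, which is the false-positive problem. The fourth sentence contains health data but nothing a fixed pattern can match, so it is missed entirely.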
And, while sensitive data may be stored in a labeled column or field in a structured database, those labels may be incorrect, and they are only useful for the shrinking proportion of enterprise data stored in these data sources.
How can it change? AI for PI
Given the complexity of the problem, a more sophisticated and layered approach is needed. In the same way that AI can be used to glean insights from business data, AI techniques can be used to better understand context through content analysis as well as infer data relationships for unstructured data.
By taking a composite approach that starts with more accurate identification of common, vertical, or business-specific identifiers and then linking the identifiers to individuals through a socio-linguistic hypergraph, enterprises have a more solid foundation in place for baseline data privacy compliance.
To take the next step to both better assess data sensitivity and deliver automated identification in unstructured data, Text IQ employs a composite AI approach combining multiple techniques like natural language processing, entity normalization, and unsupervised machine learning.
Add custom detectors with ease.
The benefits of the approach are:
- Enhance accuracy for PII and PI identification, as well as for custom classifiers, through AI detectors for unstructured data
- Generate and maintain a human-centric index based on entity profiles and data relationships
- Drive continuous learning that integrates human feedback on the accuracy of identifiers as well as input on entity-data linkages
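As a rough illustration of what a human-centric index could look like, the sketch below groups detected data elements under a normalized person name and reports which sensitive categories each person has. It is a deliberately simplified stand-in, not Text IQ's socio-linguistic hypergraph or its entity normalization: the function names, the category labels, and the sample detections are all hypothetical.

```python
from collections import defaultdict

# Hypothetical category labels, loosely based on the CPRA examples above.
SENSITIVE_CATEGORIES = {"health", "geolocation", "financial_account", "login_credentials"}


def normalize_name(raw):
    """Crude normalization: collapse whitespace, lowercase, 'Last, First' -> 'first last'."""
    name = " ".join(raw.split()).lower()
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
        name = f"{first} {last}"
    return name


def build_index(detections):
    """Group (person, category, value) detections into person -> category -> values."""
    index = defaultdict(lambda: defaultdict(set))
    for person, category, value in detections:
        index[normalize_name(person)][category].add(value)
    return index


def sensitive_for(index, person):
    """Return the sensitive categories held for one person (empty set if unknown)."""
    return set(index.get(normalize_name(person), {})) & SENSITIVE_CATEGORIES


# Invented detections, as if produced by upstream detectors.
detections = [
    ("Jane Doe", "email", "jane.doe@example.com"),
    ("DOE, Jane", "health", "asthma diagnosis"),
    ("  jane   doe ", "geolocation", "37.7749,-122.4194"),
    ("John Roe", "email", "john.roe@example.com"),
]

index = build_index(detections)
print(sorted(sensitive_for(index, "Jane Doe")))
print(sorted(sensitive_for(index, "John Roe")))
```

Organizing the index by person rather than by file or column is what lets a limit-use request for one individual be answered in a single lookup. A production system would need far more robust entity resolution than this string normalization.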
Screenshot of the Text IQ PI detectors at work.
The outcome is that enterprises no longer need to compromise with the limitations of binary identification processes that can’t understand the nuances of sensitive data or models that are mostly static and can only be trained on a limited set of inputs.
AI that is designed to understand and learn what is personal and sensitive data can help enterprises bridge the divide.
To learn more about Text IQ’s AI for privacy risk visit here.