Operationalizing Machine Learning for GDPR

An unstable physical system will tend toward equilibrium. With its General Data Protection Regulation, Europe is subjecting United States companies to a regulation that threatens some of their core business models and data infrastructures. The tension between these two regions will eventually settle, and when that happens, the ways we approach data and think about privacy will be fundamentally altered.

At the heart of GDPR instability there’s a basic disagreement. In the U.S., we generally trust corporations over the government, and we will sacrifice privacy for convenience. Consider the big American idea to productize location data and build Google Maps.

Europeans, however, generally trust the government over corporations. That’s why they’re willing to leave their passports at a hotel overnight, so the hotel can report on guest information to the government. It’s a concept that horrifies many Americans when we cross the Atlantic.

GDPR is like a giant pendulum swinging over the Atlantic between these opposing belief systems. Time will tell where this pendulum will ultimately rest.

The first recital of GDPR states, “The protection of natural persons in relation to the processing of personal data is a fundamental right.” The European regulators asserting this right have broadened its scope to cover the personal data of people in the EU wherever that data resides. The result is that American companies are subject to sanctions under the EU regime on an ongoing basis.

The broad scope is also built into the language of the regulation. Every possible data action is governed: “basic principles for processing,” “data subjects rights” in 11 separate articles, from the “right to object” to the “right to be forgotten;” and all “transfers of personal data.” And the fines are as enormous as the scope: up to 20 million euros or 4% of a company’s total worldwide annual turnover for the preceding financial year, whichever is higher.

It will be difficult to overstate the impacts that this regulation will have on American businesses, many of which are based on a premise that privacy can be sacrificed for convenience. Take away the premise, and the business model goes too.

“Senator, we sell ads,” Mark Zuckerberg responded when Senator Hatch asked him to explain Facebook’s business model. If Facebook can’t leverage personal data to target ads, it becomes endangered.

Facebook might be fine, but many smaller ad tech vendors, which rely on targeting-based services, may struggle. As part of its GDPR compliance strategy, Google implemented user ID encryption, which limits an advertiser’s ability to follow and measure user data. This limitation is thought to be the main reason why the ad tech firm AppNexus exited via an acquisition by AT&T.

Other companies in this space, like the ad tech firm Verve, have shuttered their European operations. Julie Bernard, the Chief Marketing Officer at Verve, told CNN that “the regulatory environment is not favorable to our business model.”

As the GDPR pendulum swings between the poles of privacy and convenience, there will be two forces at play. One is regulatory coercion, and the other is a kind of contagion. Sure enough, as the unstable cross-Atlantic system seeks equilibrium, the notion of privacy as a right is spreading. This year, California, square one for data innovation, is following suit with its California Consumer Privacy Act (CCPA) that has been nicknamed “GDPR-lite.”

As organizations contend with life after GDPR, many of them are finding that the only thing they need to change about their data strategy is… everything. GDPR is compelling organizations to adopt two new processes. One is reactive, and the other is proactive.

From reactive to proactive

In the reactive mode, businesses are shoring up investments to take on Data Subject Access Requests (DSARs). Recital 63 states that “a data subject should have the right of access to personal data which have been collected concerning him or her, and to exercise that right easily and at reasonable intervals.”

DSARs aren’t entirely new, but the major regulations governing them are, and these new regulations—GDPR and CCPA—come with new demands, like addressing real-time DSARs and new 30-day response time requirements.

The same new program that companies put into place to comply with a DSAR will also need to contend with something else: GDPR data breach response. When a data breach has been discovered, a 72-hour countdown begins. In that time period, an organization must investigate the breach, notify regulators and affected individuals, identify the personal data that has been impacted, and create a containment plan.
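The countdown itself is simple arithmetic, and a small sketch shows how a response team might track it. This is an illustration under assumptions, not a compliance tool: the function names are hypothetical, and the only fact taken from the text above is the 72-hour window that starts when the breach is discovered.

```python
from datetime import datetime, timedelta, timezone

# The 72-hour window described above, starting at discovery of the breach.
NOTIFICATION_WINDOW = timedelta(hours=72)


def notification_deadline(discovered_at: datetime) -> datetime:
    """Return the moment the 72-hour window closes."""
    if discovered_at.tzinfo is None:
        # Naive timestamps are ambiguous across regions, so reject them.
        raise ValueError("discovered_at must be timezone-aware")
    return discovered_at + NOTIFICATION_WINDOW


def hours_remaining(discovered_at: datetime, now: datetime) -> float:
    """Return the hours left before the deadline, never below zero."""
    remaining = notification_deadline(discovered_at) - now
    return max(remaining.total_seconds() / 3600, 0.0)


# Hypothetical breach discovered on 1 March 2019 at 09:00 UTC.
discovered = datetime(2019, 3, 1, 9, 0, tzinfo=timezone.utc)
checkpoint = datetime(2019, 3, 2, 21, 0, tzinfo=timezone.utc)

print(notification_deadline(discovered).isoformat())
print(hours_remaining(discovered, checkpoint))
```

Requiring timezone-aware timestamps is a deliberate choice here, since a breach team spread across the Atlantic cannot afford to argue about which clock the deadline runs on.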

The challenges of meeting these reactive modes are enormous, because organizations are bringing old data processes to new data challenges. They are combining manual review teams with decades-old search technology to identify potentially compromised personal data.

This status quo leaves organizations to sift through a large number of false positives as time runs out. Teams of humans will manually identify and record personal data, a process that is error-prone, inefficient, and expensive. Then there are challenges with normalizing data and conforming dimensions. For example, a team will need to reduce varied and inconsistent personal data in disconnected data systems to one individual.
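To make the normalization problem concrete, here is a minimal sketch rather than a production approach. It assumes every record carries an email address and treats the trimmed, lower-cased email as the key for one individual. The record layout and system names are hypothetical. Real entity resolution also has to cope with missing keys, typos, and name variants, which is where manual review tends to bog down.

```python
from collections import defaultdict


def normalize_email(email: str) -> str:
    """Trim whitespace and lower-case so trivial variants compare equal."""
    return email.strip().lower()


def group_by_individual(records):
    """Group records from disconnected systems under one normalized key."""
    groups = defaultdict(list)
    for record in records:
        groups[normalize_email(record["email"])].append(record)
    return dict(groups)


# Hypothetical records for the same person, stored inconsistently.
records = [
    {"system": "crm", "name": "Jane Doe", "email": "Jane.Doe@Example.com"},
    {"system": "billing", "name": "DOE, JANE", "email": " jane.doe@example.com "},
    {"system": "support", "name": "J. Smith", "email": "jsmith@example.com"},
]

individuals = group_by_individual(records)
for key, matches in individuals.items():
    print(key, [m["system"] for m in matches])
```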

While they’re building these new reactive processes, organizations are also building proactive processes. It is like rewiring the building while the electricity is still on.

Being proactive under GDPR means building a personal data inventory at a massive scale. The inventory is only as valuable as it is understandable, which means recording who is processing what and where, whether that’s in structured data like relational databases, or unstructured data, like PDFs in loosely organized systems.

For example, Article 30 requires organizations to maintain a record of processing activities that can be readily available, in case a supervisory authority requests to review those records.
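One way to picture such a record is as a small, queryable data structure. The sketch below is an assumption-laden illustration: the field names are hypothetical and only loosely echo the "who is processing what and where" framing above. It is not a rendering of the fields Article 30 actually enumerates.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessingRecord:
    """One hypothetical entry in a personal data inventory."""
    system: str       # where the data lives
    processor: str    # who is processing it
    purpose: str      # why it is being processed
    storage: str      # "structured" or "unstructured"
    data_categories: list = field(default_factory=list)  # what is processed


def records_touching(inventory, category):
    """Return every record that processes the given category of data."""
    return [r for r in inventory if category in r.data_categories]


# Hypothetical inventory spanning a database and a document store.
inventory = [
    ProcessingRecord("hr_database", "HR team", "payroll", "structured",
                     ["salary", "birthday"]),
    ProcessingRecord("shared_drive", "Legal team", "contracts", "unstructured",
                     ["salary", "medical history"]),
    ProcessingRecord("ad_platform", "Marketing team", "targeting", "structured",
                     ["location"]),
]

for record in records_touching(inventory, "salary"):
    print(record.system, record.processor)
```

Keeping the inventory in a form that can be filtered like this is what makes it "readily available" in practice: a request about one category of data becomes a lookup rather than a research project.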

Today, large organizations are managing their data subjects across different environments in different places. Personal data may be stored in on-premises data stores, in the cloud, or both. It may be stored in divergent warehouses, data lakes, and Internet of Things devices. Understanding this data requires knowing how it is being transferred, consumed, and distributed, but also how the data platforms are being managed and accessed. Then organizations need to procure or develop analytical abilities that can intelligently search across a number of divergent platforms and conform the data down to a standardized and interpretable structure.

Underneath the complexity of these systems, there’s an even deeper complexity. The list of what counts as personal data under GDPR is broad and fluid: biographical information like birthdays, appearance information like eye color, workplace data like salary, subjective data like religion, and health data like medical history. Human beings are leaving artifacts of this free-flowing data in their interactions.

The new processes that organizations put into place will also need to allow for a new kind of interface between humans and data systems, so organizations can rapidly identify and act on personal data wherever it is, at speed and at scale. For example, organizations will need to be able to normalize personal data across disparate systems. They’ll also need to be able to redact this data at the data subject’s request. Many organizations will also want some sort of capability for pseudonymization, which attributes a unique but fictional identifier (like an integer) to each redacted individual in a dataset.
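Pseudonymization as described above can be sketched in a few lines. This is a minimal illustration under assumptions, not a vetted privacy control: the class and field names are hypothetical, the email address is assumed to identify the individual, and the lookup table would need to be stored separately and protected, because anyone holding it can reverse the mapping.

```python
from itertools import count


class Pseudonymizer:
    """Assign each individual a stable, fictional integer identifier."""

    def __init__(self):
        self._ids = {}            # normalized identity -> pseudonym
        self._next_id = count(1)  # integers handed out in order: 1, 2, 3, ...

    def pseudonym_for(self, identity: str) -> int:
        key = identity.strip().lower()
        if key not in self._ids:
            self._ids[key] = next(self._next_id)
        return self._ids[key]


def pseudonymize(rows, pseudonymizer, identity_field="email", drop_fields=("name",)):
    """Replace direct identifiers in each row with a pseudonymous subject_id."""
    result = []
    for row in rows:
        cleaned = {
            k: v for k, v in row.items()
            if k != identity_field and k not in drop_fields
        }
        cleaned["subject_id"] = pseudonymizer.pseudonym_for(row[identity_field])
        result.append(cleaned)
    return result


# Hypothetical purchase log; the first and third rows are the same person.
rows = [
    {"name": "Jane Doe", "email": "jane@example.com", "purchase": "book"},
    {"name": "John Roe", "email": "john@example.com", "purchase": "lamp"},
    {"name": "J. Doe", "email": "JANE@example.com", "purchase": "pen"},
]

pseudonymized = pseudonymize(rows, Pseudonymizer())
for row in pseudonymized:
    print(row)
```

Because the same person always maps to the same integer, analysts can still count purchases per individual without ever seeing a name or an email address.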

Bridging both with AI

As the gap widens between data and understanding, privacy demands are exposing out-of-date strategies and systems. Organizations are shoring up resources to face the changing tides. How can they operationalize a new privacy compliance process to meet the parallel demands of these two modes: reactive and proactive?

Their only hope is to leverage technology that is powered by artificial intelligence. There’s no other analytical capability that can understand context at the scale and speed of today’s risk. AI has the potential to ingest unstructured data at an enormous scale, and transform it into understanding, so organizations can consolidate their data into a unified and intelligent view.

AI started out as a thing of research. Now it has become a thing of infrastructure, like running water. The utility of today’s AI is thanks to a combination of recent factors: the advent of big data, rapidly increasing compute power, new processing techniques, and better algorithms. These factors are coming together in a perfect storm, and mainstream adoption is the result.

Late last year, McKinsey surveyed 2,000 executives across 10 industries, and found that 47% of companies have embedded AI in their business processes. This represents a rapid increase in adoption: a 2017 study found that just 20% of respondents were using AI in a core part of their business.

In the realm of privacy, a crop of AI-powered companies has appeared, and I serve as an advisor to one of them – Text IQ – which builds AI for sensitive information and has launched a suite of solutions for privacy.

As the privacy-convenience tension stretches toward equilibrium, organizations are sorting through a lot of generalized excitement around AI, and seeking to “bring it in.” With the right advanced capabilities, these companies can choose to view GDPR as an opportunity rather than a threat. The data requirements of this sweeping regulation can provide a useful forcing function to transform a data strategy from the ground up.

By choosing the right artificial intelligence partner, organizations can ensure that this new data strategy is purpose-built to meet the mounting demands of privacy, as we approach the coming innovation inflection points of the 2020s.
