Text IQ Series | Revisiting Automatic Redaction: An Introduction to Automatic Redaction with AI

Generally regarded as a necessary evil, redactions plague industries such as legal, life sciences, and financial services with countless hours and costs associated with the tedious, mundane work of omitting privileged, protected, and personal information from documents. In this installment of Revisiting Automatic Redaction, we take a closer look at why redaction is necessary and how AI can help automate this arduous process.

How many labor hours have you devoted to redaction in the last year? 

Whether you’re personally going through the mountain of digital documentation or paying someone to do it for you, you already know two key facts. First, this tedious process can take a long time. Second, there must be a better way. We’re happy to share the good news: we’ve built one.

Before we get ahead of ourselves, let’s talk about redaction itself and the challenges arising from the unprecedented volumes of data generated in the information age. Redaction is the process of obscuring or removing information, such as text or pictures, from a document or database prior to publication or release. Organizations redact information that is classified, sensitive, or potentially harmful if leaked, such as trade secrets, financials, or personally identifiable information of customers or employees.

Redaction protects an organization and the people it serves. However, as we collect and store more data, redaction has become increasingly difficult. The process takes longer, and it’s challenging to maintain high accuracy levels and consistency.

We need a modern solution for a modern problem.

Sensitive Digital Data

Data is exploding, and a significant portion of that volume is sensitive data that we must protect. We can break this down into three main categories.

  1. Organizational information - Classified governmental data is a great example, but everything from trade secrets to past judicial records to other legal information falls into this category.
  2. Personally identifiable information (PII) - This is protected by regulations like the US Privacy Act of 1974 and the EU GDPR. The US General Services Administration (GSA) defines it as “information that can be used to distinguish or trace an individual’s identity, either alone or when combined with other personal or identifying information that is linked or linkable to a specific individual.” While broad in scope, this category contains info such as names, addresses, Social Security numbers, birthdays, etc.
  3. Protected health information (PHI) - This includes any personal information contained in a patient’s medical record and obtained during the course of medical treatment that, if disclosed, could reveal the identity of the patient. PHI is protected by federal regulations like HIPAA in addition to state-level protections such as those in California. PHI includes items like medical record numbers, fingerprints, treatment histories, and more.

The takeaway is that we have massive sensitive datasets that, if released for whatever reason, can lead to serious consequences like huge fines and reputational damage.

Legacy Redaction Methods

Before the digital age, redaction meant inking over or cutting out information from a document, and—believe it or not—there are still some people who print out their documents, redact by hand, and then scan them back into their computer systems. It goes without saying that this is wildly inefficient.

More recent methods involve redaction software that provides a user interface for searching through documents and making redactions. For instance, Adobe Acrobat Pro lets us search through PDFs to redact sensitive information. While we can look for keywords or even black out every instance of specific words or phrases, we still face serious problems.

Not only does it take a lot of time to redact private info from tens of thousands of emails or from databases in the terabyte range, but it’s also easy to miss something. We might search for the keyword “address” and miss the typos “adress” or “addres” made during data entry, or we may not find the phone number in the 123-456-7890 format among all the numbers in the (123) 456-7890 format. And that doesn’t even begin to address codewords, exceptions, or other irregularities.
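
To make the problem concrete, here is a minimal Python sketch of rule-based matching. The sample text, patterns, and function names are illustrative assumptions rather than part of any product. It shows an exact keyword search and a single-format phone rule missing items that a looser rule or a fuzzy match picks up.

```python
import re
from difflib import get_close_matches

# Hypothetical sample text containing a typo and two phone formats.
TEXT = "Home adress: 12 Elm St. Call (123) 456-7890 or 123-456-7890."

# A strict rule that only knows the "(123) 456-7890" format.
STRICT_PHONE = re.compile(r"\(\d{3}\) \d{3}-\d{4}")

# A looser rule: optional parentheses, and a space, dot, or hyphen as separator.
LOOSE_PHONE = re.compile(r"\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}")


def find_phones(pattern, text):
    """Return every substring of `text` that matches `pattern`."""
    return pattern.findall(text)


def find_keyword_variants(keyword, text, cutoff=0.8):
    """Return words in `text` that are spelled similarly to `keyword`."""
    words = re.findall(r"[a-z]+", text.lower())
    return get_close_matches(keyword, words, n=5, cutoff=cutoff)


def redact(pattern, text, mask="[REDACTED]"):
    """Replace every match of `pattern` in `text` with `mask`."""
    return pattern.sub(mask, text)


print("address" in TEXT.lower())               # False: the exact search misses the typo
print(find_keyword_variants("address", TEXT))  # ['adress']
print(find_phones(STRICT_PHONE, TEXT))         # ['(123) 456-7890']
print(find_phones(LOOSE_PHONE, TEXT))          # ['(123) 456-7890', '123-456-7890']
print(redact(LOOSE_PHONE, TEXT))
```

The looser rule catches more, but it is still only a rule: it will happily match any ten-digit number in a similar shape, and every new irregularity requires another hand-written pattern.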

Even if we do program an application for auto-redaction, we may solve the time issue, but we don’t come close to addressing the accuracy problem. The simple fact is that our data is way too messy for rule-based programs to handle. This leaves us in a bind: we can’t rely on dumb software to get the job done, but we also can’t afford for people to go through it manually.

That’s why we need a smarter computer program.

AI for Automatic Redaction

In May of 2020, Text IQ launched AI auto-redaction, a product that “uses machine learning to go beyond regular expressions and to understand context.” This has already been a huge game changer for our partners, and so we want to share a bit about what it is and why it works.

It all starts with unstructured data. Human beings are great at understanding emails, text documents, instant messages, photos, audio files, and other types of unlabeled data, but rule-based applications struggle to make sense of it. That’s why we use a type of AI called self-supervised learning to parse it. These algorithms begin training on clean data, like the type that a search function could easily find, and then use that learning to expand out into the chaotic noise that makes up most of our datasets. Instead of relying on us to tell it the rules, self-supervised learning figures them out on its own.
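
The idea of starting from easy, well-formatted examples and generalizing outward can be illustrated with a deliberately tiny Python sketch. To be clear, this is a toy bootstrapping heuristic, not self-supervised learning and not a description of how Auto-Redact works. Every pattern, function name, and sample string below is an assumption made for illustration. A strict pattern finds "clean" examples, the sketch records which words tend to appear just before them, and it then uses those words to flag messier candidates the strict pattern cannot see.

```python
import re
from collections import Counter

# Strict pattern: the easy, well-formatted cases a plain search would find.
CLEAN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Broad pattern: spans of 9 to 13 characters that start and end with a digit,
# allowing digits, spaces, dots, or hyphens in between.
CANDIDATE = re.compile(r"\b\d[\d .-]{7,11}\d\b")


def context_words(text, start, window=2):
    """Return up to `window` words (3+ letters) that precede position `start`."""
    return re.findall(r"[a-z]{3,}", text[:start].lower())[-window:]


def learn_context(docs):
    """Count the words that appear just before each clean match."""
    counts = Counter()
    for doc in docs:
        for match in CLEAN.finditer(doc):
            counts.update(context_words(doc, match.start()))
    return counts


def find_sensitive(doc, learned, min_score=1):
    """Flag broad candidates whose preceding words were seen near clean matches."""
    hits = []
    for match in CANDIDATE.finditer(doc):
        score = sum(1 for w in context_words(doc, match.start()) if w in learned)
        if score >= min_score:
            hits.append(match.group())
    return hits


# Hypothetical "clean" documents and two messier ones.
TRAIN = [
    "Employee SSN: 123-45-6789 on file.",
    "Her social security number is 987-65-4321.",
]
MESSY = "Applicant ssn 123 45 6789 was verified."
UNRELATED = "Tracking code 202005150 was shipped."

learned = learn_context(TRAIN)
print(CLEAN.findall(MESSY))                # []: the strict rule misses it
print(find_sensitive(MESSY, learned))      # ['123 45 6789']
print(find_sensitive(UNRELATED, learned))  # []
```

A real system would learn far richer representations of context than a handful of neighboring words, but the sketch shows why context can succeed where a fixed format rule fails.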

Ultimately, we’re left with a layer of intelligence that’s both fast and accurate. Once we’ve set the parameters, Auto-Redact independently goes through the data, finds sensitive information, and redacts it. It’s quicker, easier, and more reliable than any other method.

When we spoke about HIPAA redactions with Leeanne Mancari, Litigator and Co-Chair of eDiscovery and Information Management Platform at DLA Piper, she told us that “Anywhere we can reduce that burdensome process would be extremely helpful.” She concluded, “Hearing things like auto-redactions gets me very excited.”

The Takeaway

Computers excel at automating routine, tedious tasks, and this frees people up to do more interesting work that adds higher value. Before now, redaction was too complicated for computers to handle. Between their inability to understand context, semantics, and other intricacies and the high risk of failed regulatory compliance, we couldn’t trust them with independent redaction.

Those days are over. Our AI is smart enough to tackle these complex tasks. That’s how Text IQ’s Auto-Redaction delivers fast results that you can count on. Contact us to learn more.