Democratizing AI for Custom Document Classification

Accurately identifying personal information (PI) and classifying documents in unstructured data at scale is hard. It’s been a challenge for most enterprises that use existing approaches to get it consistently right, even for common, general identifiers, such as Social Security numbers (SSNs). The challenge of automating identification is even more acute for types of PI that are less general, whether specific to an industry, or even a single company. 

Identifying PI has long been important in the context of litigation, and it is now even more critical for ensuring compliance with data privacy regulations. PI in the context of data privacy encompasses both direct identifiers, like SSNs, and related information about an individual, ranging from patient identifiers and account numbers to unstructured healthcare data (or even participation in a clinical trial).

In addition to PI identification, document classification for both PI and sensitive corporate information is central to emerging data minimization requirements, whether for information governance or for regulations, like the forthcoming California Privacy Rights Act (CPRA), that require PI to be kept only as long as it is needed for a specific business purpose.

In the past, enterprises and their counsel have contended with the challenge in one of two ways: either they throw bodies at the problem, working through results manually after casting a wide net with search terms, or they painstakingly train a machine learning tool to label and code documents, often with seed sets of hundreds of thousands of pre-coded documents. These supervised machine learning approaches require upfront training by people with both domain and technical expertise, and getting such a model to produce actionable, consistent results can be a protracted process.

Text IQ’s new Custom Document Classification capability is aimed at these two key operational hurdles. The technology enables non-technical users to apply their domain knowledge to build their own detectors for specific types of PI and to refine pre-trained classifiers. It also allows them to adapt the AI to less common or industry-specific types of sensitive information without an enormous upfront investment in training the models.

Text IQ, in contrast to earlier attempts to apply AI to document classification, has built an approach based on unsupervised machine learning algorithms. The Text IQ classifier algorithm can start to infer structure and relationships between data points based on a significantly smaller dataset.

Text IQ custom detectors also let users define which types of documents to disregard. Having a good idea of what not to look for helps the AI be more certain about what it does need to look for.

Less need for experts, more benefit from machines

Our Custom Document Classification is an innovative new way to adapt Text IQ’s artificial intelligence platform to different domains and industries and train a Text IQ brain for individual clients or matters. Text IQ’s approach incorporates context, unsupervised machine learning, and iterative, continuous feedback to effectively and consistently identify and surface PI. 

Build Your Own Classifier

  • Non-technical users can build a custom classifier that can be reused across future projects by providing the machine with Keywords and Regular Expressions or a small seed set of documents via Text IQ’s intuitive user interface.
  • This information is used to make predictions within the desired data set.
  • Once enough positive and negative examples are collected, a machine learning model takes over.
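
The steps above can be pictured with a minimal Python sketch. It is not Text IQ's implementation: the keywords, the SSN-like pattern, the function names, and the threshold of 20 examples per class are all assumptions made for illustration. The sketch shows a rule-based seed detector built from keywords and a regular expression, plus a check for when enough reviewer labels exist to hand over to a trained model.

```python
import re
from collections import Counter
from typing import Sequence

# Hypothetical seed rules; the keywords and pattern are illustrative only.
KEYWORDS = {"patient id", "clinical trial"}
PATTERNS = [re.compile(r"\b\d{3}-\d{2}-\d{4}\b")]  # SSN-like shape, e.g. 123-45-6789


def seed_predict(text: str) -> bool:
    """Return True if any keyword or regular expression matches the text."""
    lowered = text.lower()
    if any(keyword in lowered for keyword in KEYWORDS):
        return True
    return any(pattern.search(text) for pattern in PATTERNS)


def ready_for_model(labels: Sequence[bool], min_per_class: int = 20) -> bool:
    """Return True once there are enough positive and negative reviewer labels.

    The threshold is an assumed value, not one stated by the source.
    """
    counts = Counter(labels)
    return counts[True] >= min_per_class and counts[False] >= min_per_class
```

In a workflow like this, `seed_predict` would surface candidate documents for review, and reviewer decisions would accumulate until `ready_for_model` reports that both classes are sufficiently represented.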

Code and Train - Customizable Coding Panel and Intuitive Training Platform

  • Teams can define their own coding panel, including document classifier categories.
  • Document categories enable reviewers to evaluate the machine’s output and provide feedback that can be incorporated.
  • Document categories learn from the content and metadata of any unstructured file - including any tagged personal information.

Why Negative Examples are Good for Positive Reinforcement 

The shortcomings of these earlier attempts at using AI for document classification point to a flaw in their design - even before the question of the relative time and effort spent training PI detectors. When building a general AI model for identifying PI, there are often documents that confuse the algorithms. For example, scientific journals are filled with medical information but rarely contain anything that would be considered to be personal information. 

Identifying, as part of the training process, the common types of documents that generate false positive hits improves precision and accuracy. This approach simplifies training the AI and removes the need to pinpoint where the machine has made an erroneous inference or assumption, a task that in the past required specialized expertise or in-depth product knowledge to rectify.
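
To make the effect on precision concrete, here is a small Python sketch under stated assumptions. The sample documents, both regular expressions, and the function names are invented for illustration, and a simple pattern stands in for whatever Text IQ actually learns from negative examples. It compares a detector that flags any medical vocabulary with one that first disregards documents that look like journal articles.

```python
import re

# Hypothetical rules for illustration; not Text IQ's actual detectors.
PI_SIGNAL = re.compile(r"\b(patient|diagnosis)\b", re.IGNORECASE)
EXCLUDE_SIGNAL = re.compile(r"\b(abstract|doi|et al)\b", re.IGNORECASE)

# (text, contains_pi) pairs; made-up sample data.
DOCS = [
    ("Patient John Roe, diagnosis: asthma, DOB 01/02/1980", True),
    ("Abstract: we study diagnosis rates in 400 cohorts. doi 10.0000/example", False),
    ("Smith et al. report patient outcomes across hospitals", False),
    ("Lunch menu for Friday", False),
]


def detect(text, use_exclusions):
    """Flag a document as containing PI, optionally skipping excluded types."""
    if use_exclusions and EXCLUDE_SIGNAL.search(text):
        return False
    return bool(PI_SIGNAL.search(text))


def precision(flags, truths):
    """Precision = true positives / (true positives + false positives)."""
    tp = sum(1 for f, t in zip(flags, truths) if f and t)
    fp = sum(1 for f, t in zip(flags, truths) if f and not t)
    return tp / (tp + fp) if (tp + fp) else 0.0


def run(use_exclusions):
    """Compute precision over the sample documents."""
    flags = [detect(text, use_exclusions) for text, _ in DOCS]
    truths = [truth for _, truth in DOCS]
    return precision(flags, truths)
```

On this toy sample, `run(False)` flags the two journal-style documents as false positives, giving a precision of one third, while `run(True)` disregards them and reaches 1.0.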

No Training Wheels Required

Custom Document Classification opens up the Text IQ platform so clients can define and train a collection of document classifiers to identify documents that contain personal information. We also provide a set of pre-built models for industry verticals and for common document types that typically contain PI, to help expedite the process of identification.

Pre-built models:

  • Curriculum Vitae
  • Adverse Event Reports
  • Case Reports (Pharmaceuticals)
  • Scientific Journals and Articles
  • Individual Tax Forms (W-9s)
  • Automated Mass Emails

Build Your Own Detectors 

The initial impetus for Custom Document Classification was the realization that general approaches to PI identification will inevitably fall short in dealing with more specific types of PI. But applying unsupervised machine learning more effectively is only part of the story for Text IQ. 

As we saw in the first wave of attempts to apply AI to document coding, it was not just design limitations that constrained their value. Requiring experts with technology and domain expertise created a bottleneck in the process. That bottleneck, in turn, restricted the scope of training that could be implemented for document classification.

With an unsupervised machine learning model that now includes the ability to easily define what documents not to include, customers don’t have to contend with that bottleneck. By democratizing AI, the Custom Classifiers put the power of machines in the hands of non-technical domain experts. 

To learn more about how Text IQ can address your document classification challenges and to schedule a demo, contact us at info@textiq.com.