Advancing Document Review: From Regular Expressions To Artificial Intelligence

Short for regular expression, a regex is a sequence of characters that defines a search pattern. It is used to find, validate, and manipulate text that matches that pattern.

Regex has traditionally been used in document review to search for specific patterns, as opposed to fixed terms and phrases, and as an analytics tool to help filter extraneous text. For many years it was an early option for redaction and document review because it improved on the manual e-discovery process. Unfortunately, regex can be complicated and requires a significant amount of technical training, especially for less tech-savvy attorneys. Additionally, regex has proven less consistent than auto-redaction options that use artificial intelligence (AI), and it may miss some important patterns during document review.

The Downsides of Regex

Search Limitations - Though regex may be helpful in identifying information made up of numbers (such as credit card or social security numbers), it can fall short when it comes to detecting free-form text. It can be difficult to write a regular expression that recognizes the vast array of letter and word combinations needed to capture information like addresses or health history.

Complicated Work Product Review - A regex condenses complex matching logic into a short, dense string of characters. That compactness is convenient to store, but it makes the pattern hard to read, which in turn makes reviewing and cross-checking the redacted information, as well as generating privilege logs, more challenging.

Technical Difficulties - Because regex requires a trained eye, fixing design flaws or “bugs” in a pattern is hard to do at the attorney level. You may need regular IT involvement.

Requires Specified Formatting - As touched on above, regex is most useful for information that appears in a limited number of known formats, such as numbers. Unfortunately, much of the information that needs to be reviewed does not follow such a regular, predictable format.
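To make the points above concrete, here is a minimal sketch in Python of regex-based redaction for numeric identifiers. The two patterns and the function name redact_numbers are illustrative assumptions, not a production rule set; real social security and card number formats vary more than this.

```python
import re

# Illustrative patterns only (assumptions for this sketch).
# SSN written as 3 digits, 2 digits, 4 digits, separated by hyphens.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 16-digit card number in groups of four, optionally separated by a space or hyphen.
CARD_PATTERN = re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b")


def redact_numbers(text: str) -> str:
    """Replace anything that looks like an SSN or card number with a label."""
    text = SSN_PATTERN.sub("[REDACTED SSN]", text)
    text = CARD_PATTERN.sub("[REDACTED CARD]", text)
    return text


sample = "SSN 123-45-6789, card 4111 1111 1111 1111, lives near Orange Grove Rd."
print(redact_numbers(sample))
```

The numeric identifiers are caught because their format is fixed and known in advance. The street reference at the end of the sample passes through untouched, since nothing in these patterns describes it.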

Today, thanks to advances in technology, there is a more reliable option for streamlining your e-discovery and redaction process: artificial intelligence (AI). As described below, AI analyzes text in a way that more closely resembles how a person reads, so it can review complicated or unstructured text with greater accuracy than regular expressions.

How AI Transforms the Document Review Process

Given the influx of electronic data sources, such as email, PDF files, and messaging services, to name a few, the volume of data that is subject to document review has grown dramatically. This means confidential information, personally identifiable information (PII), and protected health information (PHI) are harder than ever to identify and redact.

Because AI does not require extensive training and provides a more reliable review, it can save firms a tremendous amount of time and money when compared to manual review and even regex. Here are a few of the ways.

Context Recognition

The value of AI solutions using Natural Language Processing (NLP) and Machine Learning (ML) lies in their ability to assign meaning to the words or phrases they detect based on the context in which those words appear. Regex cannot accomplish this. Regex can only detect or extract information by matching it against a pattern. In instances where you are searching for a word that has multiple meanings, a regex redaction will either redact every occurrence or miss the ones you care about. Take a word like orange, which can describe a city (which may constitute PHI), a fruit, or a color. Because AI coupled with NLP and ML is capable of recognizing context, it can tell these uses apart, so the word is less likely to be mislabeled and the document is less likely to be under- or over-redacted during review.

This is a crucial feature AI has that regex lacks. Accuracy in document review is paramount for e-discovery. Oversharing or withholding relevant information can cause timing and responsiveness issues in court and with the other party.
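The orange example can be sketched in a few lines of Python. The sentences and the function name redact_orange are made up for illustration. The sketch shows only the regex side of the comparison: the pattern matches the word wherever it appears, regardless of meaning.

```python
import re

# A pattern can only ask "does this text contain the word?", not "what does it mean here?".
ORANGE = re.compile(r"\borange\b", re.IGNORECASE)

# Hypothetical sentences: a city, a fruit, and a color.
sentences = [
    "The patient was treated at a clinic in Orange, California.",
    "She ate an orange before the appointment.",
    "The folder was marked with an orange label.",
]


def redact_orange(text: str) -> str:
    """Redact every occurrence of the word, whatever its meaning."""
    return ORANGE.sub("[REDACTED]", text)


redacted = [redact_orange(s) for s in sentences]
for line in redacted:
    print(line)
```

All three sentences are redacted, although only the first refers to a location. Deciding which occurrence is a city requires the kind of context analysis the article attributes to NLP-based tools.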

Evaluation of Information in Multiple Formats

Another advantage AI has over regex is the ability to conduct an intelligent analysis of unstructured data. Regex relies heavily on information being presented in a certain format in order to detect it accurately. AI, using NLP, understands language and can, in essence, “read” a document much as a human would, understanding the context beyond just the four corners of the page. Functions like semantic analysis, word clustering, keyword recognition, and relationship building enable information embedded in various structured and unstructured formats to be detected more easily. Plus, AI completes the review significantly faster and with greater accuracy.

For example, in order for regex to detect relevant information such as an address, the user would need to compile every potential address format contained in their documents and write search patterns for each one. With AI, however, you can train your system on examples of addresses, and as it works through documents it will begin to pick up on context and eventually be able to identify addresses without anyone manually entering their formats.
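A small Python sketch shows why the address case is hard for regex. The pattern below is a hypothetical rule for one common US street-address shape (number, capitalized street name, street-type suffix), and the sample sentences are invented for illustration.

```python
import re

# Hypothetical pattern for ONE address shape: "1600 Pennsylvania Avenue".
STREET_ADDRESS = re.compile(
    r"\b\d{1,5}\s+(?:[A-Z][a-z]+\s+){1,3}"
    r"(?:Street|St|Avenue|Ave|Road|Rd|Boulevard|Blvd)\b"
)

# Invented samples, each expressing an address in a different way.
samples = {
    "street": "Ship to 1600 Pennsylvania Avenue today.",
    "po_box": "Her mailing address is P.O. Box 1234, Springfield.",
    "foreign": "He lives at Flat 4B, 12 rue de la Paix, Paris.",
    "descriptive": "The house is on the corner of Fifth and Main.",
}

# Record which samples the single pattern manages to detect.
found = {label: bool(STREET_ADDRESS.search(text)) for label, text in samples.items()}
print(found)
```

Only the first sample matches. Each of the other shapes would need its own pattern, which is the format-by-format compilation effort described above.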

Broader Data Set Recognition

Regex focuses on the specific data sets and parameters set by the programmer. This makes it difficult to conduct a search without extensive and tedious keyword entry, and it limits your search to only what has been specifically programmed. With regex, you are more likely to miss information that was not considered beforehand and written into your search. This can often result in multiple rounds of review.

With AI, since machine learning can be incorporated, your technology can learn to identify information based on context and syntax even if you don’t specify a search term, such as a name. This is especially useful for redacting the names of people or cities. AI can analyze the words surrounding a name, including pronouns that refer back to it, to identify and extract names not specifically entered into your search. You can also create comprehensive master lists of such data for future use.
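The limitation of a pre-programmed search can be illustrated with a short Python sketch. The name list, the memo text, and the function name redact_known_names are all hypothetical. The sketch covers only the regex side: it redacts what was entered in advance and nothing else.

```python
import re

# Hypothetical list of names entered before the review started.
KNOWN_NAMES = ["Alice Johnson", "Robert Lee"]

# Build one alternation pattern from the list; re.escape guards special characters.
NAME_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(name) for name in KNOWN_NAMES) + r")\b"
)


def redact_known_names(text: str) -> str:
    """Redact only the names that were programmed into the search."""
    return NAME_PATTERN.sub("[REDACTED NAME]", text)


memo = "Alice Johnson emailed Priya Raman, who forwarded it to Robert Lee."
result = redact_known_names(memo)
print(result)
```

The name that was not on the list survives redaction, so the reviewer would need to add it and run another pass. A tool that infers names from context is meant to avoid that extra round.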

User Friendly Interface & Easy Work Product Review

AI tools do not require as much training as regex does. Most AI software vendors offer initial training and ongoing software support. Regex, by contrast, requires in-depth training on how to write patterns correctly, how to troubleshoot them, and more.

Additionally, because AI strives to emulate human behavior, the redacted final product will be in a format that is easy to read. This is important because, even though AI is an accurate method of review and automated redaction, you should always (especially at the beginning of implementation) review the software’s work to check for potential errors, human or machine, and to understand what future document review criteria may be helpful for the program to learn.

Ultimately, while regex may have been useful for more specific and traditional document reviews, AI is a better option for long-term growth as the amount of unstructured data continues to grow. AI solutions get better over time through continued use and ongoing supervised and unsupervised machine learning. So as AI continues to learn and evolve, and as data gets broader and more complex, firms will find regex an increasingly outdated and inaccurate method of review.

For data-rich industries like law, AI solutions using natural language processing and machine learning can help prevent costly document review mistakes, speed up the redaction process through automation, and allow firms to tailor their review software to their specific practice needs.
