Professor of Engineering at Columbia University, and Text IQ Advisor
My favorite machine learning study did not appear in a machine learning journal. It is a study of ancient China, which says a lot about the versatility of AI.
Professor Bol is the Vice Provost at Harvard, a professor of East Asian Languages and Civilizations, and the Chair of Harvard’s China Biographical Database Project, which is building a visual catalog of all of China’s biographical information, based on records that stretch back to the 9th century.
I attended a talk where Professor Bol described an unsolved mystery he arrived at in one particular ancient Chinese region, whose societal inner workings seemed to have been lost to time. He and his researchers were struggling to piece together how people advanced through this society to achieve favor and coveted positions. Professor Bol's team had a lot of data in the form of thousands of hand-written letters, but they couldn't make sense of it. That's because the social understanding that animated these letters was intuitive and below the surface, based on an invisible shared knowledge.
The ancient Chinese in this region knew “how things worked.” They knew who the big-shots were, how to approach them, and all the nuances of what to say and why. Eventually these unwritten rules disappeared, but somewhere in the pile of letters, the traces of the social conventions remained—a hidden trove of analog big data.
Stuck and intrigued, the researchers teamed up with artificial intelligence practitioners and fed the data through an unsupervised learning algorithm. What it revealed were things that humans couldn’t have figured out on their own, or at least not without an impossibly large amount of time and patience. The model showed clear patterns of how people changed roles in the society. It revealed the power structure by threading a small network of nodes into a constellation. These were the power brokers the researchers were looking for.
AI had advanced our understanding of ancient human behavior. This raises the question: what else can AI advance?
Professor Bol was using unsupervised learning for its best possible use case: to tell him what he didn’t know he needed to know. Supervised machine learning can’t do that, because it only looks at what we tell it to look at. Unsupervised learning looks at everything.
Two years ago, I joined an AI company as an advisor. Text IQ is building AI for sensitive information to help enterprises and government agencies reduce their latent risk. The problem that Text IQ is working on is a multibillion-dollar one: the wrong kind of information slipping to the public or a competitor can topple a merger, incur huge fines, derail high-stakes litigation, and cause reputational harm.
Like the ancient Chinese who left traces of their shared knowledge in the public record, today’s humans leave traces of risk in the massive datasets of their organizations. These are sensitive needles in a haystack: codewords describing insider trading, delicate private emails from professional accounts, "hot documents" that can evince anti-competitive conduct, and the vestiges—everywhere—of personal data that now spells significant liability in the face of privacy regulations like GDPR and CCPA.
The compliance and risk spaces have become crowded with startups that are using artificial intelligence to help organizations protect themselves from these inside threats. Many of these startups are using supervised machine learning, because it is generally faster and less complex to develop.
At Text IQ, we are using unsupervised machine learning. We believe this technique provides significantly better results for reducing the risk, time, and cost of detecting sensitive information.
Here are three key advantages of using unsupervised machine learning to manage latent risk.
1. Unsupervised machine learning sees things we don’t tell it to see
Supervised machine learning is the basis for some of the most powerful AI technologies: speech recognition (“What did this person say?”), pharmaceutical research (“Is this molecule a drug?”), and self-driving cars (“Is this a stop sign?”). In these applications, we take what’s seen in the real world and we tell the model how to interpret what it saw. The function takes an input and makes a prediction about the output, and by minimizing the error of that function, we’re trying to get things right as often as we can.
The benchmarks for supervised learning are clear. Did it interpret the person’s speech correctly? Did it identify the drug? Did it stop at the stop sign? Each of these can be measured quantitatively, as the fraction of cases the model gets wrong.
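To make the idea of "minimizing the error of a function" concrete, here is a minimal sketch in Python. It is not taken from any of the systems mentioned above: the toy data, the single feature, and the function names are all assumptions chosen for illustration. It fits a one-feature logistic regression by gradient descent on labeled examples, then reports the benchmark described above, the fraction of examples the model gets wrong.

```python
import math

# Hypothetical labeled data: (feature, label). The feature could be, say,
# a count of suspicious words in an email; label 1 means "spam".
examples = [(0.0, 0), (1.0, 0), (2.0, 0), (3.0, 0),
            (5.0, 1), (6.0, 1), (7.0, 1), (8.0, 1)]

def predict_proba(w, b, x):
    """The model: maps an input to a predicted probability of label 1."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def train(data, lr=0.1, steps=20000):
    """Fit w and b by gradient descent on the average log loss."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        # Gradient of the average log loss with respect to w and b.
        grad_w = sum((predict_proba(w, b, x) - y) * x for x, y in data) / n
        grad_b = sum((predict_proba(w, b, x) - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def error_rate(w, b, data):
    """Benchmark: the fraction of labeled examples the model gets wrong."""
    wrong = sum((predict_proba(w, b, x) >= 0.5) != bool(y) for x, y in data)
    return wrong / len(data)

w, b = train(examples)
print("error rate:", error_rate(w, b, examples))
```

Note that every step depends on the labels: the gradient compares each prediction with the known answer, and the benchmark counts disagreements with it. Without the labels, there is nothing to minimize.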
However, for use cases that call for us to understand human-generated meaning, there's a built-in limit to what supervised learning can imagine, because supervised algorithms only see what they’re allowed to see (e.g. here’s an image), in order to make a prediction it’s instructed to make (e.g. is there a human in it?).
In a sense, unsupervised learning is held to a higher standard, because its end goal is loftier. Rather than making a prediction, it is asked to extract useful and actionable information. If supervised learning is a laser, then unsupervised learning is a floodlight, ingesting all the data there is and surfacing buried patterns.
Its expansive “data view” is why unsupervised learning can provide deeper insights into the hidden meanings and relationships in the artifacts of human interactions. This is true whether the artifact is an ancient Chinese letter or an enterprise chatlog.
2. Unsupervised machine learning can more easily leverage big datasets
Supervised learning requires examples of correct answers to learn how to accurately predict outputs for given inputs. For example, an algorithm that can detect email spam requires a large set of emails that are labeled spam or not spam. The process for labeling this data is slow and expensive.
One way to address these time and cost challenges is to crowd-source data labeling. If we give our email provider access, it can learn about those emails we mark as spam and those we don’t. We can also hire humans to tag emails manually. However, for many use cases, and in the enterprise space specifically, crowd-sourcing data labeling is impossible, because the information the data contains is sensitive, proprietary, or esoteric.
Unsupervised learning side-steps all these challenges. Because it simply looks for patterns in data, unsupervised learning doesn’t require a “cheat sheet” of labeled data. Also, unsupervised learning can lead us to a different kind of label: labeled patterns rather than labeled data.
Discovering new patterns is possible with unsupervised learning because the model usually uncovers a structure that is interpretable. An expert can make meaningful explanations about what the patterns mean and how they inter-relate, and label them.
These labeled patterns can lead us to a labeled dataset, because we can look at which patterns any data point exhibits, and label that data point accordingly. This is a much easier process than the other way around, because there are far fewer patterns than data points.
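The path from labeled patterns to a labeled dataset can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any production system: the two-dimensional points, the choice of plain k-means as the pattern finder, and the label names "routine" and "sensitive" are all hypothetical.

```python
from math import dist

# Hypothetical 2-D feature vectors for eight documents, e.g. two scores
# produced by some upstream text model. No labels are attached.
points = [(0.1, 0.2), (0.0, 0.4), (0.3, 0.1), (0.2, 0.3),
          (5.0, 5.1), (5.2, 4.9), (4.8, 5.3), (5.1, 5.0)]

def k_means(data, centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then update."""
    centroids = list(centroids)  # do not mutate the caller's list
    assignments = []
    for _ in range(iterations):
        assignments = [
            min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in data
        ]
        for k in range(len(centroids)):
            members = [p for p, a in zip(data, assignments) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignments

# Step 1: the unsupervised model groups the data into patterns (clusters).
assignments = k_means(points, centroids=[points[0], points[-1]])

# Step 2: an expert reviews each pattern once and names it (assumed names).
pattern_labels = {0: "routine", 1: "sensitive"}

# Step 3: every data point inherits the label of the pattern it exhibits.
labeled = [(p, pattern_labels[a]) for p, a in zip(points, assignments)]

for point, label in labeled:
    print(point, label)
```

The expert makes two decisions here, one per pattern, and eight data points end up labeled. On a real dataset the ratio is what matters: the number of patterns to review stays small while the number of documents grows.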
The better the model is designed, the better the patterns it will discover, and the deeper the insights it will produce. Designing sophisticated unsupervised models requires deep data science experience, along with some art.
3. Unsupervised machine learning keeps humans in the loop
It is hard to imagine how unsupervised learning could work without a human in the loop. Somehow it has to provide information to a human, almost like an advisor saying, “Look at what I found.” Supervised machine learning, on the other hand, is often designed to take humans out of the loop. Self-driving cars are the clearest example of this, and also one of the most ambitious.
However, in highly regulated industries, it’s critical to have humans in the loop, so we can be the ones to consider the early results, perform the nuanced analyses, and make the final calls.
The algorithm used in Professor Bol’s study didn’t care that the letters tended to flow to certain power centers. When it found that there were dominant nodes, it didn’t ask why. It only said: look over here at this clear pattern. And when the researchers traced these patterns, the ancient, unwritten rules began to reveal themselves, like invisible ink held up to a flame.
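The actual algorithm used in the study is not described here, so the following Python sketch swaps in a much simpler stand-in: counting how many letters each person received. The correspondence records and the names in them are invented for illustration. The point is only to show what "dominant nodes" look like when they surface from raw sender and recipient pairs, with no one telling the program whom to look for.

```python
from collections import Counter

# Hypothetical correspondence records: (sender, recipient) pairs.
letters = [
    ("scholar_a", "official_x"),
    ("scholar_b", "official_x"),
    ("scholar_c", "official_x"),
    ("scholar_e", "official_x"),
    ("scholar_d", "official_y"),
    ("scholar_a", "official_y"),
    ("official_x", "official_y"),
    ("scholar_c", "scholar_d"),
]

# Count incoming letters per person (the in-degree of each node in the
# letter network). This is a simple stand-in for a real network model.
received = Counter(recipient for _sender, recipient in letters)

# Surface the most-written-to people for a human expert to interpret.
dominant = received.most_common(2)
print(dominant)
```

The program reports who sits at the center of the network, but not why. Explaining the pattern is left to the human researcher.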
Text IQ utilizes unsupervised machine learning to reach multi-dimensional insights like these: not just the sensitive needle in the haystack, but the story of how it got there.