How Good Is Your AI? Best Practices for Testing

Saied Seghatoleslami
Jan 21, 2019

The best Artificial Intelligence (AI) goes unnoticed. In a busy world that continues to move faster with each passing day, AI and automation are enabling people to keep up and stay connected by streamlining and improving everyday tasks. When AI operates correctly, no one notices. But when it goes awry, everyone notices, and quickly abandons the offending platform.

Consumers are far less forgiving of customer support technology than they are of slip-ups by their fellow humans. That’s why it’s absolutely imperative for companies to get their AI right before deploying in a consumer-facing manner. This is particularly true for conversational AI, which is designed to streamline a consumer’s communications and transactions with a company.

In a previous post, I detailed why conversational commerce solutions require more than just great AI. And that’s absolutely true. But great AI does need to be the foundation of any conversational interface. So how can you judge your AI’s greatness? That’s what we’re discussing here.

How to Gauge and Ensure Accuracy

There are a few measures by which you can judge the greatness of your AI, and the foremost is accuracy. As a calculation, accuracy is pretty simple. But ensuring this measure is meaningful is a bit more complicated.

Gauging accuracy requires you to feed your conversational AI a bunch of utterances. Some of the utterances are relevant, and you hope that your AI recognizes them as relevant and classifies them correctly. When it does, we call these true positives. The other utterances are irrelevant, and you hope that your AI recognizes them as such and disregards them. That’s called a true negative. Accuracy, quite simply, is the ratio of true positives and true negatives to the entirety of the utterances that were fed to the AI engine.
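
To make the calculation concrete, here’s a minimal sketch in Python. Everything in it is hypothetical: `classify` stands in for your AI engine’s prediction call, and each test utterance is paired with the label you expect, with `"irrelevant"` marking out-of-scope input.

```python
def accuracy(test_set, classify):
    """(true positives + true negatives) / total utterances fed to the engine."""
    correct = sum(classify(utterance) == expected
                  for utterance, expected in test_set)
    return correct / len(test_set)

# A deliberately naive stand-in for an AI engine that only spots bookings.
def toy_engine(utterance):
    return "booking" if "book" in utterance.lower() else "irrelevant"

test_set = [
    ("book me a flight to Boston", "booking"),   # true positive if recognized
    ("what's the weather like?", "irrelevant"),  # true negative if disregarded
    ("I'd like to reserve a room", "booking"),   # a curve ball this engine misses
]
print(accuracy(test_set, toy_engine))  # 0.666... (2 of 3 correct)
```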

But there are some important caveats to getting accuracy right:

  • To properly test an AI engine, you have to feed it a good balance of what you expect it to recognize versus out-of-scope utterances. You need to ensure that when the AI encounters data in the wild, it can recognize what it is and is not supposed to recognize.
  • You must keep training data and testing data separate. Training data is the data that is used to help the AI hone its understanding. Not surprisingly, then, AI engines do really well with this data. Testing data needs to throw the engine some curve balls (see the split sketch just after this list).
  • Similarly, it’s important to get your training data and testing data from separate, unbiased sources. If you build the training data and testing data personally, or as part of a team working closely together, it’s likely that the two sets will be very similar. As a result, your test results will likely come back looking deceptively good.
  • Both training and testing data have to reflect the fact that native speakers of a language can still speak very differently from one another. AI engines must be adept at Natural Language Understanding and taught to account for regional and cultural differences in the way that they interpret questions.
  • For these reasons, it is very important to make sure that there is true variation in the training and testing data.
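
One simple way to keep the two sets from overlapping is a shuffled random split, sketched below under the assumption that utterances arrive as (text, label) pairs. Note that a random split of a single pool only prevents overlap; it does nothing about the bias problem above, which is why independently sourced testing data is still preferable.

```python
import random

def split_train_test(utterances, test_fraction=0.2, seed=42):
    """Shuffle, then split; the 80/20 ratio is just a common default."""
    data = list(utterances)
    random.Random(seed).shuffle(data)  # fixed seed keeps the split reproducible
    cut = int(len(data) * (1 - test_fraction))
    return data[:cut], data[cut:]  # (training data, testing data)
```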

Accuracy Alone Is Not Enough

Accuracy is important. But on its own, it’s insufficient. That’s because there are innumerable possible utterances in the world that can be fed to an AI engine, and the vast majority of them are irrelevant to the job of the AI. So a simple way of faking accuracy would be to dump in a ton of random utterances. If the AI engine declared everything as irrelevant, it would be almost 100 percent accurate.
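
A toy example makes the problem obvious. Suppose 990 of 1,000 test utterances are irrelevant; an engine that rejects everything scores 99 percent without understanding a single request:

```python
def reject_everything(utterance):
    return "irrelevant"  # an "engine" that understands nothing

test_set = [(f"random chatter {i}", "irrelevant") for i in range(990)]
test_set += [("book me a flight", "booking")] * 10

correct = sum(reject_everything(u) == label for u, label in test_set)
print(correct / len(test_set))  # 0.99 -- impressive-looking, totally useless
```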

“Faking it” on AI accuracy is a recipe for long-term disaster. That’s why we need two other key measures to be a part of testing: precision and recall.

To calculate recall, you give the AI engine only relevant utterances. How many does it get right? That’s recall.

Then there’s precision. To calculate precision, look at all of the utterances that the AI engine declared as relevant and see how it classified them. How many did it get right? That’s precision.
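
Under the same hypothetical setup as the accuracy sketch above (a stand-in `classify` call and labeled test pairs, with `"irrelevant"` marking rejections), the two measures look like this:

```python
def recall(relevant_test_set, classify):
    """Feed the engine only relevant utterances; what fraction does it get right?"""
    correct = sum(classify(u) == label for u, label in relevant_test_set)
    return correct / len(relevant_test_set)

def precision(test_set, classify):
    """Of the utterances the engine declared relevant, what fraction were right?"""
    predictions = [(classify(u), label) for u, label in test_set]
    declared = [(pred, label) for pred, label in predictions
                if pred != "irrelevant"]
    if not declared:
        return 0.0
    correct = sum(pred == label for pred, label in declared)
    return correct / len(declared)
```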

In the case of both recall and precision, notice that you avoid giving the AI engine irrelevant utterances that lead to true negatives (and can easily be padded). By doing this, as well as keeping testing data separate from training data and avoiding bias in the testing data, you make it a lot harder to cheat when it comes to determining the strength of your AI.

It’s human nature to want our carefully constructed systems to perform well when tested. But neglecting best practices in the pursuit of higher test scores will benefit absolutely no one in the long run. Remember: Bad AI will always be discovered. It’s better you discover it yourself. Because if you wait for consumers to discover it during their digital engagements, you will lose them forever.

Originally published in [AI]thority.