With Artificial Intelligence, everything is possible today, isn’t it? We have Machine Learning and Neural Networks and all that stuff. Machines can help your customers online via chatbot (if they need help with the right things), categorize text by topic (sometimes) and tell what’s depicted in an image (okay, that works quite well… usually).
Yes, of course there is still a lot of work to do. We need to improve the accuracy, we need a bit more computing power and we have to talk about the social implications. But besides that, we can use AI to solve all our business problems!
Not quite. Why? There are a lot of answers to that, from legal issues to result interpretation to infrastructure to cost. But the one problem that affects every single AI is rarely mentioned: data. The reason: the leaders in AI technology – universities, Google, Apple and so on – don’t have that problem. But if you start using AI, you will run into it.
See, all AI must be trained first. You need training data for that. Then you must check the quality of the models produced during training. You need separate validation data for that. And to ensure the final model really works, you must run it against test data it has never seen. And don’t think small datasets will do for each of those tasks. The newest generation of AI technologies in particular needs lots and lots of data. Depending on the variance within your data, you may want to start with several hundred records, better thousands.
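To make the three datasets concrete, here is a minimal sketch of one common way to cut a single pool of records into three disjoint sets. The function name `split_records`, the 70/15/15 ratio and the fixed seed are assumptions chosen for illustration, not a rule.

```python
import random


def split_records(records, train=0.7, validation=0.15, seed=0):
    """Shuffle records and cut them into three disjoint lists.

    Whatever remains after the training and validation shares
    becomes the test set.
    """
    shuffled = list(records)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)

    n_train = round(len(shuffled) * train)
    n_validation = round(len(shuffled) * validation)

    train_set = shuffled[:n_train]
    validation_set = shuffled[n_train:n_train + n_validation]
    test_set = shuffled[n_train + n_validation:]
    return train_set, validation_set, test_set


# Stand-in data: the numbers 0..999 play the role of 1000 records.
train_set, validation_set, test_set = split_records(range(1000))
print(len(train_set), len(validation_set), len(test_set))
```

With only a few hundred records, a 15% slice leaves very little to check a model against, which is one reason the required numbers climb so quickly.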
But there is Big Data! Or we could use our company’s databases, data shares and SharePoints!
How will you gather this data? It is often hard to find a single source providing enough of the data you need. Combining several sources is even harder since all records must be in the same format. If you are planning to build an AI, be aware that you might end up investing more time in preprocessing than in actually building and testing your models. You will also need to involve specialists who know the data you are processing and can tell you how to structure it.
Oh, okay, so we will do all that. May take a while, but at some point we will have enough data and then we can feed it to the algorithm.
You will get a result and at first glance it might even look marvelous. Your AI for sorting the documents on your intranet into “useful” and “garbage” classifies 97% of the records correctly. That’s great! Until you realize that the AI just decided to label all documents as garbage, regardless of their quality.
That is what happens if your dataset is unbalanced. In the example case, only 3% of the documents were of high quality, so by putting all the documents in one class the AI reached a high accuracy and an even higher degree of uselessness. Balancing your dataset is a difficult task. You need enough examples for every class and a representative amount of variance within the data. If all training records for one class are from the same author, from the same time period, on the same topic or share whatever other connections might be hidden within your data, the model will work in testing, but will completely fail in operation.
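The arithmetic behind that 97% is easy to reproduce. The sketch below assumes a hypothetical intranet of 100 documents and a "model" that always answers "garbage"; it then compares plain accuracy with the share of useful documents that were actually found.

```python
# Hypothetical intranet: 97 garbage documents, 3 useful ones.
true_labels = ["garbage"] * 97 + ["useful"] * 3

# A "model" that ignores its input and always answers "garbage".
predictions = ["garbage"] * len(true_labels)

# Accuracy: share of all documents that were labeled correctly.
correct = sum(t == p for t, p in zip(true_labels, predictions))
accuracy = correct / len(true_labels)

# Recall for the class we care about: how many of the useful
# documents did the model actually find?
useful_total = sum(t == "useful" for t in true_labels)
useful_found = sum(
    t == "useful" and p == "useful"
    for t, p in zip(true_labels, predictions)
)
useful_recall = useful_found / useful_total

print(f"accuracy: {accuracy:.2f}")              # accuracy: 0.97
print(f"recall (useful): {useful_recall:.2f}")  # recall (useful): 0.00
```

Looking at a per-class figure next to the overall accuracy is a cheap way to spot this failure before anyone celebrates.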
There are techniques that can, to some degree, handle the bias resulting from unbalanced data. But you should not rely too heavily on them. Thus, you don’t just need thousands of records – you need thousands of records for every single class your AI should identify. And you need a subject matter expert to help you uncover the hidden dependencies.
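One of the simpler techniques for handling unbalanced data is to weight each class inversely to its frequency, so that mistakes on the rare class cost more during training. The sketch below uses the common heuristic `n_samples / (n_classes * class_count)`. The function name `balanced_class_weights` is hypothetical, and how such weights are handed to a learning algorithm depends on the library you use.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return one weight per class, inversely proportional to its frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Same hypothetical 97/3 split as in the intranet example.
labels = ["garbage"] * 97 + ["useful"] * 3
weights = balanced_class_weights(labels)
print(weights)
```

Weights like these only rebalance the arithmetic. With three useful documents, the model still sees just three examples of what "useful" looks like, which is why more real data per class remains the better fix.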
*Sigh*… okay. We collected the data, we processed it and we ensured the set is balanced. But now, finally, AI can solve the problem we initially set out to solve?
Well, congratulations, you achieved something that can take years and, in many cases, is even impossible. Besides that… No. But you finally reached the point where you can start coping with all the issues of AI itself! Find the best algorithm for your use case, set up the necessary infrastructure, solve the model selection problem, find an efficient way to interpret the results and learn how to take advantage of your findings. Then AI will help you to solve this one single business problem.