Nowadays machine learning is everywhere. The various types of models are applied to a wide variety of applications: They are used in classical predictive analysis, but also for image and speech recognition, in playing games like Jeopardy!, Go or World of Warcraft and are the backbone of autonomous driving cars.

Many different types of models are available now, from simple linear regression to boosted decision trees and various types of neural networks. Some of the models are dedicated to a specific task, for example word2vec for text processing, while others can easily be applied to all kinds of problems, like boosted decision trees.

The huge success of machine learning models over the past years has made them incredibly popular and they have gained a lot of interest not only inside the data science community but can also be found in almost any context that has to do with data. Part of this success comes from the fact that machine learning is easy to use. The code is usually openly available and comparably little computing power is required for the basic tasks. Thus, a chunk of data together with a regular notebook and a dedicated toolkit like R or python is sufficient to build a machine learning model.

The algorithms themselves are wrapped in convenient functions within the available machine learning libraries and many options exist to automate the different steps in the training. This has strongly simplified the full process. Only a few lines of code are needed and out comes a fancy machine learning model. The algorithms and sometimes even the data preparation and the application itself have become a black box, which tempts us to apply the methods without thinking much.

In addition, the model performance is usually evaluated by certain metrics like the RMSE, which boil the prediction quality down to a few numbers. Since those metrics are generally model independent and the data is usually quite complex, fully understanding and assessing the model outcome has become a challenge.

In this context, one might simply pick the model with the best metrics and implement it in the foreseen application. However, the model might perform poorly on new data. How did this happen?

In the following, three potential obstacles will be introduced, that just might lead to this scenario.


Data range

Let’s start with an obvious example from image classification to explain the problem at hand. An algorithm well suited for this task has been trained to identify images with animals present. However, the training data only consisted of cats and dogs. The full data will also include other animals like birds or fish. It is quite clear that the algorithm will not perform well, since the range of the test data vastly exceeds the range of the training data.

However, in most cases the discrepancy between training data and new data might be less obvious. Consider for example a numeric dataset with two variables x (input variable) and y (target variable). The goal is to model the linear dependence of the two variables by predicting the value of y for a given x. The dataset consists of datapoints in a range of x between 10 and 40 and between 60 and 80 (see the plot below, yellow data points).

A regression model has been trained on this data. What is the model’s behavior for new data that falls within the gap, from 40 to 60? The outcome strongly depends on the model. Some models might be capable to interpolate well, others might return senseless predictions. A regression model can be able to make good predictions by applying the trained formula, which might be valid in this region. A decision tree on the other hand did not learn anything about the behavior outside the range of the training data and simply predicts the value of the data point that is closest to the new data point. Unfortunately, no information stating the reliability of the individual predictions is provided out of the box, for example indicating whether the new data point lies within the training data’s range.


Data Quality

The second example is a rare event classification task. A classification algorithm is trained to distinguish between two classes, A and B. One class, A, is abundant, while the other, B, is quite rare. This example might come from credit card fraud detection, where a few illegal transactions (here class B) are embedded in data, that is mostly normal (class A). Let’s say that 2% of all events are fraud.

A corresponding model has been trained on this type of data and is applied to unseen data. It does very well on classifying new events of type A. However, it does not recognize any event of class B. Instead they are regarded falsely as class A as well. The model shows an accuracy of 98%, which determines how precise the predictions are. Actually, this number does not sound too bad. However, the precision of class B events is as low as 0, since none of these events have been labeled correctly.

One of the reasons for this might lie in the quality of the training data, which might not be good enough. This can be the case if the data is especially noisy and imbalanced. The true pattern, that allows to distinguish the few rare events of class B from the large amount of data from class A, is invisible within the general noise present. Thus, the training data does not represent the task at hand well. The only solution is to improve the data quality by collecting more events, that are better distinguishable or by cleaning the data and trying to reduce the noise.


Performance metrics

The usual metrics, which are calculated to assess a model’s performance, are the RMSE (root mean squared error) and MAE (mean absolute error ):

N denotes the number of data points, y the target variable and ȳ the prediction.

In the following example two models are compared to a given dataset, where the target variable y fluctuates around zero. The first model predicts the average of all datapoints, which is 0. The second model represents a sine curve.

For both models, the corresponding RMSE is 1 and MAE is 0. However, the models are far from identical. Which model does describe the data correctly? Without more information on the data, this is not clear at all. If the data is fluctuating for a reason, like temperate measured once during daytime and once every night, then the first model does not capture this. If the fluctuations are totally random, then the second model has clearly overfitted the data.



These examples highlighted only on a few of the pitfalls that come with black box modeling. Obviously there are many more, for example the sometimes large set of hyperparameters, which come along with most models and which provide different tunes or variants of the original model. However, choosing the correct set of parameters is not intuitive and might even lead to the wrong model.

As a summary, the following guidelines may help to avoid a few of the shortcomings of poor modeling.

  1. Training data coverage: The training data should cover the full input space that is to be expected for the use case. Otherwise, the model should be able to extrapolate or interpolate well to the unknown regions.
  2. Data Quality: Models, which are supposed to perform on noisy and/or unbalanced data, can be much improved, if the data is cleaned beforehand. This could include outlier removal, smoothing (noise) or resampling (imbalanced data).
  3. Model choice: While choosing the type of model, the model’s basic assumptions should be considered. For example, a linear model, like regression, is only valid to model a linear dependency between input and target variable. It will not consider interactions between different input variables.
  4. Hyperparametertuning: The model’s own parameters can significantly influence the performance. For example, overfitting can be prevented by adjusting a certain parameter. Hence, tuning the hyperparameters might significantly improve a model’s quality. Unfortunately, this task requires a fair amount of knowledge on the hyperparamters and a lot of computing power and time to test as many combinations of hyperparameters as possible.
  5. Crosschecking the results: It is crucial to not only rely on the main performance metrics but also to have a look at the predictions. Residuals and time series plots can be immensely helpful.

And as always, a careful and skeptical look at everything is a good starting point.

Do you have any question? Simply email to:

Date: December 2019
Author: Verena Baussenwein
© 2019 avato consulting ag
All Rights Reserved.