Production Monitoring 4.0 in the Paper Industry

Initial Situation:

How many people are needed to operate a paper machine optimally? Due to the high degree of automation, day-to-day operation is possible with a very small production team. Over the last 10 years, some paper manufacturers have increased the production volume per employee by a factor of 10! At the same time, paper production is and remains a complex dynamic process with many possible settings and influencing options in a complex production plant. Given the high and still increasing number of sensors, fully manual monitoring of the production process by only a few people is impossible in practice. As a result, problems in the plant or in operating settings often go undetected. The consequences are unplanned downtimes and quality deterioration in the end product. In many cases, only time-consuming ex-post analyses are possible. Even though process control systems offer alarm functionality, their checks are rule-based, using static limits that do not take operating mode, grades or changes in settings into account. As a result, end users are flooded with alarms, which is why these alarm functions are usually used only to a very limited extent.

Smart Data Approach:

Fully automated, dynamic monitoring of thousands of process signals, with alarms raised on unusual patterns in the sensor data, allows problems in production to be identified early. With this new insight derived from data, downtimes can be prevented and product quality improved. In the Smart Data alarming system, the normal behaviour of the machine is continuously and dynamically derived from historical data, taking grades and operating modes into account. Dependent alarms are summarized and prioritized according to importance. In addition to sensor data, monitoring can also be flexibly applied to other data, such as quality parameters, or to calculated indicators such as raw material consumption. Resulting alarms are presented in a user-friendly interface, where end users can investigate and process them further with extended analysis functions.


  • Increase of OEE – potentially saving several hundred thousand euros per year
  • Prevention of Downtimes
  • Improved quality of the final product
  • Predictive maintenance


  • Real-time Monitoring
  • Dynamic calculation of threshold values
  • Consideration of grades and production modes
  • Prioritization of alarms
  • Automated monitoring of raw material and energy consumption

Production Monitoring 4.0 in the Paper Industry – Reduced downtime, improved quality, predictive maintenance

Do you have any questions? Simply send an email to:

Date: January 2020
Author: Leon Müller
© 2020 avato consulting ag
All Rights Reserved.

avato customer Steinbeis wins SAP Quality Award

avato customer Steinbeis Papier GmbH wins the SAP Quality Award 2019 in the Innovation category for its Industrie 4.0 project.

Industrie 4.0, IoT, Big Data, AI – just catchwords or ingredients for successful digitalization in medium-sized businesses? A focus on relevant business results and an excellent team, combined with the intelligent use of modern technologies, led to convincing results for the medium-sized manufacturer of sustainably produced recycled paper. The jury of the SAP Quality Award 2019 therefore awarded the project Industrie 4.0@Steinbeis, supported by avato, gold in the Innovation category.

If you would like to learn more about the project, Steinbeis Papier and the services of avato consulting, read on…

SAP Quality Award Gold in the category Innovation for Industrie 4.0 @ Steinbeis

In the middle of 2017, Steinbeis Papier GmbH chose avato consulting ag as a strategic partner for its digitization initiative under the title Industrie 4.0 @ Steinbeis Papier. In a short preparation phase, the goals were defined and possible use cases were evaluated and prioritized for implementation in a roadmap.

The implementation started at the beginning of 2018. The technical platform was built up, more than 25,000 sensors were integrated, and the prioritized application scenarios were implemented in several release cycles.

The project led to impressive results for the medium-sized manufacturer of sustainably produced recycled paper. These not only convinced the Steinbeis management but also the jury of the SAP Quality Awards 2019, and the project was awarded gold in the Innovation category in mid-December. The jury was particularly impressed by how the joint team of Steinbeis and avato significantly improved the production and maintenance processes within a tight time and budget frame through an intelligent and innovative use of modern technologies and a data-driven agile approach.

In the first project phase, the focus was on optimizing the production processes and the value chain. In particular, the early automated detection of unusual events and procedures in the production facilities and the production process delivers significant improvements in yield and quality, and also helps to prevent expensive machine and plant breakdowns.

Currently, the integrated database from production and business management applications is used with the advanced analytics and machine learning tools to implement various use cases in materials management, purchasing and controlling.

The solution used at Steinbeis is based on the avato Smart Data Framework. The in-memory database SAP HANA is used to collect data from the various production and quality systems and to analyze it in conjunction with information from the MES and SAP system. In addition, various modern machine learning algorithms and IT tools are used to quickly and cost-effectively process the large amounts of data into practically usable information.

If you would like to know more about the procedure, the tools used and the experiences, please contact us or browse through our blog.

XGBOOST: Differences between gbtree and gblinear


The XGBoost framework has become a very powerful and very popular tool in machine learning. This library contains a variety of algorithms, each of which comes with its own set of hyperparameters. This makes it possible to combine many different variants and flavors of these algorithms within one package. We can model various classification, regression or ranking tasks by using trees or linear functions, by applying different regularization schemes and by adjusting many other aspects of the individual algorithms.

These options are governed by hyperparameters. They can be separated into two classes: parameters that set a model's characteristics and parameters that adjust a model's behavior. An example of the first type is the model's objective, which sets the type of the prediction variable. Obviously, a binary classification task will have a different output than a numerical prediction. The second set of parameters manages the training process. For example, the learning rate, usually called eta, scales the contribution of each learning step and thus helps prevent overfitting.

This article focuses on two specific parameters that appear to be entangled and might cause some confusion: the objective and the booster.

For more information on the full set of parameters see the official XGBoost documentation.


Short overview of two XGBoost parameters: booster and objective

  • The booster parameter sets the type of learner. Usually this is either a tree or a linear function. In the case of trees, the model will consist of an ensemble of trees. For the linear booster, it will be a weighted sum of linear functions.
  • The objective determines the learning task and thus the type of the target variable. The available options include regression, logistic regression, binary and multiclass classification, and ranking. This option makes it possible to apply XGBoost models to several different types of use cases. The default value is “reg:squarederror” (previously called “reg:linear”, a confusing name that was therefore changed; see the documentation for details).
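The split between the two parameters can be sketched as two hypothetical Python parameter sets (the keys "booster" and "objective" and their values are real XGBoost options; everything else here is illustrative):

```python
# Two hypothetical parameter sets for the same regression task:
# only the booster differs, the objective is fixed by the task.
params_tree = {
    "booster": "gbtree",             # learner: ensemble of decision trees
    "objective": "reg:squarederror"  # default regression objective
}
params_linear = {
    "booster": "gblinear",           # learner: weighted sum of linear functions
    "objective": "reg:squarederror"  # same objective: set by the task, not the booster
}

# Either dict would be passed unchanged to xgboost.train(params, dtrain).
assert params_tree["objective"] == params_linear["objective"]
```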

It should be noted that the objective is independent of the booster. Decision trees are not only able to perform classification tasks but can also predict continuous variables, with a certain granularity, within the data input range used in the training.

Thus, the objective is always determined by the modeling task at hand, while the two common booster choices can be valid for the same problem.


Visualizing different boosters

To illustrate the differences between the two main XGBoost booster variants, a simple example will be given, in which both the linear and the tree booster are used for a regression task. The analysis is done in R with the “xgboost” library for R.

In this example, a continuous target variable will be predicted. Thus, the correct objective is “reg:squarederror”. The two main booster options, gbtree and gblinear, will be compared.

The dataset is designed to be simple. The input parameter x is a continuous variable, ranging from 0 to 10. No noise is added to keep the task easy. The target variable is generated from the input parameter:


The training data is chosen to be a subset of the full dataset, by selecting the two subranges [1:4] and [6:9]. This is illustrated in the figure below (yellow data points). This makes it possible to test how well the model behaves on unseen data.

With this training data, two XGBoost models are generated, m1_gbtree with the gbtree booster and m2_gblinear with the gblinear booster. The trained models are then used to generate predictions for the whole data set.

              RMSE (full data)   MAE (full data)   RMSE (train data)   MAE (train data)
m1_gbtree           4.91              2.35               0.05                0.03
m2_gblinear         7.74              6.39               4.50                3.89

The predictions for the full dataset are shown in the plot above along with the full dataset. The first model, which uses trees, predicts the training data well in those regions where the model was supplied with training data. However, in the outer regions (x<1 and x>9) as well as in the central region (4<x<6), discrepancies arise. The tree-based model replicates the prediction of the closest known datapoint, thus generating horizontal lines. This is always the case when trees are used for continuous predictions: no formula is learned that would allow for inter- or extrapolation.
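This “nearest known value” behavior can be mimicked in a few lines of plain Python (a toy stand-in, not XGBoost itself; all names and numbers are illustrative):

```python
# Toy illustration: outside the training range, a tree-like regressor
# replicates the value of the nearest known datapoint, producing the
# horizontal lines described above instead of extrapolating.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [1.0, 4.0, 9.0, 16.0]   # hypothetical non-linear training data

def tree_like_predict(x):
    # nearest-training-point lookup: there is no formula to extrapolate with
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

print(tree_like_predict(0.0))   # 1.0  -> clamped to the leftmost known value
print(tree_like_predict(10.0))  # 16.0 -> clamped to the rightmost known value
```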

The second model uses a linear function for each learner in the gradient boosting process. The weighted combination of these learners is still a linear function. This explains the model’s behavior: The predictions follow a linear curve rather than the non-linear behavior of the data.

When looking at the metrics for the full dataset, the tree-based model shows a lower RMSE (4.9 versus 7.7) and MAE (2.4 versus 6.4) than the linear model. It should be noted that the other hyperparameters of the models were not tuned; as a result, these numbers do not necessarily reflect the optimum. Nevertheless, they show how poorly the models perform on the full dataset. The metrics that only consider training data reflect the differences in the modeling. The tree-based model represents the training data well, while the linear model does not. This is due to the fact that the dependency of the target variable on the input variable is non-linear.

Can the models be improved if a non-linear variable is supplied? As a test, each model is trained on modified input data, which is based on the original input variable as well as a new variable x_int = x².

The new variable contains the interaction term, which causes the non-linear behavior in the first place.

Additionally, a simple linear regression model is added for comparison, with and without interaction term.

                  RMSE (full data)   MAE (full data)   RMSE (train data)
m1_gbtree               4.91              2.35               0.05
m6_gbtree_int           4.91              2.35               0.05
m2_gblinear             7.74              6.39               4.50
m3_gblinear_int         0.00              0.00               0.00
m4_lin_reg              7.74              6.39               4.50
m5_lin_reg_int          0.00              0.00               0.00

From this we can learn a few things:

Firstly, the interaction term significantly improves the linear models. They show perfect agreement with the full dataset. Here, the regression function exactly models the true relation between input and target variables. In addition, this trained function allows the model to extrapolate well to unseen data.
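This effect can be reproduced with a minimal ordinary-least-squares fit in plain Python (hypothetical data, assuming the true relation is quadratic): regressing y on the engineered feature x_int = x² turns the non-linear problem into an exactly linear one.

```python
# Hypothetical quadratic data with a gap in the middle.
xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
ys = [x ** 2 for x in xs]          # assumed relation: y = x**2
x_int = [x ** 2 for x in xs]       # engineered interaction/quadratic feature

# Closed-form ordinary least squares for y = a + b * x_int.
n = len(xs)
mean_x = sum(x_int) / n
mean_y = sum(ys) / n
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x_int, ys)) / \
    sum((xi - mean_x) ** 2 for xi in x_int)
a = mean_y - b * mean_x

print(round(a, 10), round(b, 10))  # 0.0 1.0 -> the fit recovers y = x**2 exactly
```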

Secondly, the tree-based model did not improve by including the interaction term. This can be explained by considering again how trees work for a regression task. A tree splits the input space of the training data into fine categories, which are represented by its leaves. The prediction value for each leaf is learned from the target variable, thus the target variable is discretized. Adding more input variables refines the splitting of the input space. In this example, the original input variable x is sufficient to generate a good splitting of the input space, and no further information is gained by adding the new input variable.

Finally, the linear booster of the XGBoost family shows the same behavior as a standard linear regression, with and without the interaction term. This might not come as a surprise, since both models optimize a loss function for a linear regression, that is, they reduce the squared error. Both models should converge to the optimal result, which should be identical (though maybe not in every last digit). This comparison is of course only valid when using the objective “reg:squarederror” for the XGBoost model.



In this article, the two main boosters gblinear and gbtree of the XGBoost family were tested with non-linear and non-continuous data. Both boosters showed conceptual limits regarding their ability to extrapolate or handle non-linearity. Tree-based models can represent all types of non-linear data well, since no formula describing the relation between target and input variables is needed. This is an enormous advantage if these relations and interactions are unknown. Linear models, on the other hand, cannot learn relations other than purely linear ones. If the missing interactions can be supplied as additional inputs, linear models become quite powerful.

The second aspect considered the fact that the training data does not always cover the full data range of the use case. Here, the model needs to inter- or extrapolate from known datapoints to the new regions. In the case of trees, no formula is available that would allow the model to navigate these areas and provide meaningful prediction values. In contrast, this is the main advantage of linear regression models – if the same assumptions can be applied to the new data. In other words, if the new data behaves in the same way.


Do you have any questions? Simply send an email to:

Date: December 2019
Author: Verena Baussenwein
© 2019 avato consulting ag
All Rights Reserved.

Handicaps of easy-to-use machine learning

Nowadays machine learning is everywhere. The various types of models are applied to a wide variety of applications: They are used in classical predictive analysis, but also for image and speech recognition, in playing games like Jeopardy!, Go or World of Warcraft and are the backbone of autonomous driving cars.

Many different types of models are available now, from simple linear regression to boosted decision trees and various types of neural networks. Some of the models are dedicated to a specific task, for example word2vec for text processing, while others can easily be applied to all kinds of problems, like boosted decision trees.

The huge success of machine learning models over the past years has made them incredibly popular, and they have gained a lot of interest not only inside the data science community but in almost any context that has to do with data. Part of this success comes from the fact that machine learning is easy to use. The code is usually openly available and comparatively little computing power is required for the basic tasks. Thus, a chunk of data together with a regular notebook and a dedicated toolkit like R or Python is sufficient to build a machine learning model.

The algorithms themselves are wrapped in convenient functions within the available machine learning libraries and many options exist to automate the different steps in the training. This has strongly simplified the full process. Only a few lines of code are needed and out comes a fancy machine learning model. The algorithms and sometimes even the data preparation and the application itself have become a black box, which tempts us to apply the methods without thinking much.

In addition, the model performance is usually evaluated by certain metrics like the RMSE, which boil the prediction quality down to a few numbers. Since those metrics are generally model independent and the data is usually quite complex, fully understanding and assessing the model outcome has become a challenge.

In this context, one might simply pick the model with the best metrics and implement it in the foreseen application. However, the model might perform poorly on new data. How did this happen?

In the following, three potential obstacles that might lead to this scenario will be introduced.


Data range

Let’s start with an obvious example from image classification to explain the problem at hand. An algorithm well suited for this task has been trained to identify images with animals present. However, the training data only consisted of cats and dogs. The full data will also include other animals like birds or fish. It is quite clear that the algorithm will not perform well, since the range of the test data vastly exceeds the range of the training data.

However, in most cases the discrepancy between training data and new data might be less obvious. Consider for example a numeric dataset with two variables x (input variable) and y (target variable). The goal is to model the linear dependence of the two variables by predicting the value of y for a given x. The dataset consists of datapoints in a range of x between 10 and 40 and between 60 and 80 (see the plot below, yellow data points).

A regression model has been trained on this data. What is the model's behavior for new data that falls within the gap, from 40 to 60? The outcome strongly depends on the model. Some models might be capable of interpolating well, others might return senseless predictions. A regression model may be able to make good predictions by applying the trained formula, which might be valid in this region. A decision tree, on the other hand, has not learned anything about the behavior outside the range of the training data and simply predicts the value of the datapoint that is closest to the new data point. Unfortunately, no information on the reliability of the individual predictions is provided out of the box, for example an indication of whether the new data point lies within the training data's range.
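To make the contrast concrete, here is a minimal sketch of the gap scenario with an assumed truly linear relation (all numbers hypothetical): a fitted regression formula keeps working inside the 40-to-60 gap, because the trained formula remains valid there.

```python
# Hypothetical training data: x in [10, 40] and [60, 80], linear ground truth.
train_x = list(range(10, 41)) + list(range(60, 81))
train_y = [2.0 * x + 5.0 for x in train_x]   # assumed relation: y = 2x + 5

# Closed-form simple linear regression (ordinary least squares).
n = len(train_x)
mx = sum(train_x) / n
my = sum(train_y) / n
slope = sum((x - mx) * (y - my) for x, y in zip(train_x, train_y)) / \
        sum((x - mx) ** 2 for x in train_x)
intercept = my - slope * mx

# Prediction inside the 40-60 gap: the learned formula still applies.
print(round(intercept + slope * 50, 6))  # 105.0
```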


Data Quality

The second example is a rare event classification task. A classification algorithm is trained to distinguish between two classes, A and B. One class, A, is abundant, while the other, B, is quite rare. This example might come from credit card fraud detection, where a few illegal transactions (here class B) are embedded in data, that is mostly normal (class A). Let’s say that 2% of all events are fraud.

A corresponding model has been trained on this type of data and is applied to unseen data. It does very well on classifying new events of type A. However, it does not recognize any event of class B. Instead, they are falsely regarded as class A as well. The model shows an accuracy of 98%, i.e. the share of all predictions that are correct. At first glance, this number does not sound too bad. However, the recall for class B events is as low as 0, since none of these events has been labeled correctly.
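The arithmetic behind this can be checked in a few lines of plain Python (illustrative numbers, matching the assumed 2% fraud rate): a degenerate classifier that always predicts the majority class still reaches 98% accuracy while catching zero class B events.

```python
# 100 events: 98 normal (A), 2 fraud (B); the model always predicts "A".
truth = ["A"] * 98 + ["B"] * 2
preds = ["A"] * 100

accuracy = sum(t == p for t, p in zip(truth, preds)) / len(truth)
b_found = sum(1 for t, p in zip(truth, preds) if t == "B" and p == "B")
recall_b = b_found / truth.count("B")    # share of B events actually detected

print(accuracy)  # 0.98
print(recall_b)  # 0.0
```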

One of the reasons for this might lie in the quality of the training data, which might not be good enough. This can be the case if the data is especially noisy and imbalanced. The true pattern that would allow the model to distinguish the few rare events of class B from the large amount of class A data is invisible within the general noise. Thus, the training data does not represent the task at hand well. The only solution is to improve the data quality, either by collecting more events that are better distinguishable or by cleaning the data and trying to reduce the noise.


Performance metrics

The usual metrics, which are calculated to assess a model's performance, are the RMSE (root mean squared error) and the MAE (mean absolute error):

RMSE = √( (1/N) · Σ (yᵢ − ȳᵢ)² )        MAE = (1/N) · Σ |yᵢ − ȳᵢ|

N denotes the number of data points, y the target variable and ȳ the prediction.
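Written out in Python, the two metrics look as follows (a direct transcription of the definitions above, with a small made-up example):

```python
import math

# y: target values, yhat: predictions, N = len(y) data points.
def rmse(y, yhat):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))

def mae(y, yhat):
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)

print(rmse([1, 2, 3], [1, 2, 5]))  # ≈ 1.1547 (= sqrt(4/3))
print(mae([1, 2, 3], [1, 2, 5]))   # ≈ 0.6667 (= 2/3)
```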

In the following example, two models are fitted to a given dataset in which the target variable y fluctuates around zero. The first model predicts the average of all datapoints, which is 0. The second model represents a sine curve.

For both models, the resulting RMSE and MAE values are identical. However, the models are far from identical. Which model describes the data correctly? Without more information on the data, this is not clear at all. If the data fluctuates for a reason, like a temperature measured once during daytime and once every night, then the first model does not capture this. If the fluctuations are totally random, then the second model has clearly overfitted the data.



These examples highlighted only a few of the pitfalls that come with black-box modeling. Obviously there are many more, for example the sometimes large set of hyperparameters that comes along with most models and provides different tunes or variants of the original model. Choosing the correct set of parameters is not intuitive and might even lead to the wrong model.

As a summary, the following guidelines may help to avoid a few of the shortcomings of poor modeling.

  1. Training data coverage: The training data should cover the full input space that is to be expected for the use case. Otherwise, the model should be able to extrapolate or interpolate well to the unknown regions.
  2. Data Quality: Models, which are supposed to perform on noisy and/or unbalanced data, can be much improved, if the data is cleaned beforehand. This could include outlier removal, smoothing (noise) or resampling (imbalanced data).
  3. Model choice: While choosing the type of model, the model’s basic assumptions should be considered. For example, a linear model, like regression, is only valid to model a linear dependency between input and target variable. It will not consider interactions between different input variables.
  4. Hyperparameter tuning: The model's own parameters can significantly influence the performance. For example, overfitting can be prevented by adjusting a certain parameter. Hence, tuning the hyperparameters might significantly improve a model's quality. Unfortunately, this task requires a fair amount of knowledge about the hyperparameters and a lot of computing power and time to test as many combinations as possible.
  5. Crosschecking the results: It is crucial to not only rely on the main performance metrics but also to have a look at the predictions. Residuals and time series plots can be immensely helpful.

And as always, a careful and skeptical look at everything is a good starting point.

Do you have any questions? Simply send an email to:

Date: December 2019
Author: Verena Baussenwein
© 2019 avato consulting ag
All Rights Reserved.

avato Smart Data Method – Procedural Model for Smart Data Projects (Whitepaper)

Why Smart Data?

Big Data, Advanced Analytics, Industry 4.0, Internet of Things, self-learning machines – all these topics offer plenty of substance for fascinating new ideas. These may be more efficient processes, new or optimised products and services or entire new business models.

The use of these new tools and technologies – in whatever form – is now within reach and often necessary for most businesses of any size and in any sector.

avato Service Offering

avato offers clients a comprehensive service range to create innovations and relevant business results from available data – all under the header of Smart Data. More details on these topics and some use cases are included in this white paper. It describes a procedural model for the initiation of Smart Data projects and offers pointers for key success factors as well as problem areas observed in practical use cases.

Many Types of Smart Data Projects

Smart data projects cover a broad spectrum of topics. They can be triggered by concrete issues that require data-based solutions or may entail strategic initiatives under the header Industry 4.0, Big Data, the use of Artificial Intelligence etc. Project objectives may vary – from the development of ideas for Smart Data implementation and strategic roadmaps to prototypes or go-lives for individual use cases, to the development of comprehensive Smart Data and IT architectures, including their structural and procedural organisation. There is one aspect that all smart data projects have in common, no matter their specific trigger, objective or proposed scope: they all need a well-structured and systematic approach to be successful.

The avato Smart Data Procedural Model

Almost all of the traditional methods for projects in the area of Big Data, Advanced Analytics, Data Science, etc. focus on the actual data analysis process and are mostly based on the CRISP-DM model, which was designed for exactly that purpose and is well-suited for it. A practice-oriented and structured approach must, however, also take into account other aspects that are essential for success. These specifically include business, IT, data governance and security aspects, as well as appropriate project and change management in addition to data-analytical aspects. All that must then be viewed against the background of the ‘Smart Data maturity’ of the business and the powerful dynamics across the entire market.

avato has developed a Smart Data procedural model that considers and includes all these aspects. It delivers a structured framework, which – depending on the initial situation, the project objectives and the various basic conditions – is then adapted to the client and project situation and translated into a project plan.

We recommend proceeding in the following logical phases:

The beginning is marked by the structured development of a plan (alignment & discovery). Next comes the implementation of one or more application scenarios – often also following feasibility studies or prototypes (proof of concept) – including a go-live deployment and subsequent operations and optimisation phases. Another important component is an appropriate project and change management right from the get-go. Experience shows that Smart Data projects often initially produce unexpected challenges. They require the close cooperation of people from areas that traditionally have little or no contact and may be accustomed to sometimes very different work methods and ‘languages’. The fear of far-reaching change in the wake of these projects among employees should not be underestimated.


Success-Critical Expert Domains & Project Roles

Smart Data projects require teamwork. Business and process expertise, data science and data management expertise, IT expertise for the existing business IT and special Advanced Analytics IT topics must all be represented. As with any interdisciplinary team projects of some complexity, project managers are also needed who can motivate a team and provide structured leadership. Depending on the objective and environment, change management experts may also play an important role.

avato offers the specialist expertise you need for Smart Data projects:

Data engineers and data scientists obviously play central roles, but business consultants, IT architects and developers do too, and even project managers need special expertise for Smart Data projects. Physicists, IT experts and business data processing specialists form the core of our avato Smart Data team. We expect our clients to provide a project lead in addition to a project sponsor from their management team, expertise in the relevant business or production process, and the involvement of relevant in-house IT experts. We will then put together a project- and task-specific avato team in close collaboration with you to fill the required project roles with high-quality resources. The senior level of our consultants keeps the team manageable, while increasing its effectiveness and efficiency.

Organisational Aspects

The organisational handling of responsibilities and processes with regard to master data, data security, privacy, data quality, etc. varies from client to client. With the rise of Smart Data, the importance of these topics will continue to increase and will – at least in the mid-term – require organisational adjustments.

A proof of concept for some Predictive Analytics use case scenarios will certainly not require organisational changes within the company with regards to data organisation. For strategic initiatives and the increasing use of Big and Smart Data, however, an early discussion of how a company should react to the continuously growing importance of data and associated internal and external requirements in terms of organisational changes would be prudent.

New specialist roles like data engineers and data scientists must be integrated into the organisation in a way that allows them to add their value efficiently. Last but not least, the IT organisation will be facing new tasks – specifically in the operation of new Smart Data applications.

We will help you to find solutions to suit your specific situation.

avato Smart Data – Procedural Model for Smart Data Projects (PDF)
The avato Smart Data process model. For the complete version of the white paper please download the PDF.
Please send us an e-mail for questions or feedback:


Date: November 2019
Author: Wolfgang Ries
© 2019 avato consulting ag
All Rights Reserved.