Handicaps of easy-to-use machine learning

Nowadays, machine learning is everywhere. The various types of models are applied to a wide variety of tasks: they are used in classical predictive analytics, but also for image and speech recognition, for playing games like Jeopardy!, Go or World of Warcraft, and they form the backbone of self-driving cars.

Many different types of models are available today, from simple linear regression to boosted decision trees and various kinds of neural networks. Some models are dedicated to a specific task, for example word2vec for text processing, while others, such as boosted decision trees, can easily be applied to all kinds of problems.

The huge success of machine learning models over the past years has made them incredibly popular. They have gained a lot of interest not only inside the data science community but can be found in almost any context that has to do with data. Part of this success comes from the fact that machine learning is easy to use: the code is usually openly available and comparatively little computing power is required for basic tasks. A chunk of data, a regular notebook and a dedicated toolkit like R or Python are thus sufficient to build a machine learning model.

The algorithms themselves are wrapped in convenient functions within the available machine learning libraries, and many options exist to automate the different steps of the training. This has greatly simplified the whole process: only a few lines of code are needed (see the sketch below) and out comes a fancy machine learning model. The algorithms, and sometimes even the data preparation and the application itself, have become a black box, which tempts us to apply the methods without thinking much.
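As a minimal sketch of how little code is needed, the following Python example trains and evaluates a boosted-tree model on a toy dataset; the dataset, the model choice and the use of scikit-learn are assumptions made purely for illustration.

# Minimal sketch: a "fancy" model in a handful of lines (toy data, assumed setup)
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# toy data instead of a real business dataset
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingRegressor().fit(X_train, y_train)          # training: one line
rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5    # evaluation: one line
print(f"RMSE on the test set: {rmse:.2f}")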

In addition, the model performance is usually evaluated by certain metrics like the RMSE, which boil the prediction quality down to a few numbers. Since those metrics are generally model-independent and the data is usually quite complex, fully understanding and assessing the model outcome has become a challenge.

In this context, one might simply pick the model with the best metrics and implement it in the intended application. However, the model might then perform poorly on new data. How can this happen?

In the following, three potential obstacles are introduced that might lead to exactly this scenario.

 

Data range

Let’s start with an obvious example from image classification to explain the problem at hand. An algorithm well suited for this task has been trained to identify images that contain animals. However, the training data only consisted of cats and dogs, while the full data also includes other animals like birds or fish. It is quite clear that the algorithm will not perform well, since the range of the new data vastly exceeds the range of the training data.

However, in most cases the discrepancy between training data and new data is less obvious. Consider for example a numeric dataset with two variables, x (input variable) and y (target variable). The goal is to model the linear dependence of the two variables by predicting the value of y for a given x. The dataset consists of data points with x between 10 and 40 and between 60 and 80 (see the plot below, yellow data points).

A regression model has been trained on this data. What is the model’s behavior for new data that falls within the gap, from 40 to 60? The outcome strongly depends on the model. Some models might be capable of interpolating well, others might return senseless predictions. A regression model may well make good predictions by applying the trained formula, which might still be valid in this region. A decision tree, on the other hand, has not learned anything about the behavior outside the range of the training data and simply predicts the value of the leaf that is closest to the new data point. Unfortunately, no information about the reliability of the individual predictions is provided out of the box, for example an indication of whether a new data point lies within the range of the training data.
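A small sketch on made-up data illustrates the difference; the gapped input range matches the example above, but the linear relation, the noise level and the scikit-learn models are assumptions.

# Sketch: how a linear model and a decision tree behave inside the data gap
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# training data only in [10, 40] and [60, 80], assumed linear relation y = 2x + noise
x_train = np.concatenate([rng.uniform(10, 40, 200), rng.uniform(60, 80, 200)])
y_train = 2 * x_train + rng.normal(0, 5, x_train.size)
X_train = x_train.reshape(-1, 1)

linreg = LinearRegression().fit(X_train, y_train)
tree = DecisionTreeRegressor(max_depth=3).fit(X_train, y_train)

# new data inside the gap (40 .. 60) that neither model has ever seen
X_gap = np.linspace(40, 60, 5).reshape(-1, 1)
print("linear regression:", linreg.predict(X_gap).round(1))  # follows the trained line
print("decision tree:    ", tree.predict(X_gap).round(1))    # piecewise-constant leaf values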

 

Data quality

The second example is a rare-event classification task. A classification algorithm is trained to distinguish between two classes, A and B. One class, A, is abundant, while the other, B, is quite rare. Such data might come from credit card fraud detection, where a few illegal transactions (class B) are embedded in mostly normal data (class A). Let’s say that 2% of all events are fraud.

A corresponding model has been trained on this type of data and is applied to unseen data. It does very well on classifying new events of type A. However, it does not recognize any event of class B; instead, they are all falsely labeled as class A. The model still shows an accuracy of 98%, i.e. 98% of all predictions are correct, which does not sound too bad. However, the recall for class B is 0, since none of these events have been identified correctly, and the precision for class B is just as meaningless, because the model never predicts class B in the first place.
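A short sketch with made-up labels (and scikit-learn’s metric functions) reproduces this effect: a model that always predicts the majority class reaches 98% accuracy while its recall for the rare class is exactly zero.

# Sketch: 98 % accuracy from a model that simply predicts class "A" for everything
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array(["A"] * 980 + ["B"] * 20)   # 2 % rare class B (e.g. fraud)
y_pred = np.array(["A"] * 1000)               # the model never predicts B

print("accuracy:      ", accuracy_score(y_true, y_pred))               # 0.98
print("recall class B:", recall_score(y_true, y_pred, pos_label="B"))  # 0.0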

One of the reasons for this might lie in the quality of the training data, which may simply not be good enough. This can be the case if the data is especially noisy and imbalanced: the true pattern that would allow the few rare events of class B to be distinguished from the large amount of class A data is invisible within the general noise. The training data then does not represent the task at hand well. The only solution is to improve the data quality, either by collecting more events that are better distinguishable, or by cleaning the data and trying to reduce the noise.

 

Performance metrics

The usual metrics calculated to assess a model’s performance are the RMSE (root mean squared error) and the MAE (mean absolute error):

RMSE = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 }

MAE = \frac{1}{N} \sum_{i=1}^{N} \lvert y_i - \hat{y}_i \rvert

Here N denotes the number of data points, y_i the target variable and ŷ_i the prediction.

In the following example, two models are compared on a given dataset in which the target variable y fluctuates around zero. The first model predicts the average of all data points, which is 0. The second model is a sine curve.

Both models yield exactly the same RMSE and MAE on this dataset, yet they are far from identical. Which model describes the data correctly? Without more information on the data, this is not clear at all. If the data fluctuates for a reason, for example a temperature measured once during daytime and once every night, then the first model does not capture this at all. If the fluctuations are totally random, then the second model has clearly overfitted the data.
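A constructed numeric example (the alternating values below are an assumption, chosen only to make the effect visible) shows how a flat model at the mean and a sine-like curve through the data can share exactly the same RMSE and MAE while behaving completely differently.

# Sketch: two very different models with identical RMSE and MAE
import numpy as np

x = np.arange(100)
y = np.where(x % 2 == 0, 1.0, -1.0)      # measurements alternating around zero

pred_mean = np.zeros_like(y)             # model 1: always predicts the average (0)
pred_sine = 2 * np.cos(np.pi * x)        # model 2: a sine-like curve through the data

for name, pred in [("mean model", pred_mean), ("sine model", pred_sine)]:
    rmse = np.sqrt(np.mean((y - pred) ** 2))
    mae = np.mean(np.abs(y - pred))
    print(f"{name}: RMSE = {rmse:.2f}, MAE = {mae:.2f}")   # both print 1.00 / 1.00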

 

Conclusion

These examples highlighted only a few of the pitfalls that come with black-box modeling. Obviously there are many more, for example the sometimes large set of hyperparameters that comes along with most models and provides different tunes or variants of the original model. Choosing the right set of parameters is not intuitive and might even lead to the wrong model.

In summary, the following guidelines may help to avoid some of the shortcomings of poor modeling.

  1. Training data coverage: The training data should cover the full input space that is to be expected in the use case. If it does not, the model must be able to interpolate or extrapolate well into the unknown regions.
  2. Data quality: Models that are supposed to perform on noisy and/or imbalanced data can be much improved if the data is cleaned beforehand. This may include outlier removal, smoothing (for noise) or resampling (for imbalanced data).
  3. Model choice: When choosing the type of model, the model’s basic assumptions should be considered. For example, a linear model such as linear regression can only describe a linear dependency between input and target variable, and it will not capture interactions between different input variables unless they are added explicitly.
  4. Hyperparameter tuning: The model’s own parameters can significantly influence the performance; for example, overfitting can often be limited by adjusting parameters such as a maximum tree depth or a regularization strength. Hence, tuning the hyperparameters might significantly improve a model’s quality (see the sketch below). Unfortunately, this task requires a fair amount of knowledge about the hyperparameters, plus computing power and time to test as many combinations as possible.
  5. Cross-checking the results: It is crucial not to rely only on the main performance metrics but also to have a look at the predictions themselves. Residual and time series plots can be immensely helpful.
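As a sketch for point 4, hyperparameters can be tuned systematically with a cross-validated grid search; the model, the toy data and the parameter grid below are assumptions chosen purely for illustration.

# Sketch: cross-validated grid search over a small, assumed parameter grid
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

param_grid = {
    "max_depth": [2, 3, 4],          # deeper trees fit more, but overfit faster
    "learning_rate": [0.05, 0.1],
    "n_estimators": [100, 300],
}
search = GridSearchCV(GradientBoostingRegressor(random_state=0),
                      param_grid, cv=5, scoring="neg_root_mean_squared_error")
search.fit(X, y)
print("best parameters:", search.best_params_)
print("best CV RMSE:   ", -search.best_score_)

Cross-validation keeps the tuning honest: every parameter combination is evaluated on data it was not trained on.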

And as always, a careful and skeptical look at everything is a good starting point.

Do you have any questions? Simply send an email to: marketing@avato.net

Imprint: 
Date: December 2019
Author: Verena Baussenwein
Contact: marketing@avato.net
www.avato-consulting.com
© 2019 avato consulting ag
All Rights Reserved.

10 years avato consulting ag

Founded in Alzenau in 2007 and still based there today, avato has stood for innovation and know-how in data and database technology for a whole decade. Despite the financial crisis, avato quickly gained momentum as a start-up and can still report steady growth today. With more than 70 consultants, avato offers professional consulting in international project environments. In particular in the areas of Smart Data, IT Information Management and Expertise-as-a-Service, competences will be expanded and international projects carried out in the future. This is also about bringing people from different countries together and offering them an exciting perspective.

Fair Company? avato!

We are proud that we have received a mention for being a Fair Company. To receive this mention we committed ourselves to the following:

  • We will not replace full-time jobs with apprenticeships.
  • We will not put graduates off with an apprenticeship if they have applied for a permanent employment position.
  • We offer apprenticeships primarily for education/training.

You will find more information regarding avato as a Fair Company here.

https://www.faircompany.de/unternehmen/profil/c/avato-consulting-ag

Best Practices in the Exadata Life Cycle (PDF)

Whether as a basis for large and transaction-rich OLTP systems, as a Data Warehouse or as a consolidation platform: more and more often, Exadata is being used as a central part of the IT infrastructure. For the first time we have written a detailed whitepaper about practical experiences and recommendations for each phase of the Exadata life cycle.


Industrial Software Integration (Whitepaper)

To ensure industrialization of an enterprise IT environment runs smoothly, the ideal starting point is right at the end of the application life cycle.

Shorter time-to-market, higher quality, lower cost: companies that industrialize their software integration activities can achieve total savings of up to 50 percent. Organizations with heterogeneous IT environments that have grown over the years can benefit particularly. But these companies also need to be especially careful when adapting their processes to new approaches so as to ensure that they remain fully functional at all times.

Build and deployment in particular is a central process, with links to every other operation in software development. So it is software integration, positioned late in the application life cycle, that is the ideal starting point for introducing standards and defining appropriate workflows. Optimizations are introduced with every new software release, gradually expanding the coverage of standardization. These standards lay the foundations for subsequent industrialization activities, with a view to implementing requirements such as automation, traceability, revision security (auditability) and reproducibility of processes within internal IT projects, or to reducing vertical integration.

Optimum Strategies

In practice, certain measures almost always prove useful for improving efficiency, regardless of the initial situation. These include the introduction of fixed release deadlines and implementation of development and software integration standards. Virtualization of integration and test environments, harmonization of infrastructures and implementation of automation are also worthwhile for practically every organization.

Nevertheless, it is a good idea to analyze the individual context carefully before beginning. Which operating systems are currently in use? Which application servers and third-party tools? Who develops what software, and who integrates it? Precise information about cost drivers and their location in the process structure is needed, as are details of possible reservations from stakeholders – because an optimization strategy is only right if it reflects the real starting position.

Leverage Economies of Scale

Economies of scale often result from the introduction of technology standards: the client-server environment in particular is frequently host to an astonishing variety of hardware and operating systems. Consolidating them reduces maintenance and expansion outlay significantly. Application servers and queuing are other good places to start: in these areas, two technologies – and in the case of queuing, often just one – are usually sufficient to cover all requirements. Specifying clear requirements for application development at this stage simplifies subsequent integration significantly, and, because costly installation, configuration and maintenance expertise is required for two systems at most, saves on staff costs in particular.

There is one area offering even greater scope for software integration optimization than technology, however, and that is processes. For example, avato consulting has often seen skillful standardization and automation reduce the build phase to around a third of the original duration. With industrialized processes, integration of a release package can be achieved in a quarter of the previous time – and because the entire process is broken down into clear steps, it is easier to identify any problems at an early stage. By skillfully combining all options for optimization, in most cases the number of deployments can be reduced more than tenfold without increasing outlay.

Only One Way In

The starting point for – and possible beneficiary of – process standardization is the interface via which the software enters the company. Regardless of whether applications are developed by external service providers or an in-house team, full transparency and complete control are only possible when the path of the source code into the software integration process is clearly defined. Organizations which outsource their application development can derive particular benefit from defined processes and infrastructure standards at this point, because they enable releases to be supplied in easy-to-integrate installation packages. If required, the user organization can outsource the build process and even source code control and storage to the supplier, meaning that it no longer needs to maintain control, storage and personnel resources of its own.

On the other hand, organizations that develop their software internally should keep a close eye on configuration management. A standardized process in this area, simply and clearly structured, is not only essential to permit the use of time-saving parallel development branches; it is also an indispensable foundation for automation of downstream processes. Application of the “development in head” concept in configuration management is generally particularly effective.

One Package for All

Organizations which convert their software into installable packages themselves instead of outsourcing this can also increase their internal efficiency by creating just one package which can be installed in all target environments. Various structural aspects must be attended to in order to allow the software to be used in system testing, integration testing and production environments without creating a new package for each, which incurs additional outlay and risks errors.

Environment and software parameters are used to customize the package to various environments without changing the software itself. The necessary information is supplied in documented form on delivery and will ideally be processed automatically to save time and prevent errors. Clearly structured and standardized data are essential for containing parameterization costs – the simple process usually pays off within a few weeks.
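As a simplified sketch of this idea, one and the same delivery package can be customised per target environment purely through documented parameter files; the file names, keys and the use of Python’s standard library below are assumptions for illustration.

# Sketch: one package, one template, one parameter file per environment
import configparser
from pathlib import Path
from string import Template

def render_config(template_file: str, env: str) -> str:
    """Fill a deployment template with the parameters of one target environment."""
    params = configparser.ConfigParser()
    params.read(f"parameters_{env}.ini")                # e.g. parameters_test.ini, parameters_prod.ini
    template = Template(Path(template_file).read_text())
    return template.substitute(params["deployment"])    # same template for every environment

# usage: identical package content, only the parameter file differs
# print(render_config("datasource.xml.tpl", "prod"))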

Consider the Time Factor

Configuration information for the underlying technologies, such as application servers and queuing, is also supplied in standardized format. The question of time should be considered here, particularly when extensive changes are required in advance of a deployment: the integration team will need this information at least one to two weeks in advance of the actual software integration. This is where infrastructure consolidation in particular really pays off, because workloads are multiplied when every technology needs to be configured for multiple environments. 
Once software and configuration have been standardized, the way is clear for the creation of an installable package. Here too it is a good idea to stick to a small number of systems: most build tools specify a workflow and integration standards, and these should be the same for every release. A single tool is generally sufficient to meet all requirements. Build scripts should also be version controlled – not least because this ensures revision security. Versioning also allows fixes for production to be passed on to new releases automatically and ensures transparency at all times, even when working with multiple parallel development branches.

Outsourcing for Greater Efficiency

Deployment can be standardized using a similar process as for the build process: a single universal installation routine unites all the central steps required for deployment. This overlay component is the same for every project and is integrated into every software package during the build process. Not only is this approach less time-consuming and less susceptible to error than a process of manual customization for each individual application – any changes are also implemented to the central overlay only, from where they are incorporated automatically into every package at the next build.

A further automated component in a well-engineered deployment process is a module containing check mechanisms, which is added to the overlay to check items such as the availability of application servers and target containers, databases, individual target schemas and the required queues. Then, if errors occur during delivery despite successful testing, their source is already localized: it probably lies somewhere in the infrastructure. It is also a good idea, both directly following deployment and at regular intervals during runtime, to run automated tests on all items that are important for the application and to generate detailed reports that provide continuous documentation of the functionality of the infrastructure and of important application components.
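A sketch of such a check module might look as follows; the hosts, ports and component names are purely hypothetical, and a real module would verify target schemas and queues in a similar fashion.

# Sketch: pre-deployment reachability checks for the required infrastructure
import socket

CHECKS = {
    "application server": ("appserver.example.com", 8080),
    "database":           ("db.example.com", 1521),
    "message queue":      ("mq.example.com", 61616),
}

def infrastructure_ok(checks=CHECKS, timeout=3.0) -> bool:
    ok = True
    for name, (host, port) in checks.items():
        try:
            with socket.create_connection((host, port), timeout=timeout):
                print(f"[ OK ] {name} at {host}:{port}")
        except OSError:
            print(f"[FAIL] {name} at {host}:{port} not reachable")
            ok = False
    return ok   # deployment is only started if all checks pass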

Automate Wisely

The ultimate objective of all these standardization activities is to achieve an appropriate degree of automation, as it is this that will bring the most significant productivity gains – but only if all parties involved, external suppliers as well as internal development and support units, comply with agreed standards. Therefore, for management and control purposes, it is advisable to implement a single ticketing and tracking tool. This tool manages workflows for all technical processes across all teams and departments, guarantees transparency of the associated data and ensures all parties involved comply with prescribed procedures. If standards are in place, software integration offers numerous starting points for automation, from the software acceptance process via build and deployment to automated testing.

Efficient Acceptance

An example of a process that can be almost completely automated is software acceptance. An internal or external supplier provides its artifacts to the customer in a defined condition and at the required location. The company’s release manager then triggers the acceptance process, and an automated operation combines software and installation artifacts into an installable package. Then come test deployment to a virtual environment and the “smoke test”. If all four steps are completed successfully, the package is deemed to have been accepted and the subsequent rounds of testing can begin.
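A schematic sketch of such an automated acceptance chain is shown below; the step functions are placeholders rather than a real tool, and in practice they would call the build tool, the deployment scripts and the test framework.

# Placeholder steps; each returns True on success.
def combine_artifacts(release_id): return True       # package software + installation artifacts
def deploy_to_virtual_env(release_id): return True   # test deployment to a virtual environment
def run_smoke_test(release_id): return True          # quick functional check of the deployment
def register_acceptance(release_id): return True     # mark the package as accepted

def accept_release(release_id: str) -> bool:
    steps = [combine_artifacts, deploy_to_virtual_env, run_smoke_test, register_acceptance]
    for step in steps:
        print(f"{release_id}: {step.__name__} ...")
        if not step(release_id):                      # any failing step rejects the delivery
            print(f"{release_id}: acceptance FAILED at {step.__name__}")
            return False
    print(f"{release_id}: accepted, further rounds of testing can begin")
    return True

# accept_release("release-2.4.1")   # example call with a hypothetical release id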

Build Continuity

User companies that do not receive ready-made software packages can achieve efficiencies by automating their own build process. After source acceptance, the release manager triggers the new releases, preferably using a ticketing and tracking tool as discussed above, thereby initiating the automatic creation of the new software builds. These are numbered according to a prescribed numbering scheme and placed in the software repository. Deployment to the predefined environments can then be performed automatically, manually, or according to a schedule – generally overnight to permit uninterrupted testing.

Automating the individual process steps as described can bring big efficiency gains. However, there is an even more convenient solution: one single automated process covering all steps from acceptance through deployment to regression testing and reporting. To ensure that the process does not come to an unexpected halt, clear and absolutely consistent standards are essential. In the case of testing in particular, an organization should determine in advance the areas where automation will really be worthwhile. In general, the Pareto principle applies: around 80 percent of tests can be usefully automated, while for the remaining 20 percent, changes are so frequent that manual testing is the more efficient solution.

Keep an Eye on the Big Picture

The greater the degree of automation, the more important it is to keep an eye on the big picture. Particularly where there are multiple test and production environments to manage, dashboards giving an overview of the various environments usually prove very useful. Using information provided by the ticketing system, dashboards give a transparent overview of which version of the software is at what status on which environment, supplementing the detailed status information provided by the reports on the various subprocesses such as acceptance, build and deployment and testing.

It is difficult to give a general estimate as to how long it will take a particular company to introduce the necessary standards and then automate its processes: IT environments vary in terms of the number of applications they comprise and the heterogeneity of their infrastructure. In addition, not every company will need to go through all the steps described here, particularly as many of them will already have begun industrializing their activities and implementing important standards. A realistic estimate is only possible in light of the circumstances in each individual case. There is one golden rule for all projects, however: a thorough and rigorous design phase, covering both technical and process-related issues, is the most important requirement for avoiding problems once the system is in operation.

Must Haves

For a successful, delay-free automation project, it is essential to plan for the following from the outset:

  1. Secure management support from the very beginning.
  2. Standardize both your test and production environments.
  3. Limit the number of application servers and third-party tools.
  4. Give your suppliers clear specifications regarding integration scenarios and application servers.
  5. Restrict the number of build and deployment processes and tools you use.
  6. Administer the environment-dependent parameters for all environments centrally.
  7. Ensure compliance with standards and processes using a universal workflow tool.
  8. Ensure that parallel development branches for a software product are merged with the common target package in good time.

Build Management

A single enterprise build management tool is usually sufficient to cover all requirements. Tools that can help you automate and standardize in a clear and intelligent fashion include the following tried-and-tested systems:

  • Hudson: Highest-profile representative of the open source community; it enjoys great popularity due in particular to its ease of installation and configuration.
  • AnthillPro: Allows customers to map their entire software life cycle in automated steps.
  • Bamboo: A new product from Atlassian which integrates and combines seamlessly with the company’s other products, such as the JIRA ticket system.

These systems are compatible with all standard build tools, such as Ant and Maven.

Ticketing and Tracking

A standardized ticketing and tracking tool facilitates management and control of standardized processes. The choice of tool frequently hinges on the required degree of flexibility and the cost:

  • OTRS or Redmine: These well-known open source tools are an especially cost-effective introduction to issue tracking. Customization is sometimes time-consuming, however, and the entire software life cycle is not covered.
  • JIRA: A highly economical and flexible solution by Atlassian that avato consulting uses in its own projects.
    • Supports CMMI and ITIL processes.
    • Workflows can be adjusted quickly and easily via the web-based administration interface, so no special IT knowledge is required for configuration.
    • Provides interfaces to SCM systems, build systems and many others.
    • The manufacturer provides a large number of low-cost or free plug-ins.



 

Do you have any further questions? Contact: marketing@avato.net

Imprint: 

Date: February 2011
Author: Uwe Bloch
Contact: marketing@avato.net
www.avato-consulting.com
© 2011 avato consulting
All Rights Reserved.