# The heart of machine learning: Understanding the importance of loss functions

### Why loss functions are essential for achieving accurate and reliable predictions from machine learning models in the supply chain context

A relevant consideration when performing time series forecasting with machine learning models is the choice of loss function. Loss functions are the driving force behind any machine learning model and play a crucial role in evaluating its performance. A loss function measures the difference between the predicted and true values, and it guides the model during training toward the set of parameters that minimizes the loss.

Our intern Dirk Cremers took a deep dive into this topic for his master thesis. Here’s what you need to know:

### The different types of loss functions and how they work

You can use many different loss functions for time series forecasting. The best-known are the mean absolute percentage error (MAPE), the mean absolute error (MAE), and the root mean squared error (RMSE). Beyond these, there are other loss functions (e.g. the Huber and Tweedie losses) that can provide improved performance in certain scenarios. In the remainder of this blog post, we explain each of these loss functions in more detail. First, here are some definitions that we use throughout this article:

• N is the number of observations
• F is the forecasted value
• A is the actual value
###### MAPE – the mean absolute percentage error

First off, the mean absolute percentage error (MAPE). This function is commonly used to evaluate the performance of machine learning models for forecasting, but it is also used as a loss function. The MAPE is formulated as follows:

MAPE = (1/N) · Σ ( |A - F| / A )

The main advantage of using this loss function is that it results in a forecast that respects zero values in a dataset quite well (e.g. it often predicts zero when the actual is also zero). The reason is that the MAPE divides each error separately by the actual demand, which skews it: high errors during periods of low demand greatly inflate the MAPE. As a result, optimizing the MAPE tends to produce a forecast that underpredicts the actual demand. For many businesses, this is not the kind of behavior they are looking for.
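The scaling effect described above is easy to see numerically. Below is a minimal sketch of the MAPE in NumPy (the function name and the `eps` guard against division by zero are our own choices, not from any particular library): the same absolute miss of 10 units produces a 100% error in a low-demand period but only a 2% error in a high-demand period.

```python
import numpy as np

def mape(actual, forecast, eps=1e-9):
    """Mean absolute percentage error; eps guards against division by zero."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return np.mean(np.abs(actual - forecast) / np.maximum(np.abs(actual), eps))

mape([10], [20])    # -> 1.0  (an error of 10 on a demand of 10: 100%)
mape([500], [510])  # -> 0.02 (the same error of 10 on a demand of 500: 2%)
```

Because the low-demand miss dominates the average, a MAPE-minimizing model prefers to forecast low, which is exactly the underprediction bias described above.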

###### MAE – the mean absolute error

The mean absolute error (MAE) is a widely used loss function in machine learning, particularly in forecasting. It is defined as the average absolute difference between the predicted values and the true values, and can be expressed mathematically as:

MAE = (1/N) · Σ |A - F|

One of the main advantages of using the MAE as a loss function is that it is relatively robust to outliers. Furthermore, an interesting property of the MAE is that minimizing it can be shown to steer the forecast toward the median of the demand distribution.

However, the MAE also has some limitations. For example, it does not penalize large errors as heavily as other loss functions such as the root mean squared error (RMSE). This can be problematic in cases where large errors are especially costly. Additionally, the MAE does not take the error relative to the actual value into account: it judges predicting 510 when the actual value is 500 to be just as bad as predicting 20 when the actual is 10. Most planners would not agree with that judgment.
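The scale-blindness of the MAE can be verified in a few lines. This is a minimal sketch in NumPy (the function name is our own): both forecasts miss by 10 units, so the MAE scores them identically, even though one is a 2% miss and the other a 100% miss.

```python
import numpy as np

def mae(actual, forecast):
    """Mean absolute error: the average absolute difference."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return np.mean(np.abs(actual - forecast))

mae([500], [510])  # -> 10.0
mae([10], [20])    # -> 10.0 (same score, very different relative error)
```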

###### RMSE – the root mean squared error

The root mean squared error (RMSE) is a useful loss function when large errors are more costly and it is important to minimize them. It is obtained by taking the square root of the average squared difference between the predicted and actual values:

RMSE = √( (1/N) · Σ (A - F)² )

As mentioned, the advantage of this loss function is that it heavily penalizes large errors, which can be useful in many cases. Furthermore, it can be shown that minimizing the RMSE steers the forecast toward the mean of the demand distribution, in contrast to the median (MAE).

However, the RMSE has some limitations as well. For example, it is sensitive to outliers in the data, as large errors significantly increase the overall error. Furthermore, just like the MAE, it does not take into account the error relative to the actual value.
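The difference in how the two losses treat error concentration can be demonstrated directly. In the sketch below (NumPy; function name our own), two forecasts have the same total absolute error of 20 units, so their MAE would be identical, but the RMSE doubles when the error is concentrated in a single large miss.

```python
import numpy as np

def rmse(actual, forecast):
    """Root mean squared error."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return np.sqrt(np.mean((actual - forecast) ** 2))

actual = [100, 100, 100, 100]
even_errors = [5, 5, 5, 5]     # total absolute error 20, spread evenly
spiky_errors = [0, 0, 0, 20]   # same total, concentrated in one miss

rmse(actual, [a - e for a, e in zip(actual, even_errors)])   # -> 5.0
rmse(actual, [a - e for a, e in zip(actual, spiky_errors)])  # -> 10.0
```

An outlier in the data has the same effect as the spiky forecast above, which is exactly why the RMSE is outlier-sensitive.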

###### The Huber loss function

To combine the best of both worlds, the Huber loss function was proposed (Chen, 2018). The Huber loss is a combination of the MAE and the RMSE and can be expressed mathematically as:

L(e) = ½ · e² if |e| ≤ δ, and L(e) = δ · (|e| - ½δ) otherwise, where e = A - F.

One of the main advantages of using the Huber loss function is that it combines the properties of both the MAE and the RMSE. Like the MAE, it is relatively robust to outliers, since errors larger than δ are only penalized linearly. Like the RMSE, it penalizes errors quadratically as long as they remain below δ. As a consequence, the Huber loss optimizes a compromise between the median (MAE) and the mean (RMSE), depending on the value of δ and the size of the errors.

However, the Huber loss function has some limitations as well. For example, it requires the user to choose a value for the δ parameter, which can be challenging and may require some trial and error to find the optimal value. Furthermore, the Huber loss function still does not capture the relative error to the actual value.
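The piecewise definition above translates directly into code. Below is a minimal sketch in NumPy (function name and the default δ = 1.0 are our own choices): small errors fall in the quadratic, RMSE-like branch, while errors beyond δ fall in the linear, MAE-like branch.

```python
import numpy as np

def huber(actual, forecast, delta=1.0):
    """Huber loss: quadratic for |error| <= delta, linear beyond it."""
    error = np.asarray(actual, dtype=float) - np.asarray(forecast, dtype=float)
    abs_err = np.abs(error)
    quadratic = 0.5 * error ** 2                 # RMSE-like branch
    linear = delta * (abs_err - 0.5 * delta)     # MAE-like branch
    return np.mean(np.where(abs_err <= delta, quadratic, linear))

huber([0], [0.5])  # -> 0.125 (|e| = 0.5 <= delta: quadratic branch)
huber([0], [3])    # -> 2.5   (|e| = 3 > delta: linear branch)
```

Tuning δ shifts the crossover point between the two regimes, which is precisely the trial-and-error step mentioned above.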

To visualize the effect of these four loss functions, we created an example with four different forecasts, one per loss function. The example also includes peaks on certain days due to promotions. As one can see, each forecast clearly shows the properties of its loss function.

###### The Tweedie loss function

Finally, let’s discuss the Tweedie loss function (He Zhou, 2019). This loss function was initially introduced for predicting insurance claims, thanks to its ability to capture whether a zero value is expected. Because it also handles skewed or heavy-tailed data well, it was later adopted for forecasting. For example, when a dataset contains many zero values, traditional loss functions such as the mean absolute error (MAE) or root mean squared error (RMSE) may not be effective, as they can produce unreliable results on such data.

Recent competitions like the M5 show that in these settings the Tweedie loss function outperforms the others. This is because the Tweedie loss is a generalization of the Poisson, Gamma, and inverse Gaussian losses: by choosing a specific value of the Tweedie parameter p, the loss can be aligned as closely as possible with the actual demand distribution. The main advantage of the Tweedie loss function is thus its ability to capture a wide range of distributions and to be tailored to specific types of data by selecting an appropriate value for p. This particularly helps when forecasting products whose historical data exhibit intermittency or bulk purchasing.

However, using the Tweedie loss function also means engaging in a trial-and-error process to find the right value of p for a given dataset, which can be challenging.
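For completeness, here is a minimal sketch of the Tweedie loss for 1 < p < 2, the range typically used for zero-inflated demand data; this form (the negative log-likelihood up to constants) matches what gradient-boosting libraries such as LightGBM use for their Tweedie objective, but the function name and the default p = 1.5 are our own choices.

```python
import numpy as np

def tweedie_loss(actual, forecast, p=1.5):
    """Tweedie deviance loss (negative log-likelihood up to constants),
    valid for 1 < p < 2. The forecast must be strictly positive.
    Its minimum over the forecast lies at forecast == actual."""
    y = np.asarray(actual, dtype=float)
    mu = np.asarray(forecast, dtype=float)
    return np.mean(-y * mu ** (1 - p) / (1 - p) + mu ** (2 - p) / (2 - p))

# For a fixed actual, the loss is smallest when the forecast matches it:
tweedie_loss([2], [2])  # smaller than tweedie_loss([2], [1]) or ([2], [3])
```

Scanning p between 1 (Poisson) and 2 (Gamma) on a validation set is a simple, if tedious, way to carry out the trial-and-error search mentioned above.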

### Conclusion

In general, the choice of loss function for time series forecasting depends on the specific characteristics of your data and your business goals. It is important to carefully evaluate the model's performance under different loss functions to find the one that provides the best results. By exploring newer loss functions, we find new ways to measure the difference between predicted and true values, which can lead to better model performance.

Therefore, loss functions are a crucial part of time series forecasting using machine learning models. By carefully choosing the right loss function for the task at hand, you can improve the accuracy of your predictions and make better business decisions as a result.

Do you have open questions or need support in choosing the right loss function for your business? Contact our expert Dan Roozemond!