Written on October 18, 2022

(Insights) Focus on Cross-entropy loss and KL divergence

Deep learning models make massive use of the classical cross-entropy loss for training. Let’s take a closer look at this loss.


Deep Learning optimization problem

To be able to train a deep learning model, we need two basic things: 1) a model, 2) a way for it to learn. For classification tasks, the objective of the model is to predict the “correct” class of a (labeled) input. Thus, the goal of the model is simply to minimize the error it makes on a distribution of data (or, in practice, on a dataset).

If $f_{\theta}: \mathcal{X} \to \Delta^{K}$ is the map, with parameters $\theta$, that associates to an input $x$ the vector of predicted probabilities that $x$ belongs to each class, then we denote by $h_{\theta}(x) := \text{argmax}(f_{\theta}(x))$ the corresponding classifier on a $K$-class problem. The objective is to minimize the risk associated with the loss $l$:

\[\min_{\theta} \mathbb{E}_{(X,Y) \sim P_{X,Y}}\left[l(Y, h_{\theta}(X))\right]\]

where $P_{X,Y}$ is the distribution of the data.
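To make these objects concrete, here is a minimal sketch in NumPy (assuming, purely for illustration, a linear-softmax model; none of the names below come from a specific library):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def f_theta(theta, x):
    """Maps an input x to a vector of class probabilities in the simplex Delta^K."""
    return softmax(x @ theta)

def h_theta(theta, x):
    """The induced classifier: the index of the largest predicted probability."""
    return int(np.argmax(f_theta(theta, x)))

def empirical_risk(theta, X, y, loss):
    """Empirical counterpart of E_{(X,Y) ~ P_{X,Y}}[ l(Y, h_theta(X)) ] on a dataset."""
    return float(np.mean([loss(yi, h_theta(theta, xi)) for xi, yi in zip(X, y)]))
```

The loss $l$ is deliberately left abstract here; the next paragraphs discuss which loss to plug in.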

In theory, what we want to minimize is the 0-1 loss: $\forall (y, y') \in \{1, \dots, K\}^2$, $l(y, y') = 1$ if $y \neq y'$ and $l(y, y') = 0$ if $y = y'$. It simply means that we penalize the model each time it makes an error, and we don’t penalize it when it’s right.

The problem with the 0-1 loss is that it is not usable in practice, for a simple reason: it is piecewise constant, and thus not differentiable. And deep learning models rely on gradients to solve the optimization problem and update their parameters (typically by gradient descent).
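A quick numerical illustration of this flatness (a toy sketch with made-up data, not a real training setup): small perturbations of $\theta$ almost never change the argmax, so the empirical 0-1 risk does not move and its gradient is zero almost everywhere.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # toy inputs (made up)
y = rng.integers(0, 3, size=200)         # toy labels over K = 3 classes
theta = rng.normal(size=(5, 3))          # toy linear model

def zero_one_risk(theta):
    # argmax of the scores = argmax of the softmax probabilities
    preds = (X @ theta).argmax(axis=1)
    return np.mean(preds != y)           # empirical 0-1 risk

base = zero_one_risk(theta)
for eps in [1e-8, 1e-6, 1e-4]:
    risk = zero_one_risk(theta + eps * rng.normal(size=theta.shape))
    print(eps, risk - base)              # almost always exactly 0.0: the risk is flat
```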


Surrogate losses

To circumvent the non-differentiability of the 0-1 loss, we can simply replace it by another, more convenient loss that, in the end, does the same job. This is exactly the concept of consistent surrogate losses: convex losses that are more useful in practice because they are easy to optimize, and that satisfy the consistency property stating that the function (or parameter $\theta$ in our case) minimizing the risk for the surrogate loss also minimizes the risk for the 0-1 loss.

There are several consistent surrogate losses for the 0-1 loss, including for example the hinge loss, the logistic loss, and the cross-entropy loss.

This property is key to understanding why we use the cross-entropy loss in deep learning (or other surrogate losses for other types of models) and why it works.
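For intuition, here is how a few of these surrogate losses compare to the 0-1 loss in the binary case, written as functions of the margin $m = y \cdot s$ with label $y \in \{-1, +1\}$ and score $s$ (a standard textbook parametrization, not something specific to this post):

```python
import numpy as np

def zero_one(m):
    return (m <= 0).astype(float)        # 1 when the prediction has the wrong sign

def hinge(m):
    return np.maximum(0.0, 1.0 - m)      # hinge loss (used by SVMs)

def logistic(m):
    return np.log1p(np.exp(-m))          # logistic loss

margins = np.linspace(-2.0, 2.0, 9)
for name, loss in [("0-1", zero_one), ("hinge", hinge), ("logistic", logistic)]:
    print(f"{name:8s}", np.round(loss(margins), 3))
```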


Cross-entropy Loss

Let us identify $y$ with its one-hot vector representation (the vector with $0$ everywhere except in the $y$-th position, where it is $1$). Then, the cross-entropy loss in our classification context is defined by:

\[CE(Y, f_{\theta}(X)) = - \sum_{k=1}^K Y_k \log(f_{\theta}(X)_k) = - \log(f_{\theta}(X)_y)\]

Note that the cross-entropy loss can of course be defined between any two distributions $P$ and $Q$. We are in a specific case where one of our distributions, $Y$, is a Dirac, which simplifies the problem.
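As a quick numerical check (with made-up probabilities), the full sum over classes and the single $-\log$ term coincide when $Y$ is one-hot:

```python
import numpy as np

probs = np.array([0.1, 0.7, 0.2])        # f_theta(x): predicted probabilities, K = 3
y = 1                                    # true class index
Y = np.eye(3)[y]                         # one-hot representation of y

ce_full   = -np.sum(Y * np.log(probs))   # -sum_k Y_k log f_theta(x)_k
ce_single = -np.log(probs[y])            # -log f_theta(x)_y

print(ce_full, ce_single)                # both equal ~0.3567
```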

Kullback-Leibler Divergence

The KL divergence is a different notion of distance (it is not a metric, since it’s not symmetric) between probability distributions. In general, for two discrete distributions $P$ and $Q$ over $K$ outcomes, the KL divergence is defined by: $KL(P || Q) = \sum_{k=1}^K P(k) \log\left( \frac{P(k)}{Q(k)} \right)$. In our case, we have:
\[KL(Y || f_{\theta}(X)) = \sum_{k=1}^K \left[ Y_k \log(Y_k) - Y_k \log(f_{\theta}(X)_k) \right]\]

Since the data is fixed, if we minimize the risk using the KL divergence, we can drop the part that depends only on $Y_k$ from the minimization problem. Thus, minimizing the KL divergence is equivalent to minimizing $-\sum_{k=1}^K Y_k \log(f_{\theta}(X)_k)$, which is exactly the cross-entropy loss.
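A quick numerical check of this equivalence (toy numbers): for any target $P$, $KL(P || Q) = CE(P, Q) - H(P)$, and when $P$ is one-hot its entropy $H(P)$ is zero, so the KL divergence and the cross-entropy coincide exactly.

```python
import numpy as np

def cross_entropy(p, q):
    mask = p > 0                          # restrict to the support of p (0 * log 0 := 0)
    return -np.sum(p[mask] * np.log(q[mask]))

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

q = np.array([0.1, 0.7, 0.2])             # model prediction f_theta(x)

p_soft = np.array([0.2, 0.5, 0.3])        # a generic target distribution
print(kl(p_soft, q), cross_entropy(p_soft, q) - entropy(p_soft))   # identical (~0.092)

p_onehot = np.array([0.0, 1.0, 0.0])      # Dirac / one-hot label Y
print(kl(p_onehot, q), cross_entropy(p_onehot, q))                 # identical (~0.357)
```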

What is worth noting here is that, even though it’s not always obvious in practice because we use the empirical version of the definition above (the problems we want to solve are defined on datasets, not on distributions), deep down we are trying to minimize a distance between probability distributions. This opens the way for several things:

1) Modifying the distance used. Specifically, the optimal transport field of research has gained much traction recently and put the Wasserstein distance at the forefront, which is usually better at describing differences between probability distributions than the KL divergence. This could be explored, even though our distributions are so simple (a Dirac and a discrete distribution with finite support) that it may not be worth the change.

2) Modifying the data distribution. For related problems (e.g. distillation), changing the label distribution of the data (thus the $Y$) can be a good idea and can be explored, as sketched below.
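As a rough sketch of this second direction (a hypothetical knowledge-distillation setup; the logits and the temperature $T$ below are made up for illustration), replacing the one-hot $Y$ by a teacher’s softened predictions turns the loss into a KL divergence between two non-degenerate distributions:

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T        # temperature T > 1 softens the distribution
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

teacher_logits = np.array([2.0, 1.0, 0.1])    # hypothetical teacher outputs
student_logits = np.array([1.5, 1.2, 0.3])    # hypothetical student outputs
T = 2.0

p_teacher = softmax(teacher_logits, T)        # soft target: replaces the one-hot Y
q_student = softmax(student_logits, T)

kl = np.sum(p_teacher * np.log(p_teacher / q_student))
print(kl)   # KL between two full distributions, not a Dirac against a prediction
```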