Killer Combo: Softmax and Cross Entropy

Paolo Perrotta
Published in Level Up Coding
Jun 27, 2020

The derivative of the softmax and the cross entropy loss, explained step by step.

Take a glance at a typical neural network — in particular, its last layer. Most likely, you’ll see something like this:

The softmax and the cross entropy loss fit together like bread and butter. Here is why: to train the network with backpropagation, you need to calculate the derivative of the loss. In the general case, that derivative can get complicated. But if you use the softmax and the cross entropy loss, that complexity fades away. Instead of a long clunky formula, you end up with this terse, easy to compute thing:
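In symbols, with ŷ for the network's output, y for the label, and l for the inputs of the softmax:

```latex
\frac{\partial L}{\partial l_i} = \hat{y}_i - y_i
```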

A few readers of Programming Machine Learning asked me where this derivative comes from. This post is all about that calculation: it shows you how to derive the formula above, step by step.

Disclaimer: this post contains a lot of formulae, but I’m a bit of a wimp when it comes to math. I prefer detailed explanations over succinct mathematical proofs. Hardcore math fans will find my explanations very long-winded.

Before we dive into the calculation, let’s lay out the mathematical tools that we’ll need.

A Few Tools We Need

In the rest of this post, we’ll need a handful of differentiation rules. I’ll recap four of them here.

First, let’s say that we have two functions: f(x) and g(x). The chain rule tells us how to calculate the derivative of their composition, f(g(x)):
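```latex
\frac{d}{dx}\, f(g(x)) = f'(g(x)) \cdot g'(x)
```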

We’ll also use the quotient rule to calculate the derivative of a fraction:
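```latex
\frac{d}{dx}\, \frac{f(x)}{g(x)} = \frac{f'(x)\, g(x) - f(x)\, g'(x)}{g(x)^2}
```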

Finally, we’ll need a couple of basic derivatives — the kind that you learn by rote memory. The derivative of the logarithm function, and the derivative of the exponential function:
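```latex
\frac{d}{dx}\, \ln x = \frac{1}{x}
\qquad
\frac{d}{dx}\, e^x = e^x
```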

With these tools at hand, let’s calculate the derivative of the cross entropy loss applied to the softmax. Let’s start with the derivative of the softmax alone, and we’ll come back to the cross entropy loss later.

Deriving the Softmax

Let’s look at the softmax function. As a reminder, the outputs of the softmax are also the outputs of the entire neural network, which I called ŷ. Its inputs are the nodes in the last hidden layer. They’re usually called logits, so I named them l in this formula:
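```latex
\hat{y}_i = \frac{e^{l_i}}{\sum_{j=1}^{k} e^{l_j}}
```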

The softmax has an equal number of inputs and outputs, so there are as many ŷ as l. I used the letter k for the number of inputs and outputs.

The sum in the denominator is a bit hard to read, and we’re going to see a lot of it. To make it easier on the eye, I’ll omit its index:
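```latex
\hat{y}_i = \frac{e^{l_i}}{\sum e^l}
```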

Now we’re looking to calculate the derivative of the softmax over its inputs:
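```latex
\frac{\partial \hat{y}_i}{\partial l_n}
```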

Both indexes, i and n, can take any value from 1 to k.

This derivative might look confusing at first. Here is a trick that makes it easier to deal with: don’t try to calculate the generic derivative for any value of n and i. Instead, calculate it for two separate cases: the case where i and n are the same, and the case where they’re different. Let’s do that.

The Case Where i=n

Let’s start with the case where i and n are the same. That means that we’re calculating the derivative of a softmax output with respect to its matching input.

Do you still remember the quotient rule that we mentioned a few paragraphs ago? Good! Let’s put it to work:
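```latex
\frac{\partial \hat{y}_i}{\partial l_i}
  = \frac{\partial}{\partial l_i} \frac{e^{l_i}}{\sum e^l}
  = \frac{e^{l_i} \sum e^l - e^{l_i}\, e^{l_i}}{\left(\sum e^l\right)^2}
  = \frac{e^{l_i}}{\sum e^l} \cdot \frac{\sum e^l - e^{l_i}}{\sum e^l}
  = \hat{y}_i \left(1 - \hat{y}_i\right)
```

Note that the derivative of the sum with respect to lᵢ is just e^lᵢ: every other term in the sum is a constant with respect to lᵢ, so it drops out.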

The Case Where i≠n

Now let’s look at the case where i and n are different. So we’re calculating the derivative of an output with respect to any non-matching input:
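```latex
\frac{\partial \hat{y}_i}{\partial l_n}
  = \frac{0 \cdot \sum e^l - e^{l_i}\, e^{l_n}}{\left(\sum e^l\right)^2}
  = -\frac{e^{l_i}}{\sum e^l} \cdot \frac{e^{l_n}}{\sum e^l}
  = -\hat{y}_i\, \hat{y}_n
```

This time the numerator e^lᵢ is a constant with respect to lₙ, so its derivative is zero, and the only surviving term comes from differentiating the denominator.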

Wrapping Up the Derivative of the Softmax

Let’s squash the derivatives that we found into a single formula:
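```latex
\frac{\partial \hat{y}_i}{\partial l_n} =
\begin{cases}
  \hat{y}_i \left(1 - \hat{y}_i\right) & \text{if } i = n \\[4pt]
  -\hat{y}_i\, \hat{y}_n & \text{if } i \neq n
\end{cases}
```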

Mathematicians have their own notation to mean “apply this part of the formula only when this condition is true”: a curly brace with one case per line. The first case applies when i=n, the second when i≠n.
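If you’d rather trust code than algebra, here is a minimal NumPy sketch that checks this formula against a numerical derivative of the softmax:

```python
import numpy as np

def softmax(l):
    # Shift by the max for numerical stability; the shift cancels out.
    e = np.exp(l - np.max(l))
    return e / e.sum()

def softmax_jacobian(y_hat):
    # J[i, n] = y_hat[i] * (1 - y_hat[i]) when i == n,
    #           -y_hat[i] * y_hat[n]      when i != n.
    return np.diag(y_hat) - np.outer(y_hat, y_hat)

l = np.array([1.0, 2.0, 0.5])
analytic = softmax_jacobian(softmax(l))

# Numerical derivative: bump each logit by a tiny epsilon.
eps = 1e-6
numeric = np.empty((3, 3))
for n in range(3):
    bumped = l.copy()
    bumped[n] += eps
    numeric[:, n] = (softmax(bumped) - softmax(l)) / eps

print(np.allclose(analytic, numeric, atol=1e-5))  # prints: True
```

The diagonal-minus-outer-product trick builds the whole Jacobian at once: the diagonal entries become ŷᵢ − ŷᵢ², and every off-diagonal entry becomes −ŷᵢŷₙ.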

With that, we have the derivative of the softmax. On to the derivative of the cross entropy loss!

Deriving the Cross Entropy Loss

Here is the formula of the cross entropy loss for a single output of the neural network:
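```latex
L_i = -y_i \cdot \ln(\hat{y}_i)
```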

To get the total loss for all the network’s outputs, we can sum the losses over each output:
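```latex
L = -\sum_{i=1}^{k} y_i \cdot \ln(\hat{y}_i)
```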

I have to make a digression here, to avoid confusion. In the book and in this other post, I show you a subtly different formula for the cross entropy loss. Both formulae are valid, but they mean different things. The one above is the loss of a single example — a piece of data moving through the network. In this formula, yᵢ means: “the i-th component of the label.” By contrast, the formula in the book is the total loss over an entire dataset. In there, yᵢ means: “the i-th label.” To get that total loss, you’d calculate the loss for each example using the formula above — and then average all those losses.

In other words: this is the same loss that you know, and hopefully love, from the book. However, in the book we look at the big picture, while in this post we look at the details.

OK, digression done. Let’s take the derivative of the cross entropy loss with respect to one of its inputs. Hint: remember the chain rule before you dive in.
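```latex
\frac{\partial L}{\partial l_n}
  = \sum_{i=1}^{k} \frac{\partial L}{\partial \hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial l_n}
  = -\sum_{i=1}^{k} \frac{y_i}{\hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial l_n}
```

The loss depends on the logit lₙ through every one of the softmax outputs, so the chain rule sums over all of them. The derivative of the logarithm is what turns each −yᵢ · ln(ŷᵢ) into −yᵢ/ŷᵢ.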

Earlier on, we calculated the derivative of the softmax. Now we have the derivative of the cross entropy loss, expressed in terms of the derivative of the softmax. Let’s mash up the two formulae and see what we get.

Putting It All Together

I promise that we’re almost there. We have one last calculation to go — but it’s a tricky one. Let’s substitute the derivative of the softmax into the derivative of the cross entropy loss:
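```latex
\begin{aligned}
\frac{\partial L}{\partial l_n}
  &= -\sum_{i=1}^{k} \frac{y_i}{\hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial l_n} \\
  &= -\frac{y_n}{\hat{y}_n} \cdot \hat{y}_n \left(1 - \hat{y}_n\right)
     - \sum_{i \neq n} \frac{y_i}{\hat{y}_i} \cdot \left(-\hat{y}_i\, \hat{y}_n\right) \\
  &= -y_n + y_n\, \hat{y}_n + \sum_{i \neq n} y_i\, \hat{y}_n \\
  &= -y_n + \hat{y}_n \sum_{i=1}^{k} y_i \\
  &= \hat{y}_n - y_n
\end{aligned}
\]
```

The trick in the middle is to split the sum into the i=n term and the i≠n terms, use the matching case of the softmax derivative for each, and then recombine them. The last step uses the fact that the components of the label y sum to 1: for a one-hot label, one component is 1 and all the others are 0.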

And there we have it: the derivative of the cross entropy loss applied to the softmax. It took a while to calculate this thing — but now that we have it, we can sit back and admire its simplicity. And even better: we’ll never have to calculate this derivative again!
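As a final sanity check, here is a small NumPy sketch that confirms ŷ − y against a numerical gradient of the whole softmax-plus-cross-entropy pipeline:

```python
import numpy as np

def softmax(l):
    # Shift by the max for numerical stability; the shift cancels out.
    e = np.exp(l - np.max(l))
    return e / e.sum()

def cross_entropy(y, y_hat):
    return -np.sum(y * np.log(y_hat))

l = np.array([1.0, 2.0, 0.5])   # logits
y = np.array([0.0, 1.0, 0.0])   # one-hot label

# The terse analytic gradient we just derived.
analytic = softmax(l) - y

# Numerical gradient of the composed loss, one logit at a time.
eps = 1e-6
numeric = np.empty_like(l)
for n in range(len(l)):
    bumped = l.copy()
    bumped[n] += eps
    numeric[n] = (cross_entropy(y, softmax(bumped)) -
                  cross_entropy(y, softmax(l))) / eps

print(np.allclose(analytic, numeric, atol=1e-5))  # prints: True
```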

This post is a spin-off of Programming Machine Learning, a zero-to-hero introduction for programmers, from the basics to deep learning. Go here for the eBook, here for the paper book, or come to the forum if you have questions and comments!
