Introduction
I have recently been trying to implement a neural network from scratch in Python. To solve a multi-class classification problem with a neural network, I started by building a simple one, and the most important piece of a neural network is backpropagation. Backpropagation is an algorithm for the supervised learning of artificial neural networks using gradient descent. In this article I want to derive the gradient of the cross-entropy loss function combined with the softmax activation function, so I will record the formulas I calculated here and leave the rest for a future post.
Softmax Function
Softmax is a generalization of the logistic function to multiple dimensions. The standard (unit) softmax function $\sigma : \mathbb{R}^K \to (0, 1)^K$ is defined as

$$\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \quad \text{for } i = 1, \dots, K,$$

where $z = (z_1, \dots, z_K)$ is the input vector. In simple words, it applies the standard exponential function to each element $z_i$ and normalizes by dividing by the sum of all these exponentials, so the outputs are positive and sum to 1.
In Python, we can code the softmax function as follows:

import numpy as np

def softmax(z):
    # Exponentiate each element of the input vector
    exps = np.exp(z)
    # Normalize so the outputs sum to 1
    return exps / np.sum(exps)
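As a quick sanity check (the input values here are made up purely for illustration), the outputs are positive and sum to 1:

z = np.array([1.0, 2.0, 3.0])   # example logits, arbitrary values
p = softmax(z)
print(p)          # approximately [0.09 0.245 0.665]
print(p.sum())    # 1.0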
The softmax function is actually numerically well-behaved: it has only positive terms, so we needn't worry about loss of significance, and the denominator is at least as large as the numerator, so the result is guaranteed to fall between 0 and 1. The only accident that might happen is over- or under-flow in the exponentials. Overflow of a single element or underflow of all elements of $z$ will make the output nan or meaningless. To protect against this, note that shifting every element of $z$ by the same constant $C$ does not change the result, because the factor $e^{C}$ cancels between the numerator and the denominator:

$$\sigma(z)_i = \frac{e^{z_i + C}}{\sum_{j=1}^{K} e^{z_j + C}}.$$

We can choose an arbitrary value for $C$, but $C = -\max(z)$ is the usual choice: the largest exponent then becomes 0, so overflow is impossible and the denominator is at least 1, which rules out dividing by an underflowed zero.
def stable_softmax(z):
    # Shift by the maximum so the largest exponent is 0 (avoids overflow)
    exps = np.exp(z - np.max(z))
    return exps / np.sum(exps)
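To see why this matters, here is a small demonstration (the large logits are chosen deliberately to force overflow and are not from any real model): the naive version overflows and returns nan, while the stable version gives the expected result.

z = np.array([1000.0, 1001.0, 1002.0])   # large logits that overflow np.exp
print(softmax(z))          # [nan nan nan] plus an overflow RuntimeWarning
print(stable_softmax(z))   # approximately [0.09 0.245 0.665]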
Derivative of Softmax
One could argue that, in theory, a deep network with a softmax function on top can represent any N-class probability function over the feature space. To train such a network with backpropagation, however, we need the derivative (gradient) of the softmax so that it can be passed back to the previous layer.
Since $\sigma(z)_i$ depends on every component of $z$, we need the partial derivative $\frac{\partial \sigma(z)_i}{\partial z_j}$ for every pair $(i, j)$. Let's take the easy case first and set $i = j$. Applying the quotient rule to $\sigma(z)_i = e^{z_i} / \sum_k e^{z_k}$,

$$\frac{\partial \sigma(z)_i}{\partial z_i}
= \frac{e^{z_i}\sum_k e^{z_k} - e^{z_i} e^{z_i}}{\left(\sum_k e^{z_k}\right)^2}
= \sigma(z)_i \bigl(1 - \sigma(z)_i\bigr).$$

Similarly, for $i \neq j$ we can get

$$\frac{\partial \sigma(z)_i}{\partial z_j}
= \frac{0 - e^{z_i} e^{z_j}}{\left(\sum_k e^{z_k}\right)^2}
= -\,\sigma(z)_i\,\sigma(z)_j.$$

So, the derivative of the softmax function is given compactly by

$$\frac{\partial \sigma(z)_i}{\partial z_j} = \sigma(z)_i \bigl(\delta_{ij} - \sigma(z)_j\bigr),$$

where $\delta_{ij}$ is the Kronecker delta (1 if $i = j$, 0 otherwise).
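As a check on this formula, here is a small sketch (the helper softmax_jacobian is my own addition, not part of the original derivation) that builds the full Jacobian matrix with numpy; each row sums to 0 because the softmax outputs always sum to 1.

def softmax_jacobian(z):
    # J[i, j] = d softmax(z)_i / d z_j = s_i * (delta_ij - s_j)
    s = stable_softmax(z)
    return np.diag(s) - np.outer(s, s)

J = softmax_jacobian(np.array([1.0, 2.0, 3.0]))
print(J.sum(axis=1))   # approximately [0. 0. 0.]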
Cross-Entropy
Cross-entropy is commonly used to quantify the difference between two probability distributions. Usually the ground-truth distribution (the one that your machine learning algorithm is trying to match) is expressed as a one-hot distribution. For a true distribution $y$ and a predicted distribution $\hat{y}$ over $K$ classes, the cross-entropy loss is defined as

$$H(y, \hat{y}) = -\sum_{i=1}^{K} y_i \log \hat{y}_i.$$
For example, suppose that for a specific training instance the true label is B (out of the possible labels A, B, and C). The one-hot distribution for this training instance is therefore $[0, 1, 0]$: the instance has 0% probability of being class A, 100% probability of being class B, and 0% probability of being class C. Now suppose your machine learning algorithm predicts the probability distribution $[0.228, 0.619, 0.153]$. Using the formula, we get

$$H(y, \hat{y}) = -\bigl(0 \cdot \log 0.228 + 1 \cdot \log 0.619 + 0 \cdot \log 0.153\bigr) = -\log 0.619 \approx 0.48.$$
def cross_entropy(y, yhat):
    # y: one-hot true labels, shape (n_samples, n_classes)
    # yhat: predicted probabilities, same shape
    loss = 0
    for s1, s2 in zip(y, yhat):
        # -y * log(yhat), summed over the classes of one sample
        log_likelihood = -np.multiply(s1, np.log(s2))
        loss += np.sum(log_likelihood)
    # Average over the samples
    loss = loss / y.shape[0]
    return loss
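Applied to the worked example above (a single training instance, so the average over samples is just that instance's loss):

y = np.array([[0, 1, 0]])                  # one-hot label: class B
yhat = np.array([[0.228, 0.619, 0.153]])   # predicted probabilities
print(cross_entropy(y, yhat))              # approximately 0.48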
Derivative of Cross-Entropy with Softmax
Now we can use the derivative of softmax derived above to obtain the gradient of the cross-entropy loss function. We want to compute the derivative of

$$L = -\sum_{i=1}^{K} y_i \log \sigma(z)_i$$

with respect to the softmax inputs $z_j$. By the chain rule,

$$\frac{\partial L}{\partial z_j} = -\sum_{i=1}^{K} \frac{y_i}{\sigma(z)_i} \frac{\partial \sigma(z)_i}{\partial z_j}.$$

Substituting the derivative of softmax we derived earlier, $\frac{\partial \sigma(z)_i}{\partial z_j} = \sigma(z)_i \bigl(\delta_{ij} - \sigma(z)_j\bigr)$, and using the fact that the one-hot labels sum to 1,

$$\frac{\partial L}{\partial z_j}
= -\sum_{i=1}^{K} y_i \bigl(\delta_{ij} - \sigma(z)_j\bigr)
= \sigma(z)_j \sum_{i=1}^{K} y_i - y_j
= \sigma(z)_j - y_j,$$
which is a very simple and elegant expression.
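In code, this means the gradient of the loss with respect to the logits is simply the predicted probabilities minus the one-hot labels. A minimal sketch for a single sample (the function name delta_cross_entropy is my own choice, not from any library):

def delta_cross_entropy(z, y):
    # Gradient of the cross-entropy loss w.r.t. the logits z
    # for one sample with one-hot label y: softmax(z) - y
    return stable_softmax(z) - y

z = np.array([1.0, 2.0, 3.0])     # example logits
y = np.array([0.0, 1.0, 0.0])     # true class is the second one
print(delta_cross_entropy(z, y))  # approximately [0.09 -0.755 0.665]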
Conclusion
Real-world neural networks are capable of solving multi-class classification problems. In this article, we derived the gradient of the cross-entropy loss with the softmax activation function. In the future, I will implement a simple feed-forward neural network using numpy only; with this derivation done beforehand, the implementation will be much easier.
References
- https://stackoverflow.com/questions/42599498/numercially-stable-softmax
- https://deepnotes.io/softmax-crossentropy
- https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/
- http://cs231n.github.io/convolutional-networks/
- https://stackabuse.com/creating-a-neural-network-from-scratch-in-python-multi-class-classification/