
Softmax and Cross-Entropy


Introduction

Recently I have been trying to implement a neural network from scratch in Python. To solve a multi-class classification problem, I want to build a simple neural network, and the most important part of a neural network is backpropagation. Backpropagation is an algorithm for supervised learning of artificial neural networks using gradient descent. I want to work out the derivative of the cross-entropy loss function with the softmax activation function, so this article records the formulas I calculated. As for the rest, I will discuss it in the future.

Softmax Function

Softmax is a generalization of the logistic function to multiple dimensions. In general, the standard (unit) softmax function $\sigma : \mathbb{R}^K \to (0, 1)^K$ is defined by the formula

$$\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

for $i = 1, \ldots, K$ and $z = (z_1, \ldots, z_K) \in \mathbb{R}^K$.

In simple words, it applies the standard exponential function to each element of the input vector and normalises these values by dividing by the sum of all these exponentials; this normalization ensures that the sum of the components of the output vector is 1.

In Python, we can code the softmax function as follows:

import numpy as np

def softmax(z):
    exps = np.exp(z)
    return exps / np.sum(exps)

The softmax function is actually numerically well-behaved. It has only positive terms, so we needn't worry about loss of significance, and the denominator is at least as large as the numerator, so the result is guaranteed to fall between 0 and 1. The only accident that might happen is over- or under-flow in the exponentials. Overflow of a single element of $z$ or underflow of all elements of $z$ will render the output more or less useless. To make our softmax function numerically stable, we simply normalize the values in the vector by multiplying the numerator and denominator with a constant $C$:

$$\frac{e^{z_i}}{\sum_j e^{z_j}} = \frac{C\, e^{z_i}}{C \sum_j e^{z_j}} = \frac{e^{z_i + \log C}}{\sum_j e^{z_j + \log C}}$$

We can choose an arbitrary value for the $\log C$ term, but generally $\log C = -\max_j z_j$ is chosen. This shifts the largest exponent to zero, which avoids overflow and prevents the result from becoming nan.

def stable_softmax(z):
    exps = np.exp(z - np.max(z))
    return exps / np.sum(exps)
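
As a quick sanity check of the two versions above (the exact output and warnings may vary slightly across numpy versions), passing large logits to the naive softmax produces nan, while the stabilized version is fine:

z = np.array([1000.0, 2000.0, 3000.0])
print(softmax(z))         # [nan nan nan]: np.exp overflows to inf, and inf / inf is nan
print(stable_softmax(z))  # [0. 0. 1.]: shifting by max(z) keeps the exponents in range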

Derivative of Softmax

One could argue that, in theory, a deep network with a softmax function on top can represent any N-class probability function over the feature space. To train such a network with backpropagation, we need to calculate the derivative (gradient) of the softmax and pass it back to the previous layer.

Let's take an easy example: set $K = 3$, so we know that $z = (z_1, z_2, z_3)$. And we can get $a_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2} + e^{z_3}}$, $a_2 = \frac{e^{z_2}}{e^{z_1} + e^{z_2} + e^{z_3}}$, $a_3 = \frac{e^{z_3}}{e^{z_1} + e^{z_2} + e^{z_3}}$. First, we calculate the partial derivatives of $a_1$ with the quotient rule:

$$\frac{\partial a_1}{\partial z_1} = \frac{e^{z_1}(e^{z_1} + e^{z_2} + e^{z_3}) - e^{z_1} e^{z_1}}{(e^{z_1} + e^{z_2} + e^{z_3})^2} = a_1(1 - a_1), \qquad \frac{\partial a_1}{\partial z_2} = \frac{-e^{z_1} e^{z_2}}{(e^{z_1} + e^{z_2} + e^{z_3})^2} = -a_1 a_2, \qquad \frac{\partial a_1}{\partial z_3} = -a_1 a_3.$$

Similarly, we can get $\frac{\partial a_2}{\partial z_1} = -a_2 a_1$, $\frac{\partial a_2}{\partial z_2} = a_2(1 - a_2)$, $\frac{\partial a_2}{\partial z_3} = -a_2 a_3$, $\frac{\partial a_3}{\partial z_1} = -a_3 a_1$, $\frac{\partial a_3}{\partial z_2} = -a_3 a_2$, $\frac{\partial a_3}{\partial z_3} = a_3(1 - a_3)$.

So, the derivative of the softmax function is given as

$$\frac{\partial a_i}{\partial z_j} = a_i(\delta_{ij} - a_j),$$

where $\delta_{ij}$ is the Kronecker delta.
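
In vectorized form this says the Jacobian of softmax is $\operatorname{diag}(a) - a a^{\top}$. As an optional sketch (the helper name softmax_jacobian is my own, reusing stable_softmax from above):

def softmax_jacobian(z):
    # J[i, j] = a_i * (delta_ij - a_j), i.e. diag(a) - outer(a, a)
    a = stable_softmax(z)
    return np.diag(a) - np.outer(a, a)

Each row of this Jacobian sums to zero, which reflects the fact that the softmax outputs always sum to 1.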

Cross-Entropy

Cross-entropy is commonly used to quantify the difference between two probability distributions. Usually the ground truth distribution (the one that your machine learning algorithm is trying to match) is expressed as a one-hot distribution. Cross-entropy loss is defined as

$$L = -\sum_{i} y_i \log \hat{y}_i,$$

where $y$ is the ground truth distribution and $\hat{y}$ is the predicted distribution.

For example, suppose for a specific training instance the true label is B (out of the possible labels A, B, and C). The one-hot distribution for this training instance is therefore $y = [0, 1, 0]$. You can interpret this true distribution to mean that the training instance has 0% probability of being class A, 100% probability of being class B, and 0% probability of being class C. Now, suppose your machine learning algorithm predicts the probability distribution $\hat{y} = [0.228, 0.619, 0.153]$. Using the formula, we can get

$$L = -(0 \cdot \log 0.228 + 1 \cdot \log 0.619 + 0 \cdot \log 0.153) = -\log 0.619 \approx 0.48.$$

def cross_entropy(y, yhat):
    # y and yhat are 2-D arrays with one distribution per row:
    # y holds the one-hot targets, yhat the predicted probabilities
    loss = 0
    for s1, s2 in zip(y, yhat):
        # per-sample loss: -sum_i y_i * log(yhat_i)
        log_likelihood = - np.multiply(s1, np.log(s2))
        loss += np.sum(log_likelihood)
    # average over the number of samples in the batch
    loss = loss / y.shape[0]
    return loss
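
Plugging the worked example above into this function (note that it expects 2-D arrays, one distribution per row) should reproduce the value we computed by hand:

y = np.array([[0.0, 1.0, 0.0]])
yhat = np.array([[0.228, 0.619, 0.153]])
print(cross_entropy(y, yhat))  # approximately 0.4797, i.e. -log(0.619)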

Derivative of Cross-Entropy with Softmax

Now we use the derivative of softmax derived above to compute the derivative of the cross-entropy loss function with respect to the logits $z$, where the softmax output $a$ plays the role of the prediction $\hat{y}$.

We want to compute the derivative of the loss $L = -\sum_k y_k \log a_k$ with respect to the logit $z_i$. By the chain rule,

$$\frac{\partial L}{\partial z_i} = -\sum_k y_k \frac{\partial \log a_k}{\partial z_i} = -\sum_k \frac{y_k}{a_k} \frac{\partial a_k}{\partial z_i}.$$

From the derivative of softmax we derived earlier, $\frac{\partial a_k}{\partial z_i} = a_k(\delta_{ki} - a_i)$, and since $y$ is a probability distribution, $\sum_k y_k = 1$, so

$$\frac{\partial L}{\partial z_i} = -\sum_k y_k (\delta_{ki} - a_i) = -y_i + a_i \sum_k y_k = a_i - y_i,$$

which is a very simple and elegant expression.
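
In code, the backward pass through softmax followed by cross-entropy therefore reduces to subtracting the one-hot target from the predicted probabilities. Here is a minimal sketch (the name delta_cross_entropy is my own choice), checked against a finite-difference approximation of the gradient:

def delta_cross_entropy(z, y):
    # dL/dz_i = a_i - y_i, where a = softmax(z)
    return stable_softmax(z) - y

# numerical check on a single example
z = np.array([1.0, 2.0, 0.5])
y = np.array([0.0, 1.0, 0.0])
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    z_plus, z_minus = z.copy(), z.copy()
    z_plus[i] += eps
    z_minus[i] -= eps
    loss_plus = -np.sum(y * np.log(stable_softmax(z_plus)))
    loss_minus = -np.sum(y * np.log(stable_softmax(z_minus)))
    numeric[i] = (loss_plus - loss_minus) / (2 * eps)

print(np.allclose(delta_cross_entropy(z, y), numeric))  # True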

Conclusion

Real-world neural networks are capable of solving multi-class classification problems. In this article, I showed how to calculate the derivative of the cross-entropy loss with the softmax activation function. In the future, I will implement a simple feed-forward neural network using numpy only; having worked out this computation beforehand will make the code easier to implement.


Author: Yang Wang