Weight initialization is an important step when training a neural network. If the initial weights are too large, the gradients may explode. If they are too small, the gradients may vanish. In either case, the model may take a long time to converge to a good minimum, or it may never converge at all. So, weight initialization should be done with care.
Normally, weights are randomly initialized at the beginning. A simple approach is to draw them from a Gaussian distribution with a mean of zero and a standard deviation of one. The problem with this approach is that the variance of the signal changes from layer to layer: each neuron sums N weighted inputs, so with unit-variance weights the variance of that sum grows roughly N-fold at every layer. This is what causes the gradients to explode or vanish.
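The effect can be checked numerically. The following is a minimal sketch in plain Python, assuming independent standard Gaussian inputs; the function name `preactivation_variance`, the layer widths and the sample count are illustrative choices, not part of any library.

```python
import random
import statistics

def preactivation_variance(n_inputs, weight_std, n_samples=2000, seed=0):
    """Estimate Var(sum_i w_i * x_i) for x_i ~ N(0, 1) and w_i ~ N(0, weight_std^2)."""
    rng = random.Random(seed)
    sums = []
    for _ in range(n_samples):
        total = 0.0
        for _ in range(n_inputs):
            # One weighted input: a fresh weight times a fresh input value.
            total += rng.gauss(0.0, weight_std) * rng.gauss(0.0, 1.0)
        sums.append(total)
    return statistics.pvariance(sums)

if __name__ == "__main__":
    # With unit-variance weights, the estimate should land near n_inputs.
    for n in (10, 100):
        print(n, round(preactivation_variance(n, weight_std=1.0), 1))
```

With a standard deviation of one, the estimated variance comes out close to the number of inputs, so a wider layer amplifies the signal more strongly.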
Xavier Weight Initialization Technique
With each passing layer, we want the variance of the signal to remain the same. This keeps the signal from exploding to a high value or vanishing to zero. In other words, we need to initialize the weights in such a way that the variance is preserved from one layer to the next. This initialization process is known as Xavier initialization.
In the Xavier initialization technique, we pick the weights from a Gaussian distribution with zero mean and a variance of 1/N (instead of 1), where N is the number of inputs to the neuron, that is, the number of neurons in the previous layer.
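As a rough illustration, the sketch below draws weights with variance 1/N and pushes a random signal through a stack of purely linear layers (no bias, no activation), recording the variance after each one. The helper names `xavier_normal`, `forward_linear` and `layer_variances`, as well as the width and depth, are assumptions made for this example.

```python
import math
import random
import statistics

def xavier_normal(n_in, n_out, rng):
    """Return an n_out x n_in weight matrix with entries drawn from N(0, 1/n_in)."""
    std = math.sqrt(1.0 / n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

def forward_linear(weights, x):
    """Apply a linear layer without bias or activation."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def layer_variances(width=200, depth=5, seed=0):
    """Track the variance of the signal across `depth` linear layers."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    variances = [statistics.pvariance(x)]
    for _ in range(depth):
        x = forward_linear(xavier_normal(width, width, rng), x)
        variances.append(statistics.pvariance(x))
    return variances

if __name__ == "__main__":
    # Values should stay in the neighbourhood of 1 rather than growing with depth.
    print([round(v, 2) for v in layer_variances()])
```

The printed variances fluctuate because each is estimated from a single layer's outputs, but they should stay of order one instead of growing or shrinking by a constant factor per layer.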
Notes:
1. The original proposal by Glorot and Bengio uses a variance of 2/(Nin + Nout) instead of 1/N. Nin is the number of inputs coming into the layer and Nout is the number of outputs going out of it. This form balances the forward signal against the backward gradients, while the simpler 1/N form considers only the forward pass. The two are equal when Nin = Nout, and both are in common use.
2. In Keras, layers such as Dense use the uniform variant of the Xavier technique (glorot_uniform) by default to initialize their weights.
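To compare the variants side by side, here is a small sketch of the scale each one implies. The function names are hypothetical helpers written for this example. The formulas are the 1/N rule from the text, the 2/(Nin + Nout) variance from Glorot and Bengio's paper, and the matching uniform limit sqrt(6/(Nin + Nout)), which gives the same variance because a uniform distribution on [-a, a] has variance a^2/3.

```python
import math

def fan_in_std(n_in):
    """Standard deviation for the simplified rule: variance = 1 / n_in."""
    return math.sqrt(1.0 / n_in)

def glorot_normal_std(n_in, n_out):
    """Standard deviation for the Glorot rule: variance = 2 / (n_in + n_out)."""
    return math.sqrt(2.0 / (n_in + n_out))

def glorot_uniform_limit(n_in, n_out):
    """Limit a such that Uniform(-a, a) has variance 2 / (n_in + n_out)."""
    return math.sqrt(6.0 / (n_in + n_out))

if __name__ == "__main__":
    # For a square layer the two normal variants give the same scale.
    print(fan_in_std(256), glorot_normal_std(256, 256))
    # For a layer that shrinks the width, the Glorot scale is larger.
    print(fan_in_std(256), glorot_normal_std(256, 64))
    print(glorot_uniform_limit(256, 64))
```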
For ReLU activation function
If we are using ReLU as the activation function in the hidden layers, the variance is doubled to 2/n, because ReLU sets roughly half of its inputs to zero. This adaptation of the Xavier technique is commonly known as He initialization. We need to go through the following steps to implement it:
1. Generate random weights from a Gaussian distribution having mean 0 and a standard deviation of 1.
2. Multiply those random weights by the square root of (2/n). Here n is the number of input units for that layer.
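The two steps above can be sketched in plain Python as follows. The code also passes a random signal through a few ReLU layers and records the mean squared activation after each one, which should stay roughly constant under this scaling. All function names, the width and the depth are illustrative assumptions.

```python
import math
import random

def relu_layer_weights(n_in, n_out, rng):
    """Step 1: draw from N(0, 1). Step 2: multiply by sqrt(2 / n_in)."""
    scale = math.sqrt(2.0 / n_in)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n_in)] for _ in range(n_out)]

def relu(values):
    return [max(0.0, z) for z in values]

def mean_square(values):
    return sum(z * z for z in values) / len(values)

def mean_squares_through_relu_layers(width=200, depth=5, seed=0):
    """Record the mean squared activation after each ReLU layer."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(width)]
    history = [mean_square(a)]
    for _ in range(depth):
        w = relu_layer_weights(width, width, rng)
        z = [sum(wi * ai for wi, ai in zip(row, a)) for row in w]
        a = relu(z)
        history.append(mean_square(a))
    return history

if __name__ == "__main__":
    print([round(m, 2) for m in mean_squares_through_relu_layers()])
```

If the factor 2 were replaced by 1, the mean squared activation would be expected to shrink by about half at every layer, which is the motivation for the larger scale with ReLU.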
For other activation functions like Sigmoid or Hyperbolic Tangent
If we are using Sigmoid or Tanh as the activation function in the hidden layers, we need to go through the following steps to implement the Xavier initialization technique:
1. Generate random weights from a Gaussian distribution having mean 0 and a standard deviation of 1.
2. Multiply those random weights by the square root of (1/n). Here n is the number of input units for that layer.
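A minimal sketch of these two steps, keeping them separate so they map directly onto the list above. The function name `tanh_layer_weights` and the layer sizes are hypothetical and chosen only for illustration.

```python
import math
import random
import statistics

def tanh_layer_weights(n_in, n_out, seed=0):
    """Build an n_out x n_in weight matrix for a Sigmoid or Tanh layer."""
    rng = random.Random(seed)
    # Step 1: standard Gaussian weights (mean 0, standard deviation 1).
    raw = [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    # Step 2: rescale by sqrt(1 / n_in) so each weight has variance 1 / n_in.
    scale = math.sqrt(1.0 / n_in)
    return [[w * scale for w in row] for row in raw]

if __name__ == "__main__":
    weights = tanh_layer_weights(n_in=100, n_out=50)
    flat = [w for row in weights for w in row]
    # The empirical variance should be close to 1 / 100.
    print(statistics.pvariance(flat))
```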