I was watching a tutorial on weight initialization in neural networks, and I'm not able to understand this statement:
"In the case of tanh or sigmoid activations, if we initialize the weights with large values (range [0,1)), then training becomes slow and the vanishing gradient problem may arise."
But how is that possible? I thought the vanishing gradient problem (VGP) is caused by small gradient values, which come from small weights or small outputs from the activation function.
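To make my confusion concrete, here is a quick numerical sketch I put together (my own toy example with a single tanh unit and made-up weight values, not from the tutorial):

```python
import numpy as np

# Toy single tanh unit: pre-activation z = w * x.
# The weight values below are hypothetical, just to compare magnitudes.
x = 1.0
for w in [0.01, 0.1, 1.0, 5.0]:
    z = w * x                      # pre-activation
    grad = 1.0 - np.tanh(z) ** 2   # d/dz tanh(z), the local gradient
    print(f"w={w:>5}: tanh(z)={np.tanh(z):+.4f}, tanh'(z)={grad:.4f}")
```

When I run this, the derivative is close to 1 for the small weights but nearly zero for w=5.0, where tanh is saturated. Is that saturation effect what the tutorial means, and if so, how does it square with the "small weights give small gradients" explanation I had in mind?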