Why is the ReLU activation preferred over other activations?
Let me put it down in two ways:
The problem of vanishing gradients, and faster training
The ReLU function is given by max(0, a), where a = Wx + b. When a is greater than 0, the gradient is a constant 1, in contrast to the sigmoid, whose gradient keeps shrinking as the absolute value of the input grows. This constant gradient results in faster learning. Moreover, the gradient of the sigmoid is always smaller than 1 (its maximum is 0.25, at input 0). During backpropagation in a deep network (or an RNN unrolled over many time steps), these per-layer gradients get multiplied together, so with many layers the product shrinks towards zero. This makes training deep networks with sigmoids nearly impossible: the vanishing-gradient problem.
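A small NumPy sketch of this point: the sigmoid's derivative never exceeds 0.25, so a product of such factors across many layers collapses towards zero, while the ReLU's gradient stays 1 on active units (the depth of 20 is just an illustrative choice):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: s(x) * (1 - s(x)), peaks at 0.25 when x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    # Derivative of max(0, x): 1 where x > 0, else 0.
    return (np.asarray(x) > 0).astype(float)

x = np.linspace(-6.0, 6.0, 1001)
print(sigmoid_grad(x).max())   # 0.25 -> always < 1
print(relu_grad(3.0))          # 1.0 for any positive pre-activation

# Multiplying one gradient factor per layer through a 20-layer stack:
depth = 20
print(0.25 ** depth)           # ~9.1e-13 -> the gradient has vanished
print(1.0 ** depth)            # 1.0      -> ReLU's gradient survives
```

In practice each layer's factor also includes the weight matrices, but the sigmoid's ≤ 0.25 factor still drags the product down exponentially with depth.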
Now, when a is less than or equal to zero, the ReLU outputs exactly zero, so with many units the network's activations become highly sparse: for any given input, only a fraction of the units fire. Sigmoids, by contrast, yield dense activations, since their output is never exactly zero.