Activation Functions
Linear and Non-linear functions used in neural networks explained!
What is an activation function?
An activation function decides whether a neuron should be activated or not. In other words, it determines, using relatively simple mathematical operations, whether a neuron’s input is important to the prediction the network is making.
It also shapes each neuron’s output, for example squashing the result into the range 0 to 1 or -1 to 1.
The objective of an activation function is to add non-linearity to the neural network.
Suppose we have a neural network working without activation functions. Then every neuron would only perform a linear transformation on the inputs using the weights and biases. It wouldn’t matter how many hidden layers we attach to the network; all layers would behave in the same way, because the composition of two linear functions is itself a linear function.
The neural network would be no more expressive than a single layer: learning any complex task would be impossible, and our model would be nothing more than a linear regression model.
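To see this collapse concretely, here is a minimal NumPy sketch (the layer sizes and random weights are just an illustration): two stacked layers with no activation function are exactly equivalent to a single linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "hidden layers" with no activation function: only weights and biases.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

# Forward pass through the two linear layers.
two_layer_output = W2 @ (W1 @ x + b1) + b2

# The same mapping collapses into one linear layer W x + b.
W = W2 @ W1
b = W2 @ b1 + b2
single_layer_output = W @ x + b

print(np.allclose(two_layer_output, single_layer_output))  # True
```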
Non-Linear Activation Functions
A linear activation function has two major problems:
- Backpropagation becomes ineffective, because the derivative of the function is a constant and carries no information about the input.
- All layers of the neural network collapse into one if a linear activation function is used. No matter how many layers the network has, the last layer is still a linear function of the first, because a linear function of a linear function is itself linear.
Non-linear activation functions solve the following limitations of linear activation functions:
- They allow backpropagation to work properly, because the derivative now depends on the input, so it is possible to go back and understand which weights in the earlier neurons should change to give a better prediction.
- They allow stacking multiple layers of neurons, since the output is now a non-linear combination of the input passed through several layers. This is what lets a neural network represent essentially any function of its inputs.
Sigmoid function
σ(x) = 1/(1+exp(-x))
This function takes any real value as input and outputs values in the range of 0 to 1.
The larger the input (towards positive infinity), the closer the output value will be to 1.0, whereas the smaller the input (towards negative infinity), the closer the output will be to 0.0
It is commonly used for models where we have to predict the probability as an output. Since the probability of anything exists only between the range of 0 and 1, sigmoid is the right choice because of its range.
Cons:
- The derivative of the function is f’(x) = sigmoid(x)*(1-sigmoid(x)), which is at most 0.25 (at x = 0).
- For larger values of x (positive or negative), the gradient approaches zero, which leads to the vanishing gradient problem (discussed in more detail below, and illustrated in the sketch that follows).
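Here is a small NumPy sketch of the sigmoid and its derivative (the sample inputs are arbitrary), showing how the gradient collapses towards zero as |x| grows.

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # f'(x) = sigmoid(x) * (1 - sigmoid(x)); its maximum is 0.25 at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [0.0, 2.0, 5.0, 10.0]:
    print(x, sigmoid(x), sigmoid_derivative(x))
# The derivative shrinks towards 0 as |x| grows: this saturation is what
# causes vanishing gradients in deep networks.
```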
Tanh Function
Tanh(x) = (exp(x)-exp(-x))/(exp(x)+exp(-x))
Tanh is also like the logistic sigmoid, but better. The range of the tanh function is (-1, 1), and tanh is also sigmoidal (s-shaped).
The advantage is that negative inputs are mapped to strongly negative outputs and inputs near zero are mapped to outputs near zero.
Cons:
- It also faces the vanishing gradient problem, similar to the sigmoid activation function. In addition, the gradient of the tanh function is much steeper than that of the sigmoid, as the sketch below illustrates.
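A quick NumPy sketch of tanh and its gradient 1 - tanh(x)^2 (sample inputs chosen arbitrarily) shows both points: the outputs are centred around zero, and the gradient still saturates for large |x|.

```python
import numpy as np

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])

# tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x)); NumPy provides it directly.
print(np.tanh(x))            # outputs lie in (-1, 1), centred around 0

# The gradient of tanh is 1 - tanh(x)^2, which saturates for large |x|,
# just like the sigmoid derivative.
print(1.0 - np.tanh(x) ** 2)
```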
ReLU Function
ReLU(x) = max(0, x)
The ReLU function is far more computationally efficient than the sigmoid and tanh functions, since it only requires a simple comparison and, with negative inputs zeroed out, only a portion of the neurons are active at any time.
It accelerates the convergence of gradient descent towards a minimum of the loss function thanks to its linear, non-saturating behaviour for positive inputs.
Cons:
The derivative of ReLU(x) is 1 for x>0 and 0 for x≤0.
This gives rise to the dying ReLU problem: for input values less than or equal to zero the gradient is zero, so during backpropagation the weights and biases of those neurons are never updated. This can create dead neurons that never get activated.
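A short NumPy sketch (sample inputs chosen arbitrarily) shows both the sparsity and the dying-ReLU behaviour: negative inputs produce zero output and zero gradient.

```python
import numpy as np

def relu(x):
    # ReLU(x) = max(0, x)
    return np.maximum(0.0, x)

def relu_grad(x):
    # Derivative: 1 for x > 0, 0 for x <= 0
    return (x > 0).astype(float)

x = np.array([-5.0, -0.5, 0.0, 0.5, 5.0])
print(relu(x))       # negative inputs are zeroed out
print(relu_grad(x))  # zero gradient for x <= 0, so no update flows through
                     # these units: this is the dying ReLU problem
```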
Leaky ReLU Function
LeakyReLU(x) = max(0.01*x, x)
Leaky ReLU is an improved version of the ReLU function to solve the Dying ReLU problem as it has a small positive slope in the negative area.
The derivative of LeakyReLU(x) is 1 for x>0 and 0.01 for x≤0.
The advantages of Leaky ReLU are the same as those of ReLU, with the addition that gradients can still flow during backpropagation even for negative input values.
By making this minor modification for negative input values, the gradient of the left side of the graph comes out to be a non-zero value. Therefore, we would no longer encounter dead neurons in that region.
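For comparison, here is a minimal NumPy sketch of Leaky ReLU and its derivative (the 0.01 slope and sample inputs follow the definition above).

```python
import numpy as np

def leaky_relu(x, negative_slope=0.01):
    # max(negative_slope * x, x): a small non-zero slope for x <= 0
    return np.where(x > 0, x, negative_slope * x)

def leaky_relu_grad(x, negative_slope=0.01):
    # Derivative: 1 for x > 0, negative_slope for x <= 0
    return np.where(x > 0, 1.0, negative_slope)

x = np.array([-5.0, -0.5, 0.0, 0.5, 5.0])
print(leaky_relu(x))       # negative inputs are scaled down, not zeroed
print(leaky_relu_grad(x))  # the gradient never hits zero, so units cannot "die"
```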
Swish
Swish(x) = x/(1+exp(-x))
Swish(x) = x*sigmoid(x)
It is a self-gated activation function developed by researchers at Google.
Swish consistently matches or outperforms the ReLU activation function on deep networks applied to various challenging domains such as image classification, machine translation, etc.
This function is bounded below but unbounded above, i.e. y approaches a constant value as x approaches negative infinity, but y approaches infinity as x approaches infinity.
Here are a few advantages of the Swish activation function over ReLU:
- Swish is a smooth function, which means it does not abruptly change direction near x = 0 the way ReLU does. Rather, it bends smoothly downwards for slightly negative inputs and then back upwards again.
- ReLU zeroes out all negative values, yet small negative values may still be relevant for capturing patterns underlying the data. Swish keeps small negative outputs while still pushing large negative values towards zero for the sake of sparsity, making it a win-win (illustrated in the sketch after this list).
- Being non-monotonic, Swish enhances the expressiveness of the learnt representation of the input data and weights.
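A minimal NumPy sketch of Swish (sample inputs are arbitrary) shows this behaviour: large negative inputs are squashed towards zero while small negative inputs keep small negative outputs.

```python
import numpy as np

def swish(x):
    # Swish(x) = x * sigmoid(x) = x / (1 + exp(-x))
    return x / (1.0 + np.exp(-x))

x = np.array([-10.0, -1.0, -0.5, 0.0, 0.5, 1.0, 10.0])
print(swish(x))
# Large negative inputs are pushed towards 0, small negative inputs keep
# small negative outputs, and the curve is smooth everywhere.
```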
Exponential Linear Units (ELUs) Function
ELU(x) = x, for x≥0
ELU(x) = alpha*(exp(x)-1), for x<0
Exponential Linear Unit, or ELU for short, is also a variant of ReLU that modifies the slope of the negative part of the function.
ELU uses an exponential curve to define the negative values, unlike the Leaky ReLU and Parametric ReLU functions, which use a straight line.
ELU is a strong alternative to ReLU because of the following advantages:
- ELU bends smoothly until its output saturates at -α, whereas ReLU has a sharp corner at zero.
- It avoids the dead ReLU problem: the exponential curve for negative inputs keeps the gradient non-zero, which helps the network nudge weights and biases in the right direction.
The limitations of the ELU function are as follows:
- It increases the computation time because of the exponential operation involved
- The value of α is not learnt
- It can suffer from the exploding gradient problem
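Here is a minimal NumPy sketch of ELU (using α = 1 purely for illustration), showing the smooth saturation towards -α for negative inputs.

```python
import numpy as np

def elu(x, alpha=1.0):
    # x for x >= 0, alpha * (exp(x) - 1) for x < 0
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(elu(x))
# Negative outputs saturate smoothly towards -alpha instead of being cut to 0.
```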
Scaled Exponential Linear Unit (SELU)
SELU(x) = lambda*x, for x≥0
SELU(x) = lambda*(alpha*(exp(x)-1)), for x<0
SELU was defined in self-normalizing networks and takes care of internal normalization which means each layer preserves the mean and variance from the previous layers. SELU enables this normalization by adjusting the mean and variance.
SELU has both positive and negative values to shift the mean, which was impossible for the ReLU activation function as it cannot output negative values.
Gradients can be used to adjust the variance. The activation function needs a region with a gradient larger than one to increase it.
SELU has predefined values for alpha (α) and lambda (λ).
Here’s the main advantage of SELU over ReLU:
- Internal normalization is faster than external normalization, which means the network converges faster.
SELU is a relatively new activation function, and it needs more study on architectures such as CNNs and RNNs, where it is comparatively less explored.
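For reference, here is a minimal NumPy sketch of SELU using the α and λ constants from the original self-normalizing networks paper; applied to roughly standard-normal inputs, the outputs stay close to zero mean and unit variance.

```python
import numpy as np

# Constants fixed in the SELU paper (Klambauer et al., 2017).
ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805

def selu(x, alpha=ALPHA, lam=LAMBDA):
    # lambda * x for x >= 0, lambda * alpha * (exp(x) - 1) for x < 0
    return lam * np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

# Feeding standard-normal activations through SELU keeps them roughly at
# zero mean and unit variance, which is the self-normalizing property.
x = np.random.default_rng(0).normal(size=100_000)
y = selu(x)
print(y.mean(), y.std())
```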
Why are deep neural networks hard to train?
There are two challenges you might encounter when training your deep neural networks.
Let’s discuss them in more detail.
Vanishing Gradients
Certain activation functions, like the sigmoid function, squash a large input space into a small output space between 0 and 1.
Therefore, a large change in the input of the sigmoid function causes only a small change in the output, so the derivative becomes small. For shallow networks with only a few layers that use these activations, this isn’t a big problem.
However, when more layers are used, it can cause the gradient to be too small for training to work effectively.
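A tiny numerical sketch makes this concrete: the sigmoid derivative is at most 0.25, so the gradient that reaches the early layers shrinks roughly geometrically with depth (the depths below are arbitrary examples).

```python
# Each sigmoid layer contributes a local derivative of at most 0.25, so the
# gradient reaching the early layers shrinks geometrically with depth.
max_sigmoid_grad = 0.25
for depth in [2, 5, 10, 20]:
    print(depth, max_sigmoid_grad ** depth)
# At depth 20 the factor is about 9e-13, far too small to drive learning.
```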
Exploding Gradients
Exploding gradients are a problem in which large error gradients accumulate and result in very large updates to the neural network’s weights during training.
An unstable network can result when there are exploding gradients, and the learning cannot be completed.
The values of the weights become so large as to overflow and result in something called NaN values.
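A tiny numerical sketch (the per-layer factor of 1.5 is an arbitrary example) shows how quickly this blows up with depth.

```python
# A per-layer gradient factor only slightly above 1 still explodes with depth.
factor = 1.5
for depth in [10, 50, 100, 200]:
    print(depth, factor ** depth)
# At depth 200 the magnitude is about 1.6e35; a few more layers push it past
# the float32 range, giving inf, and combining inf values (e.g. inf - inf)
# produces NaN.
```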