Activation functions


Concatenated ReLU.

f(x) = \text{concat}(\text{ReLU}(x), \text{ReLU}(-x))

Using the CReLU doubles the size of the input to the next layer, increasing the number of parameters. However, Shang et al. showed that CReLU can improve accuracy on image recognition tasks when used for the lower convolutional layers, even when the number of filters in those layers is halved at the same time.
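A minimal NumPy sketch of the CReLU definition, showing how the feature axis doubles in size. The function name `crelu` and the choice of concatenation axis are illustrative assumptions, not taken from Shang et al.

```python
import numpy as np

def crelu(x, axis=-1):
    """Concatenate ReLU(x) and ReLU(-x) along the given axis."""
    return np.concatenate([np.maximum(x, 0.0), np.maximum(-x, 0.0)], axis=axis)

x = np.array([[1.0, -2.0, 0.5]])
y = crelu(x)  # the last axis grows from 3 to 6 entries
```

The positive and negative parts of each input are kept in separate channels, so no sign information is discarded.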


Exponential Linear Unit.

f(x) = \begin{cases}
  x, & x > 0 \\
  \alpha (\exp(x) - 1), & x \leq 0
\end{cases}


In practice the hyperparameter \alpha is usually set to 1.

Compared to ReLUs, ELUs have a mean activation closer to zero, which can speed up learning. However, this advantage is probably nullified by batch normalization.

The more gradual decrease of the gradient should also make them less susceptible to the dying ReLU problem, although they will suffer from the vanishing gradients problem instead.
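The following sketch evaluates the ELU and its derivative in NumPy to illustrate the point about gradients: for negative inputs the gradient is \alpha \exp(x), which is never exactly zero but becomes very small. The function names are hypothetical.

```python
import numpy as np

def elu(x, alpha=1.0):
    """ELU: identity for x > 0, alpha * (exp(x) - 1) otherwise."""
    x = np.asarray(x, dtype=float)
    # expm1 is accurate near zero; the minimum keeps the unused branch from overflowing.
    return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0.0)))

def elu_grad(x, alpha=1.0):
    """Derivative of ELU: 1 for x > 0, alpha * exp(x) otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, 1.0, alpha * np.exp(np.minimum(x, 0.0)))

x = np.array([-10.0, -1.0, 0.0, 2.0])
values = elu(x)
grads = elu_grad(x)  # positive everywhere, but tiny at x = -10
```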


Gaussian Error Linear Unit. The name comes from the use of the Gaussian error function in the definition:

f(x) = x \Phi(x)

where \Phi(x) is the CDF of the standard normal distribution.

It can be approximated using the sigmoid function \sigma as:

f(x) \approx x \sigma (1.702 x)


This can be seen as a smoothed version of the ReLU.
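As a rough check, the sketch below compares x \Phi(x), written with the error function from Python's standard library, against the sigmoid approximation on a grid of points. The grid and the function names are arbitrary choices for illustration.

```python
import math

def gelu_exact(x):
    """x * Phi(x), with Phi expressed through the error function."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_sigmoid(x):
    """Sigmoid approximation x * sigma(1.702 x)."""
    return x / (1.0 + math.exp(-1.702 * x))

xs = [i / 10.0 for i in range(-40, 41)]  # grid on [-4, 4]
max_gap = max(abs(gelu_exact(x) - gelu_sigmoid(x)) for x in xs)
```

On this grid the two curves stay within a few hundredths of each other. Note also that `gelu_exact(-1.0)` is lower than `gelu_exact(-3.0)`, so the function is not monotonic for negative inputs.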

It was found to improve performance on a variety of tasks compared to ReLU and ELU (Hendrycks and Gimpel (2016)). The authors speculate that the activation’s curvature and non-monotonicity may help it to model more complex functions.


Leaky ReLU. Motivated by the desire to have gradients where the ReLU would have none. However, these gradients are very small and therefore vulnerable to the vanishing gradients problem in deep networks. The improvement in accuracy from using LReLU instead of ReLU has been shown to be very small (Maas et al. (2013)).

f(x) = \max\{ax,x\}

a is a fixed hyperparameter, unlike in the PReLU, where it is learned. A common setting is 0.01.
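A short NumPy sketch of the Leaky ReLU and its slope, assuming 0 < a < 1 so that \max\{ax, x\} selects ax for negative inputs. The function names are hypothetical.

```python
import numpy as np

def leaky_relu(x, a=0.01):
    """max(a * x, x), valid for 0 < a < 1."""
    return np.maximum(a * x, x)

def leaky_relu_grad(x, a=0.01):
    """Slope is 1 for positive inputs and a otherwise."""
    return np.where(x > 0, 1.0, a)

x = np.array([-3.0, -0.5, 0.0, 2.0])
y = leaky_relu(x)
g = leaky_relu_grad(x)  # small but nonzero slope on the negative side
```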



Maxout. An activation function designed to be used with dropout.

f(x) = \max_{j \in [1,k]} (x^T W_j + b_j)

where k is a hyperparameter.

Maxout can be a piecewise linear approximation for arbitrary convex activation functions. This means it can approximate ReLU, LReLU, ELU and linear activations but not tanh or sigmoid.
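To make the piecewise linear view concrete, here is a small NumPy sketch of a Maxout layer. The tensor layout (k affine pieces stacked on the first axis of `W`) is an assumption made for illustration. With k = 2, one identity piece and one zero piece, it reproduces the ReLU.

```python
import numpy as np

def maxout(x, W, b):
    """Maxout layer.

    x: (batch, d_in), W: (k, d_in, d_out), b: (k, d_out).
    Returns the elementwise max over the k affine pieces, shape (batch, d_out).
    """
    z = np.einsum("bi,kio->kbo", x, W) + b[:, None, :]
    return z.max(axis=0)

x = np.array([[-2.0], [0.5], [3.0]])
W = np.array([[[1.0]], [[0.0]]])  # piece 1 is the identity, piece 2 is zero
b = np.zeros((2, 1))
y = maxout(x, W, b)               # equals max(x, 0)
```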

It was used to achieve state-of-the-art performance on MNIST, SVHN, CIFAR-10 and CIFAR-100.


Parametric ReLU.


f(x) = \max\{ax,x\}

where a is a learned parameter, unlike in the Leaky ReLU where it is fixed.

It was used to achieve state-of-the-art performance on ImageNet (He et al. (2015)).


Rectified Linear Unit. Unlike the sigmoid or tanh activations, the ReLU does not saturate for positive inputs, which has led to it being widely used in deep networks.

f(x) = \max\{0,x\}



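The fact that the gradient is 1 when the input is positive means the activation itself neither shrinks nor amplifies gradients flowing through active units, which makes it less prone to vanishing and exploding gradients than saturating activations. However, it suffers from its own ‘dying ReLU problem’ instead.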

The Dying ReLU Problem

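When the input to a neuron is negative, the gradient will be zero. This means that gradient descent will not update the weights so long as the input remains negative. If this holds for every training example, the neuron stops learning altogether. A smaller learning rate helps by making it less likely that a large update pushes a neuron into this regime.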

The Leaky ReLU and the Parametric ReLU (PReLU) attempt to solve this problem by using f(x)=\max\{ax,x\} where a is small, for example 0.01. In the Leaky ReLU a is a fixed constant, while in the PReLU it is learned. However, this small gradient when the input is negative means vanishing gradients are once again a problem.
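The dying ReLU problem can be illustrated with a single neuron whose bias is so negative that every pre-activation is negative. The data, weights and bias below are made up for the example.

```python
import numpy as np

def relu_grad(x):
    """Derivative of ReLU (taken to be 0 at x = 0)."""
    return (x > 0).astype(float)

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(100, 3))  # 100 hypothetical training inputs
w, b = np.array([0.5, -0.2, 0.1]), -5.0    # bias pushes every input below zero
z = X @ w + b

# Gradient of sum(relu(z)) with respect to w: zero for every example,
# so gradient descent never moves w and the neuron stays dead.
grad_w = (relu_grad(z)[:, None] * X).sum(axis=0)
```

Swapping `relu_grad` for a leaky slope would give a nonzero, though small, gradient in the same situation.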


Scaled Exponential Linear Unit.

f(x) = \lambda \begin{cases}
  x, & x > 0 \\
  \alpha (\exp(x) - 1), & x \leq 0
\end{cases}

Where \lambda and \alpha are hyperparameters, fixed to approximately \lambda = 1.0507 and \alpha = 1.6733.


The SELU is designed to be used in networks composed of many fully-connected layers, as opposed to CNNs or RNNs. The principal difference is that CNNs and RNNs stabilize their learning via weight sharing. As with batch normalization, SELU activations give rise to activations with zero mean and unit variance, but without having to normalize explicitly.
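The zero-mean, unit-variance property can be checked empirically. The sketch below applies the SELU, with the rounded constants \lambda = 1.0507 and \alpha = 1.6733, to standard normal samples. The sample size and seed are arbitrary.

```python
import numpy as np

LAMBDA = 1.0507
ALPHA = 1.6733

def selu(x):
    """lambda * x for x > 0, lambda * alpha * (exp(x) - 1) otherwise."""
    x = np.asarray(x, dtype=float)
    return LAMBDA * np.where(x > 0, x, ALPHA * np.expm1(np.minimum(x, 0.0)))

rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000)  # inputs with zero mean and unit variance
out = selu(z)
mean, var = out.mean(), out.var()   # both stay close to 0 and 1 respectively
```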

The ELU is a very similar activation. The only difference is that it has \lambda = 1 and \alpha = 1.


Klambauer et al. (2017) recommend initialising layers with SELU activations according to \theta^{(i)} \sim N(0, \sqrt{1/n_i}), where the second argument is the standard deviation, \theta^{(i)} are the parameters for layer i of the network and n_i is the size of layer i of the network.
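A sketch of this initialisation in NumPy. It assumes that n_i is read as the number of inputs to the layer and that \sqrt{1/n_i} is a standard deviation. The helper name `selu_init` and the layer sizes are hypothetical.

```python
import numpy as np

def selu_init(n_in, n_out, rng):
    """Draw an (n_in, n_out) weight matrix with mean 0 and std sqrt(1 / n_in)."""
    return rng.normal(loc=0.0, scale=np.sqrt(1.0 / n_in), size=(n_in, n_out))

rng = np.random.default_rng(0)
W = selu_init(512, 256, rng)
x = rng.standard_normal((1000, 512))  # unit-variance inputs
z = x @ W                             # pre-activations keep roughly unit variance
```

Under these assumptions each pre-activation is a sum of n_in terms with variance 1/n_in, which is why its variance stays near 1.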


Instead of randomly setting units to zero as in conventional dropout, the authors propose setting units to \alpha ' = -\lambda \alpha where \lambda and \alpha are the hyperparameters given previously. They refer to this as alpha dropout.
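A possible sketch of alpha dropout follows. Setting dropped units to \alpha' shifts the mean and changes the variance, so the sketch also applies an affine correction a y + b. This correction is not described above: it is derived here from requiring zero mean and unit variance for zero-mean, unit-variance inputs, and should be read as an assumption of the example.

```python
import numpy as np

LAMBDA, ALPHA = 1.0507, 1.6733
ALPHA_PRIME = -LAMBDA * ALPHA  # value assigned to dropped units

def alpha_dropout(x, keep_prob, rng):
    """Set dropped units to ALPHA_PRIME, then rescale to zero mean and unit variance."""
    q = keep_prob
    keep = rng.random(x.shape) < q
    dropped = np.where(keep, x, ALPHA_PRIME)
    # For zero-mean, unit-variance x:
    #   mean(dropped) = (1 - q) * alpha'
    #   var(dropped)  = q + alpha'^2 * q * (1 - q)
    a = 1.0 / np.sqrt(q + ALPHA_PRIME**2 * q * (1.0 - q))
    b = -a * (1.0 - q) * ALPHA_PRIME
    return a * dropped + b

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
y = alpha_dropout(x, keep_prob=0.9, rng=rng)
```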


Sigmoid. Activation function that maps its input to an output between 0 and 1.

f(x) = \frac{e^x}{e^x + 1}


It saturates for inputs of large magnitude, where the gradient is close to zero. This makes vanishing gradients a problem and initialization extremely important.
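The saturation of f(x) = e^x / (e^x + 1) can be seen by evaluating its derivative, f(x)(1 - f(x)), at a few points. The sketch uses a numerically stable formulation. The function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    """Numerically stable e^x / (e^x + 1)."""
    x = np.asarray(x, dtype=float)
    e = np.exp(-np.abs(x))  # always in (0, 1], so it cannot overflow
    return np.where(x >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-20.0, 0.0, 20.0])
g = sigmoid_grad(x)  # 0.25 at x = 0, close to zero at |x| = 20
```

The derivative never exceeds 0.25, so gradients shrink each time they pass through a sigmoid.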


Softmax. All entries in the output vector are in the range (0,1) and sum to 1, making the result a valid probability distribution.

f(x)_j = \frac{e^{x_j}}{\sum_{k=1}^K e^{x_k}}, \quad j \in \{1,\dots,K\}

Where x is a vector of length K. This vector is often referred to as the logits.

Unlike most other activation functions, the softmax does not apply the same function to each item in the input independently. The requirement that the output vector sums to 1 means that if one of the inputs is increased the others must decrease in the output.
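The coupling between outputs can be seen in a small NumPy sketch. Subtracting the maximum before exponentiating is a common stability trick that leaves the result unchanged. The example logits are arbitrary.

```python
import numpy as np

def softmax(x):
    """Softmax over the last axis, shifted by the max to avoid overflow."""
    x = np.asarray(x, dtype=float)
    shifted = x - x.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([1.0, 2.0, 3.0])
p = softmax(logits)
p_bumped = softmax(logits + np.array([0.0, 0.0, 1.0]))  # raise only the last logit
```

Raising the last logit increases its probability and lowers the probabilities of the other two, even though their logits did not change.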

The Softmax Bottleneck

A theorised problem that occurs when using the softmax to predict the next token in language modeling. It views language modeling as a matrix factorization problem:

HW^T = A

Where the rows of H are the context vectors, the rows of W are the word vectors and A contains the log conditional probabilities for words given contexts. The rank of HW^T is at most the dimensionality d of the word embeddings. The vast number of contexts in language means that the matrix A is almost certainly high rank, so d is probably not sufficient to solve the matrix factorization problem adequately.
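The rank limitation behind this argument is easy to check numerically: however many contexts and words there are, the product HW^T cannot have rank greater than the embedding dimension. The sizes below are hypothetical and the matrices are random.

```python
import numpy as np

rng = np.random.default_rng(0)
n_contexts, vocab, d = 200, 50, 8  # hypothetical sizes

H = rng.standard_normal((n_contexts, d))  # one context vector per row
W = rng.standard_normal((vocab, d))       # one word vector per row
logits = H @ W.T                          # shape (n_contexts, vocab)

rank = np.linalg.matrix_rank(logits)      # bounded above by d
```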

Mixture of Softmaxes

Mixture model intended to avoid the Softmax Bottleneck. The probability of a word x given some context c is the weighted average of K softmax distributions:

P(x|c) = \sum_{k=1}^K \pi_{ck} \frac{\exp(h_{ck}^T w_{x})}{\sum_{x'} \exp(h_{ck}^T w_{x'})} \quad \text{s.t.} \quad \sum_{k=1}^K \pi_{ck} = 1

where \pi_{ck} is the weight of component k.
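A minimal sketch of the mixture for a single context, in NumPy. Producing the mixture weights \pi_{ck} with a softmax over unnormalised scores is an assumption made here so that they are positive and sum to 1. The function and argument names are hypothetical.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def mixture_of_softmaxes(h, pi_scores, W):
    """P(x | c) for one context.

    h: (K, d) component context vectors h_{ck}
    pi_scores: (K,) unnormalised mixture scores
    W: (V, d) word vectors
    """
    pi = softmax(pi_scores)        # (K,), sums to 1
    components = softmax(h @ W.T)  # (K, V), one softmax per component
    return pi @ components         # (V,), weighted average of the K distributions

rng = np.random.default_rng(0)
K, d, V = 3, 4, 10
p = mixture_of_softmaxes(
    rng.standard_normal((K, d)), rng.standard_normal(K), rng.standard_normal((V, d))
)
```

Because the averaging happens in probability space rather than in logit space, the resulting log probabilities are no longer restricted to a rank-d matrix.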


Softplus. Activation whose output is bounded between 0 and infinity, making it useful for modeling quantities that should never be negative, such as the variance of a distribution.

f(x) = \log(1 + e^x)


Unlike the ReLU, gradients can pass through the softplus when x < 0.
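The derivative of f(x) = \log(1 + e^x) is the sigmoid, which is positive for every x. The sketch below evaluates both with NumPy, using `np.logaddexp` to avoid overflow for large inputs. The function names are hypothetical.

```python
import numpy as np

def softplus(x):
    """log(1 + e^x), computed as logaddexp(0, x) for numerical stability."""
    return np.logaddexp(0.0, x)

def softplus_grad(x):
    """Derivative of softplus, i.e. the sigmoid, written via tanh for stability."""
    return 0.5 * (1.0 + np.tanh(0.5 * np.asarray(x, dtype=float)))

x = np.array([-5.0, 0.0, 5.0, 1000.0])
y = softplus(x)       # strictly positive, close to x for large x
g = softplus_grad(x)  # nonzero even for negative inputs
```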


Tanh. Activation function that is used in the GRU and LSTM. Its output is between -1 and 1 and centered around 0, unlike the sigmoid.

f(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}


Like the sigmoid, it saturates for inputs of large magnitude, where the gradient is close to zero. This makes vanishing gradients a problem and initialization extremely important.