Activation Functions

A neural network without activation functions is just a linear regression with extra steps. The nonlinearity between layers is what turns a stack of matrix multiplies into a universal function approximator. Here's every activation worth knowing, and why the field moved from sigmoid to ReLU to GELU.

Prereq: basic calculus · Time to read: ~18 min · Interactive figures: 1 · Code: NumPy, PyTorch, JAX, CUDA

1. Why you need them at all

Consider a neural network with no activation functions — just a stack of matrix multiplications:

$$f(\mathbf{x}) = W_3 \, (W_2 \, (W_1 \, \mathbf{x}))$$

Matrix multiplication is associative, so this is the same as:

$$f(\mathbf{x}) = (W_3 W_2 W_1) \, \mathbf{x} = W_{\text{combined}} \, \mathbf{x}$$

Three layers collapsed into one. No matter how many linear layers you stack, the result is still a single linear transformation — you've burned compute for nothing. A linear map can only draw straight decision boundaries, only fit linear functions, only separate linearly separable classes. Without nonlinearity, "deep learning" is a grandiose name for logistic regression.
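The collapse is easy to verify numerically. A sketch with small hand-picked weights (shapes chosen arbitrarily for illustration):

```python
import numpy as np

W1 = np.array([[1.0, -2.0], [3.0, 0.5]])
W2 = np.array([[0.5, 1.0], [-1.0, 2.0]])
W3 = np.array([[2.0, -1.0]])
x  = np.array([1.0, 1.0])

# Three "layers" applied in sequence...
deep = W3 @ (W2 @ (W1 @ x))
# ...equal one precomputed matrix applied once.
W_combined = W3 @ W2 @ W1
assert np.allclose(deep, W_combined @ x)

# Insert a nonlinearity between the layers and the equivalence breaks.
relu = lambda z: np.maximum(0, z)
nonlinear = W3 @ relu(W2 @ relu(W1 @ x))
assert not np.allclose(nonlinear, W_combined @ x)

print(deep, nonlinear)   # deep = [-2.], nonlinear = [0.]
```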

An activation function is a simple nonlinear function $\phi: \mathbb{R} \to \mathbb{R}$ applied elementwise between layers:

$$f(\mathbf{x}) = W_3 \, \phi(W_2 \, \phi(W_1 \, \mathbf{x}))$$

Now the composition is genuinely nonlinear and can no longer be collapsed. In fact, with just one hidden layer and enough units, such a network can approximate any continuous function on a compact domain — the universal approximation theorem (Cybenko 1989, Hornik 1991). But not all activations are created equal; the choice has outsized effects on trainability, convergence speed, and final accuracy.

2. Step / Heaviside

$$\text{step}(z) = \begin{cases} 1 & z > 0 \\ 0 & z \le 0 \end{cases}$$

The step function at a glance

$z$
The pre-activation — the weighted sum $\mathbf{w} \cdot \mathbf{x} + b$ coming into a neuron.
$\text{step}(z)$
Output: 1 if the neuron "fires," 0 otherwise. Models a biological neuron's all-or-nothing spike.

Analogy A light switch: below threshold, off; above threshold, on. Used by the perceptron (1957). The reason it fell out of favor: the derivative is 0 almost everywhere (and undefined at 0), so gradient-based learning can't work at all. The perceptron learned via a hand-crafted rule, not backprop. When backprop came along, the step function was first on the chopping block.
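Backprop's problem with the step function is visible in two lines. A minimal NumPy sketch of the unit and its gradient:

```python
import numpy as np

def step(z):
    # All-or-nothing: 1 where the pre-activation is positive, else 0.
    return (z > 0).astype(float)

def step_deriv(z):
    # Zero almost everywhere (undefined at z = 0): backprop gets no signal.
    return np.zeros_like(np.asarray(z, dtype=float))

z = np.array([-2.0, -0.1, 0.0, 0.1, 2.0])
print(step(z))        # [0. 0. 0. 1. 1.]
print(step_deriv(z))  # [0. 0. 0. 0. 0.]
```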

3. Sigmoid & tanh

$$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$

Sigmoid & tanh — the classic smooth gates

$\sigma(z)$
Logistic sigmoid. Squashes any real number into $(0, 1)$. Interpretable as a probability, which is why it's still used in the final layer of binary classifiers.
$\tanh(z)$
Hyperbolic tangent. Squashes into $(-1, 1)$. Equivalent to a rescaled, shifted sigmoid: $\tanh(z) = 2\sigma(2z) - 1$. Its outputs are zero-centered, so the next layer's inputs stay balanced around 0 — a real optimization advantage over sigmoid's all-positive outputs.
$\sigma'(z) = \sigma(z)(1 - \sigma(z))$
Derivative is maximized at $z = 0$, where $\sigma'(0) = 0.25$. At $|z| = 5$, it's already down to about $0.007$. This is where the "vanishing gradient" problem comes from: stack 10 sigmoid layers and gradients multiply by at most $0.25^{10} \approx 10^{-6}$.
$\tanh'(z) = 1 - \tanh^2(z)$
Maximized at 0 where $\tanh'(0) = 1$. A little healthier than sigmoid, but still saturates for large $|z|$.
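The rescaling identity is worth a quick numerical check — a two-line sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# tanh(z) = 2*sigmoid(2z) - 1, verified across a range of inputs
z = np.linspace(-4, 4, 9)
assert np.allclose(np.tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0)
```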

Analogy A sigmoid is like a thermostat that slowly warms up as the input rises and plateaus near the max. Fine for one or two layers. Stack six of them and by the time a gradient signal has passed through, it's been squeezed toward zero so many times that the early layers receive almost no information about how to improve. This is why deep networks were stuck at 2–3 layers throughout the 1990s and early 2000s.

THE VANISHING GRADIENT, MADE EXPLICIT

If every layer's pre-activation sits near the saturated region of the sigmoid, the backprop signal through that layer is multiplied by roughly 0.01. Ten layers later: $10^{-20}$. No finite learning rate can make $10^{-20}$ gradients do anything. The weights in the early layers never budge. The network "doesn't train." Every deep learning practitioner in 2010 knew this pain intimately.
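The same arithmetic as a sketch — a toy chain of 10 sigmoid layers with pre-activations pinned at $z = 0$ (best case) versus $z = 5$ (saturated):

```python
import numpy as np

def sigmoid_deriv(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

# Best case: every layer sits at z = 0, the derivative's maximum (0.25).
best = sigmoid_deriv(0.0) ** 10
# Saturated case: every layer sits at z = 5.
saturated = sigmoid_deriv(5.0) ** 10

print(f"best case:  {best:.2e}")       # ~9.5e-07
print(f"saturated:  {saturated:.2e}")  # ~1.7e-22
```

Even in the best case, ten sigmoid layers shrink the gradient by a factor of a million; in the saturated regime it is effectively zero.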

Source — Sigmoid & Tanh
// Sigmoid: squashes R -> (0, 1)
sigmoid(x)  = 1 / (1 + exp(-x))
sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))

// Tanh: squashes R -> (-1, 1)
tanh(x)     = (exp(x) - exp(-x)) / (exp(x) + exp(-x))
            = 2 * sigmoid(2*x) - 1
tanh'(x)    = 1 - tanh(x)^2

// Numerically stable sigmoid for very negative x:
//   sigmoid(x) = exp(x) / (1 + exp(x))  when x < 0
import numpy as np

def sigmoid(x):
    """Numerically stable sigmoid."""
    return np.where(x >= 0,
                    1.0 / (1.0 + np.exp(-x)),
                    np.exp(x) / (1.0 + np.exp(x)))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh(x):
    return np.tanh(x)   # built-in is already stable

def tanh_deriv(x):
    return 1.0 - np.tanh(x) ** 2

# Quick check
x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print("sigmoid:", sigmoid(x).round(4))          # [0.0067, 0.2689, 0.5, 0.7311, 0.9933]
print("sigmoid':", sigmoid_deriv(x).round(4))  # [0.0066, 0.1966, 0.25, 0.1966, 0.0066]
print("tanh:", tanh(x).round(4))                # [-0.9999, -0.7616, 0., 0.7616, 0.9999]
print("tanh':", tanh_deriv(x).round(4))         # [0.0002, 0.42, 1., 0.42, 0.0002]
function sigmoid(x) {
  return x >= 0
    ? 1 / (1 + Math.exp(-x))
    : Math.exp(x) / (1 + Math.exp(x));   // stable for negative x
}

function sigmoidDeriv(x) {
  const s = sigmoid(x);
  return s * (1 - s);
}

function tanhFn(x) { return Math.tanh(x); }
function tanhDeriv(x) { return 1 - Math.tanh(x) ** 2; }

// Vectorised helpers
const sigmoidVec   = xs => xs.map(sigmoid);
const sigmoidDerivVec = xs => xs.map(sigmoidDeriv);
const tanhVec      = xs => xs.map(tanhFn);
const tanhDerivVec = xs => xs.map(tanhDeriv);
#include <math.h>

float sigmoid(float x) {
    if (x >= 0.f) return 1.f / (1.f + expf(-x));
    float ex = expf(x);
    return ex / (1.f + ex);
}

float sigmoid_deriv(float x) {
    float s = sigmoid(x);
    return s * (1.f - s);
}

float tanh_fn(float x)    { return tanhf(x); }
float tanh_deriv(float x) { float t = tanhf(x); return 1.f - t * t; }

/* Elementwise application to array */
void sigmoid_forward(const float *in, float *out, int n) {
    for (int i = 0; i < n; i++) out[i] = sigmoid(in[i]);
}
void tanh_forward(const float *in, float *out, int n) {
    for (int i = 0; i < n; i++) out[i] = tanhf(in[i]);
}
#include <cmath>
#include <vector>

inline float sigmoid(float x) {
    return x >= 0.f ? 1.f / (1.f + std::exp(-x))
                    : std::exp(x) / (1.f + std::exp(x));
}

inline float sigmoidDeriv(float x) { float s = sigmoid(x); return s * (1.f - s); }
inline float tanhFn(float x)       { return std::tanh(x); }
inline float tanhDeriv(float x)    { float t = std::tanh(x); return 1.f - t * t; }

std::vector<float> applySigmoid(const std::vector<float> &v) {
    std::vector<float> out(v.size());
    for (size_t i = 0; i < v.size(); i++) out[i] = sigmoid(v[i]);
    return out;
}
std::vector<float> applyTanh(const std::vector<float> &v) {
    std::vector<float> out(v.size());
    for (size_t i = 0; i < v.size(); i++) out[i] = std::tanh(v[i]);
    return out;
}
public class SigmoidTanh {
    public static float sigmoid(float x) {
        if (x >= 0) return 1f / (1f + (float) Math.exp(-x));
        float ex = (float) Math.exp(x);
        return ex / (1f + ex);
    }

    public static float sigmoidDeriv(float x) {
        float s = sigmoid(x);
        return s * (1f - s);
    }

    public static float tanh(float x)      { return (float) Math.tanh(x); }
    public static float tanhDeriv(float x) { float t = (float) Math.tanh(x); return 1f - t * t; }

    public static float[] sigmoidForward(float[] in) {
        float[] out = new float[in.length];
        for (int i = 0; i < in.length; i++) out[i] = sigmoid(in[i]);
        return out;
    }
}
package main

import "math"

func sigmoid(x float64) float64 {
    if x >= 0 {
        return 1 / (1 + math.Exp(-x))
    }
    ex := math.Exp(x)
    return ex / (1 + ex)
}

func sigmoidDeriv(x float64) float64 {
    s := sigmoid(x)
    return s * (1 - s)
}

func tanhFn(x float64) float64    { return math.Tanh(x) }
func tanhDeriv(x float64) float64 { t := math.Tanh(x); return 1 - t*t }

func sigmoidSlice(xs []float64) []float64 {
    out := make([]float64, len(xs))
    for i, x := range xs { out[i] = sigmoid(x) }
    return out
}

4. ReLU

$$\text{ReLU}(z) = \max(0, z)$$

The Rectified Linear Unit. Described by Nair & Hinton (2010) and popularized by AlexNet (2012). Offensively simple, and it broke deep learning wide open.

Why ReLU changed everything

$\text{ReLU}(z)$
Pass positive values through unchanged; clamp negatives to 0. Dead simple to compute. No exponentials, no divisions — just a sign check.
$\text{ReLU}'(z)$
$1$ for $z > 0$, $0$ for $z < 0$, undefined at $z = 0$ (usually set to $0$ by convention). Here's the magic: the gradient is 1 on the active side. No decay, no squashing. A gradient flowing back through 20 active ReLU layers comes out with its full magnitude.

Analogy Think of a sigmoid as a bureaucrat's rubber stamp that inks lighter and lighter as you push harder. ReLU is just "yes, pass" or "no, blocked." Every active unit ships the gradient at full strength. And because about half the units are zero at any given time, the representation is sparse — a form of built-in regularization. When AlexNet showed that a ReLU network could train in a few days on ImageNet and crush every sigmoid-based baseline, every paper after 2012 switched.

ReLU has one famous failure mode: the dying ReLU. If a unit's pre-activation is negative for every training example, its output is always 0, its gradient is always 0, and the weights feeding into it never update. The unit is dead, and the capacity it occupies is wasted. In practice this happens when the learning rate is too high or the weights are badly initialized, and it can permanently kill a significant fraction of a layer's units.
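A minimal illustration of a dead unit — toy random inputs, with a single unit given a large negative bias to stand in for a badly-initialized one:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 4))   # toy inputs
w = rng.normal(size=4)
b = -50.0                        # bias so negative the unit never fires

z = X @ w + b                    # pre-activations: all negative
assert (z < 0).all()

out  = np.maximum(0, z)          # ReLU output: identically zero
grad = (z > 0).astype(float)     # ReLU'(z): identically zero
print(out.sum(), grad.sum())     # 0.0 0.0 — no output, no learning signal
```

Because the gradient is zero on every example, no gradient step can ever move `w` or `b`: the unit stays dead for the rest of training.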

5. Leaky ReLU, PReLU, ELU

All three fix the dying ReLU by making the negative side nonzero.

$$\text{LeakyReLU}(z) = \begin{cases} z & z > 0 \\ \alpha z & z \le 0 \end{cases} \quad (\alpha = 0.01 \text{ typical})$$
$$\text{PReLU}(z) = \begin{cases} z & z > 0 \\ a \, z & z \le 0 \end{cases} \quad (a \text{ is learned})$$
$$\text{ELU}(z) = \begin{cases} z & z > 0 \\ \alpha (e^z - 1) & z \le 0 \end{cases}$$

Variant glossary

$\alpha$ (Leaky)
Small fixed slope on the negative side, usually $0.01$. Keeps the gradient nonzero so a unit can "wake up" if its pre-activation drifts.
$a$ (PReLU)
Parametric ReLU — the negative slope is itself a learned parameter, one per unit or one per channel. Adds a few more weights but often pays for itself.
ELU
Exponential Linear Unit (Clevert et al. 2015). Negative side is a smooth saturating curve, which pushes the mean activation closer to zero and helps with normalization. Smoother derivative at 0 than Leaky ReLU, and bounded negative output (can't go below $-\alpha$).
Source — ReLU & Variants
relu(x)             = max(0, x)
relu'(x)            = 1 if x > 0 else 0   // subgradient; 0 at x=0 by convention

leaky_relu(x, a=0.01) = x if x > 0 else a*x
leaky_relu'(x, a)     = 1 if x > 0 else a

elu(x, a=1.0)  = x              if x > 0
               = a * (exp(x)-1) if x <= 0
elu'(x, a)     = 1              if x > 0
               = elu(x, a) + a  if x <= 0   // = a*exp(x)
import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_deriv(x):
    return (x > 0).astype(float)   # subgradient; 0 at x=0

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def leaky_relu_deriv(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def elu_deriv(x, alpha=1.0):
    return np.where(x > 0, 1.0, alpha * np.exp(x))

# Demonstration
x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print("relu:", relu(x))
print("leaky:", leaky_relu(x))
print("elu:", elu(x).round(4))
function relu(x)      { return Math.max(0, x); }
function reluDeriv(x) { return x > 0 ? 1 : 0; }

function leakyRelu(x, alpha = 0.01) { return x > 0 ? x : alpha * x; }
function leakyReluDeriv(x, alpha = 0.01) { return x > 0 ? 1 : alpha; }

function elu(x, alpha = 1.0) {
  return x > 0 ? x : alpha * (Math.exp(x) - 1);
}
function eluDeriv(x, alpha = 1.0) {
  return x > 0 ? 1 : alpha * Math.exp(x);
}

// Vectorised
const reluVec      = xs => xs.map(relu);
const leakyReluVec = (xs, a = 0.01) => xs.map(x => leakyRelu(x, a));
const eluVec       = (xs, a = 1.0)  => xs.map(x => elu(x, a));
#include <math.h>

float relu(float x)      { return x > 0.f ? x : 0.f; }
float relu_deriv(float x){ return x > 0.f ? 1.f : 0.f; }

float leaky_relu(float x, float alpha) { return x > 0.f ? x : alpha * x; }
float leaky_relu_deriv(float x, float alpha) { return x > 0.f ? 1.f : alpha; }

float elu(float x, float alpha) {
    return x > 0.f ? x : alpha * (expf(x) - 1.f);
}
float elu_deriv(float x, float alpha) {
    return x > 0.f ? 1.f : alpha * expf(x);
}

void relu_forward(const float *in, float *out, int n) {
    for (int i = 0; i < n; i++) out[i] = in[i] > 0.f ? in[i] : 0.f;
}
#include <cmath>
#include <vector>
#include <algorithm>

inline float relu(float x)           { return std::max(0.f, x); }
inline float reluDeriv(float x)      { return x > 0.f ? 1.f : 0.f; }
inline float leakyRelu(float x, float a = 0.01f) { return x > 0.f ? x : a * x; }
inline float leakyReluDeriv(float x, float a = 0.01f) { return x > 0.f ? 1.f : a; }
inline float elu(float x, float a = 1.0f) {
    return x > 0.f ? x : a * (std::exp(x) - 1.f);
}
inline float eluDeriv(float x, float a = 1.0f) {
    return x > 0.f ? 1.f : a * std::exp(x);
}

std::vector<float> applyRelu(const std::vector<float> &v) {
    std::vector<float> out(v.size());
    for (size_t i = 0; i < v.size(); i++) out[i] = relu(v[i]);
    return out;
}
public class ReluVariants {
    public static float relu(float x)      { return Math.max(0f, x); }
    public static float reluDeriv(float x) { return x > 0f ? 1f : 0f; }

    public static float leakyRelu(float x, float alpha) {
        return x > 0f ? x : alpha * x;
    }
    public static float leakyReluDeriv(float x, float alpha) {
        return x > 0f ? 1f : alpha;
    }

    public static float elu(float x, float alpha) {
        return x > 0f ? x : alpha * ((float) Math.exp(x) - 1f);
    }
    public static float eluDeriv(float x, float alpha) {
        return x > 0f ? 1f : alpha * (float) Math.exp(x);
    }

    public static float[] reluForward(float[] in) {
        float[] out = new float[in.length];
        for (int i = 0; i < in.length; i++) out[i] = relu(in[i]);
        return out;
    }
}
package main

import "math"

func relu(x float64) float64      { if x > 0 { return x } ; return 0 }
func reluDeriv(x float64) float64 { if x > 0 { return 1 } ; return 0 }

func leakyRelu(x, alpha float64) float64 {
    if x > 0 { return x }
    return alpha * x
}
func leakyReluDeriv(x, alpha float64) float64 {
    if x > 0 { return 1 }
    return alpha
}

func elu(x, alpha float64) float64 {
    if x > 0 { return x }
    return alpha * (math.Exp(x) - 1)
}
func eluDeriv(x, alpha float64) float64 {
    if x > 0 { return 1 }
    return alpha * math.Exp(x)
}

6. GELU, Swish / SiLU

Around 2016–2017, researchers started experimenting with smooth activations that behaved like ReLU in the large but were differentiable everywhere. Two winners emerged:

$$\text{GELU}(z) = z \cdot \Phi(z) \approx 0.5 z \left(1 + \tanh\left[\sqrt{2/\pi}(z + 0.044715 z^3)\right]\right)$$
$$\text{SiLU}(z) = \text{Swish}(z) = z \cdot \sigma(z) = \frac{z}{1 + e^{-z}}$$

Modern smooth activations

$\Phi(z)$
Cumulative distribution function of the standard normal distribution: $\Phi(z) = \mathbb{P}(Z \le z)$ for $Z \sim \mathcal{N}(0,1)$. So GELU is "multiply $z$ by the probability that a Gaussian random variable is smaller than $z$" — a probabilistic interpretation of the gate.
GELU
Gaussian Error Linear Unit (Hendrycks & Gimpel 2016). Looks almost exactly like ReLU for $|z| > 2$ but smoothly transitions around 0. Default activation in BERT, GPT-2, GPT-3, and most modern Transformers. The smoothness gives slightly better gradients and empirically a small but consistent quality boost.
SiLU / Swish
$z \cdot \sigma(z)$. Discovered via automated architecture search (Ramachandran et al. 2017) and independently named SiLU by Elfwing et al. Used in EfficientNet, Llama, Mistral. Self-gated: the input $z$ multiplies its own sigmoid, letting the function pass values through scaled by a soft gate.

Analogy ReLU is a step-function gate: "below 0, closed; above 0, open." GELU and SiLU are the same idea but implemented with a soft lever instead of a hard switch. For very negative inputs the gate is essentially closed; for very positive inputs it's wide open; in between, the function smoothly rolls the gate open in proportion to how "confident" the input is positive. A smooth gate gives cleaner gradients during training, at the cost of a slightly more expensive forward pass.
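One way to quantify both claims — that the tanh form closely approximates exact GELU, and that GELU hugs ReLU for large $z$ — is a quick check against $z \cdot \Phi(z)$ computed with `math.erf` (a sketch; the snippets below use the tanh form):

```python
import math
import numpy as np

def gelu_tanh(z):
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * z * (1.0 + np.tanh(c * (z + 0.044715 * z**3)))

def gelu_exact(z):
    # Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))
    phi = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    return z * phi

z = np.linspace(-6, 6, 121)
# The tanh approximation tracks the exact GELU to well under 1e-2 everywhere...
assert np.max(np.abs(gelu_tanh(z) - gelu_exact(z))) < 1e-2
# ...and for z > 3 the exact GELU is already within 1e-2 of ReLU.
assert np.allclose(gelu_exact(z[z > 3]), np.maximum(0, z[z > 3]), atol=1e-2)
```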

Source — GELU
// GELU: Gaussian Error Linear Unit (Hendrycks & Gimpel 2016)
// Exact:      GELU(x) = x * Phi(x)     where Phi is the standard normal CDF
// Tanh approx (used in GPT-2, BERT):
//   GELU(x) ~= 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))

// SiLU / Swish (same family):
//   SiLU(x) = x * sigmoid(x) = x / (1 + exp(-x))
import numpy as np

_SQRT_2_OVER_PI = np.sqrt(2.0 / np.pi)  # ~0.7978845608

def gelu(x):
    """Tanh approximation (matches PyTorch's F.gelu approximate='tanh')."""
    return 0.5 * x * (1.0 + np.tanh(_SQRT_2_OVER_PI * (x + 0.044715 * x**3)))

def gelu_exact(x):
    """Exact GELU using the error function: x * Phi(x)."""
    from scipy.special import ndtr  # standard normal CDF
    return x * ndtr(x)

def silu(x):
    """SiLU / Swish: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

# Quick check
x = np.array([-2., -1., 0., 1., 2.])
print("gelu:", gelu(x).round(4))   # approx [-0.0454, -0.1588, 0., 0.8413, 1.9546]
print("silu:", silu(x).round(4))   # approx [-0.2384, -0.2689, 0., 0.7311, 1.7616]
const SQRT_2_OVER_PI = Math.sqrt(2 / Math.PI);  // ~0.7978845608

function gelu(x) {
  const inner = SQRT_2_OVER_PI * (x + 0.044715 * x ** 3);
  return 0.5 * x * (1 + Math.tanh(inner));
}

function silu(x) {
  return x / (1 + Math.exp(-x));
}

const geluVec = xs => xs.map(gelu);
const siluVec = xs => xs.map(silu);
#include <math.h>

#define SQRT_2_OVER_PI 0.7978845608f

float gelu(float x) {
    float inner = SQRT_2_OVER_PI * (x + 0.044715f * x * x * x);
    return 0.5f * x * (1.f + tanhf(inner));
}

float silu(float x) {
    return x / (1.f + expf(-x));
}

void gelu_forward(const float *in, float *out, int n) {
    for (int i = 0; i < n; i++) out[i] = gelu(in[i]);
}
#include <cmath>
#include <vector>

constexpr float SQRT_2_OVER_PI = 0.7978845608f;

inline float gelu(float x) {
    float inner = SQRT_2_OVER_PI * (x + 0.044715f * x * x * x);
    return 0.5f * x * (1.f + std::tanh(inner));
}

inline float silu(float x) {
    return x / (1.f + std::exp(-x));
}

std::vector<float> applyGelu(const std::vector<float> &v) {
    std::vector<float> out(v.size());
    for (size_t i = 0; i < v.size(); i++) out[i] = gelu(v[i]);
    return out;
}
public class GELU {
    private static final float SQRT_2_OVER_PI = (float) Math.sqrt(2.0 / Math.PI);

    public static float gelu(float x) {
        float inner = SQRT_2_OVER_PI * (x + 0.044715f * x * x * x);
        return 0.5f * x * (1f + (float) Math.tanh(inner));
    }

    public static float silu(float x) {
        return x / (1f + (float) Math.exp(-x));
    }

    public static float[] geluForward(float[] in) {
        float[] out = new float[in.length];
        for (int i = 0; i < in.length; i++) out[i] = gelu(in[i]);
        return out;
    }
}
package main

import "math"

const sqrt2OverPi = 0.7978845608

func gelu(x float64) float64 {
    inner := sqrt2OverPi * (x + 0.044715*x*x*x)
    return 0.5 * x * (1 + math.Tanh(inner))
}

func silu(x float64) float64 {
    return x / (1 + math.Exp(-x))
}

func geluSlice(xs []float64) []float64 {
    out := make([]float64, len(xs))
    for i, x := range xs { out[i] = gelu(x) }
    return out
}

7. Side-by-side comparison

Here's how seven activations look on the same axes — and how their derivatives compare.

[Figure: seven activations — sigmoid, tanh, ReLU, Leaky ReLU, ELU, GELU, SiLU — plotted on shared axes. Left panel: $\phi(z)$. Right panel: $\phi'(z)$.]
Left: each activation's output as a function of its input. Right: derivatives. Notice how sigmoid/tanh derivatives vanish in the tails (the vanishing-gradient regime), while ReLU/GELU/SiLU all converge to 1 for large positive $z$ and stay healthy far from the origin.

Interactive explorer

Click any activation below to see its value and derivative at a specific $z$. The dot on each curve shows where you are.


8. Source code

All the modern activations are one-liners. Here's a grab-bag.

Source — Activations Forward Pass (NumPy / PyTorch / JAX / CUDA)
import numpy as np

def sigmoid(z): return 1 / (1 + np.exp(-z))
def tanh(z):    return np.tanh(z)
def relu(z):    return np.maximum(0, z)
def leaky(z, a=0.01):  return np.where(z > 0, z, a * z)
def elu(z, a=1.0):     return np.where(z > 0, z, a * (np.exp(z) - 1))
def gelu(z):    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))
def silu(z):    return z * sigmoid(z)
import torch
import torch.nn.functional as F

# All fused, all on GPU, all already handled by autograd.
y = torch.sigmoid(z)
y = torch.tanh(z)
y = F.relu(z)
y = F.leaky_relu(z, negative_slope=0.01)
y = F.elu(z, alpha=1.0)
y = F.gelu(z)                            # exact (erf) by default; approximate='tanh' since 1.12
y = F.silu(z)                            # a.k.a. Swish
import jax.numpy as jnp
from jax import nn

y = nn.sigmoid(z)
y = jnp.tanh(z)
y = nn.relu(z)
y = nn.leaky_relu(z, negative_slope=0.01)
y = nn.elu(z)
y = nn.gelu(z, approximate=True)         # tanh-based approximation
y = nn.silu(z)
// Elementwise ReLU kernel. The forward pass of ReLU in cuDNN
// is basically this, plus bounds checks and vector loads.
__global__ void relu_forward(const float* __restrict__ x,
                             float* __restrict__ y,
                             int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = x[i] > 0.f ? x[i] : 0.f;
}

// GELU (tanh approximation)
__device__ inline float gelu_approx(float x) {
    const float c = 0.7978845608f;     // sqrt(2/pi)
    float inner = c * (x + 0.044715f * x * x * x);
    return 0.5f * x * (1.f + tanhf(inner));
}

__global__ void gelu_forward(const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = gelu_approx(x[i]);
}
Source — Softmax
// Numerically stable softmax
// Naive: softmax(z)_i = exp(z_i) / sum_j exp(z_j)
// Problem: exp(z_i) overflows for large z_i.
// Fix: subtract max first — the common factor exp(-max) cancels in the ratio,
// so the output is unchanged, but every exponent is now <= 0 and cannot overflow.
//
// softmax(z)_i = exp(z_i - max(z)) / sum_j exp(z_j - max(z))

function softmax(z):
    m   = max(z)
    e   = exp(z - m)        // subtract max for numerical stability
    return e / sum(e)       // normalise to sum to 1
import numpy as np

def softmax(z):
    """Numerically stable softmax along last axis."""
    z = np.asarray(z, dtype=float)
    z_shifted = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(z_shifted)
    return exp_z / exp_z.sum(axis=-1, keepdims=True)

def log_softmax(z):
    """log-softmax: numerically stable, used in cross-entropy loss."""
    z_shifted = z - np.max(z, axis=-1, keepdims=True)
    return z_shifted - np.log(np.sum(np.exp(z_shifted), axis=-1, keepdims=True))

# Example: 3-class logits
logits = np.array([2.0, 1.0, 0.1])
probs  = softmax(logits)
print(probs.round(4))          # [0.6590, 0.2424, 0.0986]
print(probs.sum())             # 1.0

# Batch input (shape B x C)
batch_logits = np.random.randn(8, 10)
batch_probs  = softmax(batch_logits)
function softmax(z) {
  const max = Math.max(...z);
  const exp = z.map(v => Math.exp(v - max));   // subtract max for stability
  const sum = exp.reduce((a, b) => a + b, 0);
  return exp.map(v => v / sum);
}

function logSoftmax(z) {
  const max = Math.max(...z);
  const shifted = z.map(v => v - max);
  const logSum = Math.log(shifted.map(v => Math.exp(v)).reduce((a, b) => a + b, 0));
  return shifted.map(v => v - logSum);
}

// Example
const logits = [2.0, 1.0, 0.1];
console.log(softmax(logits).map(v => v.toFixed(4)));
// ["0.6590", "0.2424", "0.0986"]
#include <math.h>
#include <float.h>

void softmax(const float *z, float *out, int n) {
    /* Find max for numerical stability */
    float max_val = z[0];
    for (int i = 1; i < n; i++) if (z[i] > max_val) max_val = z[i];

    /* Compute shifted exp and their sum */
    float sum = 0.f;
    for (int i = 0; i < n; i++) {
        out[i] = expf(z[i] - max_val);
        sum += out[i];
    }
    /* Normalise */
    for (int i = 0; i < n; i++) out[i] /= sum;
}
#include <vector>
#include <cmath>
#include <algorithm>
#include <numeric>

std::vector<float> softmax(const std::vector<float> &z) {
    float maxVal = *std::max_element(z.begin(), z.end());
    std::vector<float> out(z.size());
    float sum = 0.f;
    for (size_t i = 0; i < z.size(); i++) {
        out[i] = std::exp(z[i] - maxVal);
        sum += out[i];
    }
    for (auto &v : out) v /= sum;
    return out;
}
public class Softmax {
    public static float[] softmax(float[] z) {
        float max = z[0];
        for (float v : z) if (v > max) max = v;

        float[] out = new float[z.length];
        float sum = 0f;
        for (int i = 0; i < z.length; i++) {
            out[i] = (float) Math.exp(z[i] - max);
            sum += out[i];
        }
        for (int i = 0; i < out.length; i++) out[i] /= sum;
        return out;
    }

    public static float[] logSoftmax(float[] z) {
        float max = z[0];
        for (float v : z) if (v > max) max = v;

        float logSum = 0f;
        for (float v : z) logSum += Math.exp(v - max);
        logSum = (float)(Math.log(logSum) + max);

        float[] out = new float[z.length];
        for (int i = 0; i < z.length; i++) out[i] = z[i] - logSum;
        return out;
    }
}
package main

import "math"

func softmax(z []float64) []float64 {
    maxVal := z[0]
    for _, v := range z { if v > maxVal { maxVal = v } }

    out  := make([]float64, len(z))
    sum  := 0.0
    for i, v := range z {
        out[i] = math.Exp(v - maxVal)
        sum += out[i]
    }
    for i := range out { out[i] /= sum }
    return out
}

func logSoftmax(z []float64) []float64 {
    maxVal := z[0]
    for _, v := range z { if v > maxVal { maxVal = v } }

    logSum := 0.0
    for _, v := range z { logSum += math.Exp(v - maxVal) }
    logSum = math.Log(logSum) + maxVal

    out := make([]float64, len(z))
    for i, v := range z { out[i] = v - logSum }
    return out
}
Source — Activation Derivatives
// Derivatives used during backpropagation
// dL/dz = dL/dy * dy/dz   (chain rule, elementwise)

sigmoid'(x)     = sigmoid(x) * (1 - sigmoid(x))
                           // max value 0.25 at x=0; vanishes for |x|>>0

tanh'(x)        = 1 - tanh(x)^2
                           // max value 1 at x=0; still vanishes for |x|>>0

relu'(x)        = 1  if x > 0
                = 0  if x < 0
                = undefined at x=0  (use 0 by convention — subgradient)

leaky_relu'(x)  = 1     if x > 0
                = alpha  if x <= 0

elu'(x)         = 1              if x > 0
                = alpha * exp(x) if x <= 0   (= elu(x) + alpha)

gelu'(x)        = 0.5*(1 + tanh(c*(x+0.044715*x^3)))
                + 0.5*x * sech^2(c*(x+0.044715*x^3)) * c*(1+3*0.044715*x^2)
                  where c = sqrt(2/pi)
import numpy as np

def sigmoid(x):
    return np.where(x >= 0,
                    1 / (1 + np.exp(-x)),
                    np.exp(x) / (1 + np.exp(x)))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1 - s)

def tanh_deriv(x):
    return 1.0 - np.tanh(x) ** 2

def relu_deriv(x):
    return (x > 0).astype(float)

def leaky_relu_deriv(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)

def elu_deriv(x, alpha=1.0):
    return np.where(x > 0, 1.0, alpha * np.exp(x))

_C = np.sqrt(2.0 / np.pi)

def gelu_deriv(x):
    u     = _C * (x + 0.044715 * x**3)
    tanh_u = np.tanh(u)
    sech2  = 1 - tanh_u**2
    du_dx  = _C * (1 + 3 * 0.044715 * x**2)
    return 0.5 * (1 + tanh_u) + 0.5 * x * sech2 * du_dx

def gelu(x):
    return 0.5 * x * (1 + np.tanh(_C * (x + 0.044715 * x**3)))

# Verify numerically at x=1 (central difference vs. analytic derivative)
x = np.array([1.0])
h = 1e-5
print("gelu'(1) numerical:", ((gelu(x+h) - gelu(x-h)) / (2*h))[0].round(6))
print("gelu'(1) analytic: ", gelu_deriv(x)[0].round(6))
function sigmoidDeriv(x) { const s = sigmoid(x); return s * (1 - s); }
function tanhDeriv(x)    { const t = Math.tanh(x); return 1 - t * t; }
function reluDeriv(x)    { return x > 0 ? 1 : 0; }

function leakyReluDeriv(x, alpha = 0.01) { return x > 0 ? 1 : alpha; }
function eluDeriv(x, alpha = 1.0) { return x > 0 ? 1 : alpha * Math.exp(x); }

const C = Math.sqrt(2 / Math.PI);
function geluDeriv(x) {
  const u     = C * (x + 0.044715 * x ** 3);
  const tanhU = Math.tanh(u);
  const sech2 = 1 - tanhU ** 2;
  const duDx  = C * (1 + 3 * 0.044715 * x ** 2);
  return 0.5 * (1 + tanhU) + 0.5 * x * sech2 * duDx;
}

// sigmoid is assumed to be defined from earlier snippet
function sigmoid(x) {
  return x >= 0 ? 1 / (1 + Math.exp(-x))
                : Math.exp(x) / (1 + Math.exp(x));
}
#include <math.h>

/* sigmoid must be declared before use — defined in the SigmoidTanh snippet */
float sigmoid(float x);

float sigmoid_deriv(float x) { float s = sigmoid(x); return s * (1.f - s); }
float tanh_deriv(float x)    { float t = tanhf(x);  return 1.f - t * t; }
float relu_deriv(float x)    { return x > 0.f ? 1.f : 0.f; }
float leaky_relu_deriv(float x, float alpha) { return x > 0.f ? 1.f : alpha; }
float elu_deriv(float x, float alpha)        { return x > 0.f ? 1.f : alpha * expf(x); }

#define C_SQRT2PI 0.7978845608f

float gelu_deriv(float x) {
    float u     = C_SQRT2PI * (x + 0.044715f * x * x * x);
    float tanh_u = tanhf(u);
    float sech2  = 1.f - tanh_u * tanh_u;
    float du_dx  = C_SQRT2PI * (1.f + 3.f * 0.044715f * x * x);
    return 0.5f * (1.f + tanh_u) + 0.5f * x * sech2 * du_dx;
}

#include <cmath>

// sigmoid must be declared before use — defined in the SigmoidTanh snippet
float sigmoid(float x);

inline float sigmoidDeriv(float x) { float s = sigmoid(x); return s * (1.f - s); }
inline float tanhDeriv(float x)    { float t = std::tanh(x); return 1.f - t * t; }
inline float reluDeriv(float x)    { return x > 0.f ? 1.f : 0.f; }
inline float leakyReluDeriv(float x, float a = 0.01f) { return x > 0.f ? 1.f : a; }
inline float eluDeriv(float x, float a = 1.0f) { return x > 0.f ? 1.f : a * std::exp(x); }

constexpr float SQRT2PI = 0.7978845608f;
inline float geluDeriv(float x) {
    float u     = SQRT2PI * (x + 0.044715f * x * x * x);
    float tanh_u = std::tanh(u);
    float sech2  = 1.f - tanh_u * tanh_u;
    float du_dx  = SQRT2PI * (1.f + 3.f * 0.044715f * x * x);
    return 0.5f * (1.f + tanh_u) + 0.5f * x * sech2 * du_dx;
}

public class ActivationDerivs {
    public static float sigmoidDeriv(float x) {
        float s = SigmoidTanh.sigmoid(x); return s * (1f - s);
    }
    public static float tanhDeriv(float x) {
        float t = (float) Math.tanh(x); return 1f - t * t;
    }
    public static float reluDeriv(float x)    { return x > 0f ? 1f : 0f; }
    public static float leakyReluDeriv(float x, float a) { return x > 0f ? 1f : a; }
    public static float eluDeriv(float x, float a) {
        return x > 0f ? 1f : a * (float) Math.exp(x);
    }

    private static final float SQRT2PI = (float) Math.sqrt(2.0 / Math.PI);
    public static float geluDeriv(float x) {
        float u     = SQRT2PI * (x + 0.044715f * x * x * x);
        float tanhU = (float) Math.tanh(u);
        float sech2 = 1f - tanhU * tanhU;
        float duDx  = SQRT2PI * (1f + 3f * 0.044715f * x * x);
        return 0.5f * (1f + tanhU) + 0.5f * x * sech2 * duDx;
    }
}
package main

import "math"

func sigmoidDeriv(x float64) float64 { s := sigmoid(x); return s * (1 - s) } // sigmoid from the earlier snippet
func tanhDeriv(x float64) float64    { t := math.Tanh(x); return 1 - t*t }
func reluDeriv(x float64) float64    { if x > 0 { return 1 }; return 0 }
func leakyReluDeriv(x, alpha float64) float64 { if x > 0 { return 1 }; return alpha }
func eluDerivFn(x, alpha float64) float64     { if x > 0 { return 1 }; return alpha * math.Exp(x) }

const sqrt2pi = 0.7978845608

func geluDeriv(x float64) float64 {
    u     := sqrt2pi * (x + 0.044715*x*x*x)
    tanhU := math.Tanh(u)
    sech2 := 1 - tanhU*tanhU
    duDx  := sqrt2pi * (1 + 3*0.044715*x*x)
    return 0.5*(1+tanhU) + 0.5*x*sech2*duDx
}

9. Which should you use?

A short decision guide, based on everything above:

  • Default for a feedforward or convolutional net: ReLU. Cheap, well understood, hard to beat.
  • Training a Transformer, or chasing the last bit of quality: GELU or SiLU — the smooth gates used by BERT, GPT, Llama, and EfficientNet.
  • Seeing dead units (a large fraction of activations stuck at zero): try Leaky ReLU or ELU, or lower the learning rate.
  • Output layer: sigmoid for binary probabilities, softmax for multi-class. These are output squashers, not hidden-layer activations.

10. Summary

Without a nonlinearity, any stack of linear layers collapses into a single matrix. The step function gave the perceptron its all-or-nothing spike but has no usable gradient; sigmoid and tanh made backprop possible but saturate, capping networks at a few layers; ReLU's unit gradient on the active side unlocked genuinely deep networks at the cost of dead units; Leaky ReLU, PReLU, and ELU patch the negative side; and GELU and SiLU smooth the gate, powering today's Transformers.

Further reading

  • Nair & Hinton (2010) — Rectified Linear Units Improve Restricted Boltzmann Machines.
  • Glorot, Bordes & Bengio (2011) — Deep Sparse Rectifier Neural Networks.
  • Clevert, Unterthiner & Hochreiter (2015) — Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs).
  • Hendrycks & Gimpel (2016) — Gaussian Error Linear Units (GELUs).
  • Ramachandran, Zoph & Le (2017) — Searching for Activation Functions (Swish).
NEXT UP
→ Backpropagation

Once you have a differentiable activation, backprop is what turns it into a learning signal. The algorithm that ties everything together.