Notes

Activation Functions

April 9, 2026

Activation functions introduce non-linearity into neural networks, enabling them to learn complex mappings. Without them, a deep network would collapse to a single linear transformation.
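The collapse to a single linear map is easy to verify numerically. Below is a minimal NumPy sketch (the layer sizes and random weights are illustrative, not from the text): two stacked weight matrices with no activation in between compute exactly the same function as their product applied once.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two stacked linear layers with no activation: y = W2 @ (W1 @ x)
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((2, 4))
x = rng.standard_normal(3)

two_layers = W2 @ (W1 @ x)

# The same mapping expressed as a single linear layer with W = W2 @ W1
collapsed = (W2 @ W1) @ x

print(np.allclose(two_layers, collapsed))  # True
```

Inserting any non-linearity between `W1` and `W2` breaks this factorization, which is the whole point of activation functions.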

ReLU

The Rectified Linear Unit is the most widely used activation function:

\[f(x) = \max(0, x)\]

Its derivative is simple:

\[f'(x) = \begin{cases} 1 & x > 0 \\ 0 & x \leq 0 \end{cases}\]

Properties:

  • Computationally cheap
  • Sparse activation (dead neurons possible)
  • Does not saturate for positive values

The dying ReLU problem occurs when neurons get stuck outputting zero for all inputs. Leaky ReLU addresses this by allowing a small negative slope $\alpha$:

\[f(x) = \begin{cases} x & x > 0 \\ \alpha x & x \leq 0 \end{cases}\]
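Both variants are one-liners. A scalar sketch in plain Python (the default $\alpha = 0.01$ is a common choice, not mandated by the text):

```python
def relu(x):
    # f(x) = max(0, x)
    return x if x > 0 else 0.0

def leaky_relu(x, alpha=0.01):
    # x for positive inputs, a small negative slope alpha * x otherwise
    return x if x > 0 else alpha * x

print(relu(-2.0), relu(3.0))   # 0.0 3.0
print(leaky_relu(-2.0))        # -0.02
```

Because `leaky_relu` never returns an exact zero gradient for negative inputs, a neuron that drifts into the negative regime can still recover during training.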

Sigmoid

\[\sigma(x) = \frac{1}{1 + e^{-x}}\]

Maps any real value to $(0, 1)$, making it useful for binary output layers. However, it saturates at both ends, leading to vanishing gradients in deep networks.

The derivative is conveniently expressed as:

\[\sigma'(x) = \sigma(x)(1 - \sigma(x))\]
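This identity means the gradient can be computed from the forward-pass output alone, with no extra call to `exp`. A small sketch that checks the identity against a central finite difference (the test point and step size are arbitrary):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Uses the identity sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# Central finite difference as an independent check
x, h = 0.7, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
print(abs(numeric - sigmoid_grad(x)) < 1e-8)  # True
```

Note that the derivative peaks at $\sigma'(0) = 0.25$ and decays toward zero in both tails, which is exactly the saturation behaviour that causes vanishing gradients.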

Softmax

For multi-class output layers, softmax converts a vector $\mathbf{z} \in \mathbb{R}^K$ to a probability distribution:

\[\text{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}\]

Note that $\sum_i \text{softmax}(\mathbf{z})_i = 1$ by construction.
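In floating point, the naive formula overflows for large logits, so implementations typically subtract $\max_j z_j$ first; this leaves the result unchanged because the factor $e^{-\max_j z_j}$ cancels between numerator and denominator. A sketch of the stable version (the example logits are illustrative):

```python
import math

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Naive exp(1000) would overflow a float; the shifted version is fine
probs = softmax([1000.0, 1001.0, 1002.0])
print(sum(probs))  # 1.0 up to rounding
```

The outputs preserve the ordering of the logits, so the largest logit always receives the largest probability.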

GELU

The Gaussian Error Linear Unit has become common in transformer architectures:

\[\text{GELU}(x) = x \cdot \Phi(x)\]

where $\Phi(x)$ is the CDF of the standard normal distribution. In practice it is often approximated as:

\[\text{GELU}(x) \approx 0.5x\left(1 + \tanh\!\left[\sqrt{\frac{2}{\pi}}\left(x + 0.044715x^3\right)\right]\right)\]

Unlike ReLU, GELU is smooth and non-monotonic, which seems to help in attention-based models.
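The exact form and the tanh approximation can be compared directly; using the standard library, $\Phi(x)$ is available via `math.erf` as $\Phi(x) = \tfrac{1}{2}(1 + \operatorname{erf}(x/\sqrt{2}))$. The sample points below are arbitrary:

```python
import math

def gelu_exact(x):
    # x * Phi(x), with Phi the standard normal CDF expressed via erf
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # The tanh approximation given above
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"{x:+.1f}  exact={gelu_exact(x):+.6f}  approx={gelu_tanh(x):+.6f}")
```

Across this range the two agree to roughly three decimal places, which is why the cheaper tanh form is widely used in practice.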

The ReLU activation function [1] is widely used in practice.

Comparison

Function   Range                      Smooth   Saturates
ReLU       $[0, \infty)$              No       No (for positive inputs)
Sigmoid    $(0, 1)$                   Yes      Yes
Tanh       $(-1, 1)$                  Yes      Yes
GELU       $\approx(-0.17, \infty)$   Yes      No
  1. Nair, V. & Hinton, G. (2010). Rectified Linear Units Improve Restricted Boltzmann Machines. ICML.