Lorentz RMSNorm
Derivation of a Lorentz RMSNorm variant.
Euclidean RMSNorm
In Euclidean space, RMSNorm¹ normalizes a vector $\mathbf{x} \in \mathbb{R}^n$ by its RMS magnitude and rescales by a learned $\gamma$:
\[\text{RMSNorm}(\mathbf{x}) = \gamma \cdot \frac{\mathbf{x}}{\|\mathbf{x}\|/\sqrt{n}}\]This is radial rescaling: it moves $\mathbf{x}$ along the ray through the origin to a target norm $\gamma \sqrt{n}$, preserving direction. The learned $\gamma$ controls the target scale.
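As a concrete sketch of the formula above (NumPy; the function name and the `eps` stabilizer are illustrative, not from the original):

```python
import numpy as np

def rms_norm(x, gamma=1.0, eps=1e-8):
    # RMS(x) = ||x|| / sqrt(n); divide it out, then rescale by gamma
    rms = np.linalg.norm(x) / np.sqrt(x.shape[-1])
    return gamma * x / (rms + eps)
```

The output always lands on the sphere of radius $\gamma \sqrt{n}$: for `x = [3, 4]` the result has norm $\sqrt{2}$ when $\gamma = 1$, regardless of the input's scale.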
Hyperbolic Analogue
The natural generalization replaces:
- The Euclidean norm $\|\mathbf{x}\|$ with the hyperbolic norm $\|x\|_{L} = d_{\mathcal{L}}(o,x)$, the geodesic distance from the origin
- Radial rescaling in $\mathbb{R}^n$ with geodesic scaling along the unique geodesic through the origin and $x$, which is the gyroscalar multiplication $\odot$
For $x \in \mathcal{L}^n_k$, the hyperbolic norm is:
\[\|x\|_{L} = d_{\mathcal{L}}(o,x) = \frac{1}{\sqrt{-k}}\text{arccosh}(\sqrt{-k} \cdot x_0)\]where $x_0$ is the time component of $x$ and $o = (1/\sqrt{-k}, \mathbf{0})$ is the origin.
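A minimal numerical sketch of this norm on the Lorentz model (NumPy; assumes $k = -1$ by default, and points $x = (x_0, x_1, \dots, x_n)$ satisfying $\langle x, x \rangle_L = 1/k$ — the function name is illustrative):

```python
import numpy as np

def lorentz_norm(x, k=-1.0):
    # x lies on the hyperboloid <x,x>_L = -x0^2 + ||x_space||^2 = 1/k, k < 0
    sk = np.sqrt(-k)
    # On the manifold sqrt(-k)*x0 >= 1 exactly; clip guards against float error
    return np.arccosh(np.clip(sk * x[0], 1.0, None)) / sk
```

At the origin $o = (1/\sqrt{-k}, \mathbf{0})$ the norm is $0$; for a point at geodesic radius $r$ with $k = -1$, i.e. $x = (\cosh r, \sinh r \cdot \mathbf{u})$, it returns $r$.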
Lorentz RMSNorm rescales each token to a learned target hyperbolic norm $\gamma > 0$ via gyroscalar multiplication:
\[\text{LRMSNorm}(x) = \frac{\gamma}{\| x \|_L} \odot x\]where gyroscalar multiplication $t \odot x = \text{Exp}_o(t \cdot \text{Log}_o(x))$ moves $x$ along the geodesic through the origin by factor $t$. Since
\[d_{\mathcal{L}}(o, t \odot x) = t \, d_{\mathcal{L}}(o,x)\]it follows that $\|\text{LRMSNorm}(x)\|_L = \gamma$ exactly. As $k \to 0$, LRMSNorm reduces to Euclidean RMSNorm with a scalar $\gamma$.
---
1. Zhang, B., & Sennrich, R. (2019). Root mean square layer normalization. Advances in Neural Information Processing Systems, 32.