LayerNorm centers and scales; RMSNorm only scales. A geometric view explains why the centering step can be dropped without losing model quality.
1. Intro
In this post, we’re going to look at why deep neural networks need normalization and why RMSNorm has replaced the original LayerNorm in the context of LLMs.
LLM architectures have changed over the last few years. One thing has remained constant: every relevant model uses some form of normalization. In 2017, that was LayerNorm. Today, it is mostly RMSNorm, which leaves out the mean-centering step. That RMSNorm is more efficient while maintaining the same model quality has been shown empirically many times. But why does it even work?
We’ll start with the question of why normalization is necessary. Then, we’ll look at LayerNorm in detail, followed by RMSNorm. In the main section, we’ll take a geometric perspective that shows why centering is actually unnecessary.
2. Why Normalization?
Without normalization, deep neural networks have a stability problem. During the forward pass, the signal flows through dozens or hundreds of layers, and the gradients do the same during the backward pass. In every layer, a matrix multiplication takes place. This leads to a simple mathematical problem: if each layer amplifies the signal even slightly (an effective gain just above 1), the values grow exponentially with depth (exploding activations and gradients); if each layer attenuates it slightly, they shrink toward zero (vanishing gradients).
Normalization tackles this problem by scaling the activations after each layer to a stable range of values—much like a limiter in audio engineering. No matter how much the activations spike, normalization scales them back to a defined level (mean 0, variance 1), so the following layer always receives a stable signal.
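To make this concrete, here is a small toy sketch of my own (not from the post) that pushes a random vector through 100 random linear layers whose scale is slightly too large. Without an intermediate normalization step the signal norm explodes; with one, it stays bounded.

```python
# Toy demonstration: how the activation norm behaves over many random linear layers,
# with and without a normalization step in between.
import torch

torch.manual_seed(0)
d, n_layers = 512, 100
x_plain = torch.randn(d)
x_normed = x_plain.clone()

for _ in range(n_layers):
    # Random weights scaled so each layer expands the signal norm by roughly 10%.
    W = torch.randn(d, d) * (1.1 / d**0.5)
    x_plain = W @ x_plain
    x_normed = W @ x_normed
    # Rescale to zero mean / unit variance, as a normalization layer would.
    x_normed = (x_normed - x_normed.mean()) / x_normed.std()

print(f"without normalization: ||x|| = {x_plain.norm().item():.3e}")   # grows exponentially
print(f"with normalization:    ||x|| = {x_normed.norm().item():.3e}")  # stays around sqrt(d)
```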
Batch Normalization (Ioffe & Szegedy, 2015) was the first widely used implementation of this idea. It was observed early on that normalization allows for higher learning rates and speeds up training. The inventors explained this success by citing a reduction in "Internal Covariate Shift"—the idea that the distribution of inputs for each layer constantly shifts during training and that normalization stabilizes this. However, Bjorck et al. (2018) showed that the mechanism is actually different: the higher learning rates enabled by normalization act as a form of regularization and improve generalization.
Figure 1: Test accuracy with and without Batch Normalization at different learning rates. Bjorck et al. (2018).
The crucial comparison is shown in the right plot. The green curve (with BatchNorm, lr=0.0001) and the red curve (without BatchNorm, lr=0.0001) end up at the same accuracy of about 80%. At the same learning rate, normalization offers no real advantage. The gain is visible in the orange curve: with normalization, the network can use higher learning rates and achieves better test accuracy. Without normalization, the training would diverge.
In short: Normalization stabilizes the signal flow and allows for higher learning rates, which in turn improves generalization.
3. LayerNorm
For their experiments, Bjorck et al. used Batch Normalization, which normalizes across the batch dimension. This works well for images because every image can be scaled to a uniform size. In Transformers, however, this type of normalization is problematic for two reasons. First, texts have variable lengths, meaning shorter sequences must be padded with zeros, which can skew the mean and standard deviation or require additional masking. Second, the memory requirements of large models often limit batches to just one or two samples per GPU. Calculating the mean and variance over so few data points results in noise rather than a stable signal.
Layer Normalization (Ba et al., 2016) changes the dimension across which normalization occurs. Instead of normalizing across the batch (left in Figure 2), it normalizes across the feature dimension of each individual vector (right in Figure 2).
Figure 2: Batch Normalization vs. Layer Normalization (Author's illustration)
The difference is clear in the illustration: BatchNorm calculates the mean and variance for each feature row across all samples in the batch. LayerNorm, on the other hand, calculates the mean and variance for each sample across all its features. This makes the calculation independent of batch size and other sequences.
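The following toy sketch of mine (not code from either paper) illustrates the two axes from Figure 2: BatchNorm reduces over the batch dimension, LayerNorm over the feature dimension.

```python
import torch

torch.manual_seed(0)
batch, d = 2, 4096                    # tiny batch, as in memory-limited LLM training
x = torch.randn(batch, d)

# BatchNorm: one mean/variance per feature, estimated from only 2 samples -> noisy.
bn_mean = x.mean(dim=0)                               # shape (d,)
bn_var = x.var(dim=0, unbiased=False)                 # shape (d,)

# LayerNorm: one mean/variance per sample, estimated from all 4096 features
# -> stable and independent of the batch size.
ln_mean = x.mean(dim=1, keepdim=True)                 # shape (batch, 1)
ln_var = x.var(dim=1, keepdim=True, unbiased=False)   # shape (batch, 1)

x_bn = (x - bn_mean) / torch.sqrt(bn_var + 1e-5)
x_ln = (x - ln_mean) / torch.sqrt(ln_var + 1e-5)
print(x_bn.shape, x_ln.shape)   # both (2, 4096), normalized along different axes
```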
For an input vector $x$ with $d$ dimensions, LayerNorm first calculates the mean $\mu$ and variance $\sigma^2$ across all features:

$$\mu = \frac{1}{d} \sum_{i=1}^{d} x_i, \qquad \sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2$$

Each component is then centered and scaled, and two learnable parameters per dimension, $\gamma$ (scale) and $\beta$ (shift), are applied:

$$y_i = \gamma_i \, \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta_i$$
Without these parameters, every vector would be fixed to a mean of 0 and a variance of 1. $\gamma$ and $\beta$ allow the network to learn its own scaling and shift for each dimension, adapting the range of values to the requirements of the subsequent layers.
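Put together, a from-scratch LayerNorm over the feature dimension looks roughly like this minimal sketch (in practice you would use torch.nn.LayerNorm):

```python
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    def __init__(self, d: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(d))   # learnable scale
        self.beta = nn.Parameter(torch.zeros(d))   # learnable shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mu = x.mean(dim=-1, keepdim=True)                      # centering statistic
        var = x.var(dim=-1, keepdim=True, unbiased=False)      # scaling statistic
        x_hat = (x - mu) / torch.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

x = torch.randn(4, 16, 512)            # (batch, sequence, features)
print(LayerNorm(512)(x).shape)
```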
4. RMSNorm
As we saw in the previous chapter, LayerNorm consists of two operations: centering and scaling. Zhang & Sennrich (2019) asked a simple question: do we really need both?
Their hypothesis was that centering is unnecessary. Scaling alone is enough to keep the training stable. While centering protects against additive shifts in inputs and weights, according to Zhang & Sennrich, it has almost no impact on training success.
RMSNorm therefore divides each vector only by its root mean square and keeps the learnable scale $\gamma$:

$$\text{RMS}(x) = \sqrt{\frac{1}{d} \sum_{i=1}^{d} x_i^2 + \epsilon}, \qquad y_i = \gamma_i \, \frac{x_i}{\text{RMS}(x)}$$

The difference compared to the variance in LayerNorm comes down to a single term: instead of the squared distance to the mean, $(x_i - \mu)^2$, RMS uses the squared distance to the zero point, $x_i^2$. The mean subtraction is dropped.
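In code, a minimal RMSNorm sketch is the LayerNorm above with the mean subtraction and $\beta$ removed (PyTorch ships an equivalent module as torch.nn.RMSNorm):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, d: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(d))   # learnable scale, no shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Root mean square over the feature dimension: sqrt(mean(x_i^2) + eps)
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.gamma * (x / rms)

x = torch.randn(4, 16, 512)
print(RMSNorm(512)(x).shape)
```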
By leaving out the centering, the mean calculation, subtraction, and $\beta$ are eliminated. Zhang & Sennrich show that RMSNorm is 7–64% faster than LayerNorm, depending on the model, while maintaining comparable model quality. The hypothesis was empirically confirmed. However, the question of why centering is dispensable remained open.
5. Why does leaving out centering work?
Gupta et al. (2025) provide a geometric explanation. To understand it, it helps to look at the mean calculation from Chapter 3 again. The mean is the sum of all elements divided by the feature dimension $d$. This sum can also be written as a dot product with a vector of ones (a uniform vector):

$$\mu = \frac{1}{d} \sum_{i=1}^{d} x_i = \frac{1}{d} \, \mathbf{1}^T \mathbf{x}$$
The dot product $\mathbf{1}^T \mathbf{x}$ multiplies every component of $x$ by 1 and sums the results. The crucial part is the vector $\mathbf{1} = [1, 1, ..., 1]^T$. It points in the direction where all dimensions are equal—a vector with a large component along $\mathbf{1}$ would have similar values in all dimensions (e.g., [5.1, 5.0, 4.9, ...]). So, the dot product $\mathbf{1}^T \mathbf{x}$ measures how strongly $x$ points in that direction.
When LayerNorm subtracts the mean from every component, $\mu$ is multiplied by $\mathbf{1}$ to turn the scalar into a vector: $\mu \cdot \mathbf{1} = [\mu, \mu, ..., \mu]^T$.
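A quick numeric check (my own sketch, not from Gupta et al.) confirms this view: subtracting the mean is exactly the same as removing the projection of $x$ onto the unit vector along $\mathbf{1}$.

```python
import torch

torch.manual_seed(0)
d = 4096
x = torch.randn(d)
ones = torch.ones(d)
u = ones / ones.norm()                      # unit vector along the uniform direction

x_centered = x - x.mean()                   # what LayerNorm's centering step does
x_projected_out = x - (x @ u) * u           # remove the component of x along u

print(torch.allclose(x_centered, x_projected_out, atol=1e-5))   # True
print(f"component along 1: {(x @ u).item():.3f}  (tiny next to ||x|| = {x.norm().item():.1f})")
```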
The following visualization shows this process in 3D: The input vector $x$ is projected onto the uniform vector $\mathbf{1}$ (orange). Mean-centering removes this projection (green), and scaling normalizes the result (blue).
Figure 3: Geometric interpretation of the LayerNorm process (Author's illustration)
What remains is perpendicular to $\mathbf{1}$. This operation would be useful if relevant information were stored along this direction. But is there actually anything there?
Gupta et al. measured the angle between hidden vectors and the uniform vector in various LLMs before normalization. The result: the hidden vectors are already almost perfectly orthogonal to the uniform vector. The component that LayerNorm removes is nearly zero.
This aligns with the phenomenon of Concentration of Measure. In high-dimensional spaces, like $d = 4096$ in Llama, any two random vectors are almost always nearly orthogonal to each other. From the model's perspective, the uniform vector is not a special direction. It only appears in the math of the mean calculation. The model doesn't organize its representations along this direction.
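The effect is easy to reproduce with random vectors (a toy sketch of mine, not the paper's measurements on real LLM activations): as the dimension grows, the angle to the uniform vector concentrates ever more tightly around 90 degrees.

```python
import torch

torch.manual_seed(0)
for d in (16, 256, 4096):
    x = torch.randn(1000, d)                   # 1000 random vectors
    u = torch.ones(d) / d**0.5                 # unit uniform vector
    cos = (x @ u) / x.norm(dim=1)              # cosine similarity to the uniform direction
    angles = torch.rad2deg(torch.acos(cos))
    print(f"d={d:5d}: mean angle {angles.mean().item():6.2f} deg, std {angles.std().item():5.2f} deg")
```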
In short: The mean subtraction in LayerNorm removes a component that barely exists. RMSNorm simply skips this step.
6. Practical Implications
RMSNorm has established itself as the standard in newer LLM architectures like Qwen3. A common concern was that the lack of centering might cause issues during quantization, as quantization algorithms work most precisely with zero-centered distributions. Gupta et al. (2025) debunk this: their measurements show that even RMSNorm models keep their representations centered on their own. The model learns centering as a side effect of training.
The 7–64% speedup mentioned in the original 2019 paper comes from a time before fused kernels—optimized GPU routines that combine several operations into a single call. With such kernels, both variants run at similar speeds. In fact, until mid-2025, PyTorch's nn.RMSNorm did not have a fused kernel and was actually slower than nn.LayerNorm. The practical advantage of RMSNorm is more apparent in inference frameworks like vLLM: fewer operations make it easier to "fuse" the normalization with residual addition or quantization into a single kernel.
7. Outlook: Do we even need normalization?
RMSNorm removes the centering from LayerNorm. Current research goes one step further and asks whether the normalization layer itself is even necessary.
One approach replaces normalization with simpler functions. Zhu et al. (CVPR 2025) observed that the input-output mapping of LayerNorm in trained networks resembles a tanh curve. From this, they derived Dynamic Tanh (DyT): an element-wise function with a single learnable parameter, without any mean or variance. LLaMA models up to 70B achieve comparable results using this method. Lu et al. (2025) refined this idea with a variant of the Gaussian error function (Derf), which outperforms DyT across multiple domains.
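For intuition, here is a minimal sketch of the DyT idea based on my reading of the description above; the exact parameterization (the affine $\gamma$/$\beta$ and the initial value of $\alpha$) is an assumption and may differ from the authors' released code.

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    def __init__(self, d: int, alpha_init: float = 0.5):   # alpha_init is my assumption
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha_init))  # single learnable scalar
        self.gamma = nn.Parameter(torch.ones(d))
        self.beta = nn.Parameter(torch.zeros(d))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Purely element-wise: no mean or variance computed over any dimension.
        return self.gamma * torch.tanh(self.alpha * x) + self.beta

print(DyT(512)(torch.randn(4, 16, 512)).shape)
```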
Another approach makes normalization redundant by design. Loshchilov et al. (2025) use nGPT to keep all weights and embeddings permanently on a unit norm, preventing the hidden state from drifting in the first place. RMSNorm and LayerNorm are completely removed. The result: significantly fewer training tokens required for the same performance. Whether these approaches will prove themselves at production-scale model sizes remains to be seen.
Gupta, A., Ozdemir, A. and Anumanchipalli, G. (2025). Geometric Interpretation of Layer Normalization. URL: https://arxiv.org/abs/2409.12951
Ioffe, S. and Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. URL: https://arxiv.org/abs/1502.03167
Loshchilov, I. et al. (2025). nGPT: Normalized Transformer with Representation Learning on the Hypersphere. URL: https://arxiv.org/abs/2410.01131