Networks compute sums of products. If the inputs to one layer are on the scale of hundreds and the next layer’s weights aren’t tiny, the next layer’s pre-activations will be enormous — and once an activation function saturates or a gradient explodes, learning stalls. Normalisation keeps numbers in a sane range, both at the input and inside the network.

Why raw inputs are dangerous

Take an 8-bit greyscale image: every pixel is an integer in [0, 255]. Feed it directly into a layer whose weights are small random values (say, on the order of 0.1). The pre-activation is a sum of hundreds of such products and easily reaches several thousand in magnitude. Apply an activation:

  • ReLU: the output is in the thousands. The next layer multiplies again and the values keep growing.
  • Sigmoid/tanh: a value in the thousands is so far from zero that the function saturates — derivative effectively zero, no gradient.

Either way the network explodes or stops learning. The fix is preprocessing the input so values cluster around zero with a small spread.
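To make the scale problem concrete, here is a quick NumPy check. The weight scale (0.1) and the flattened 28×28 image are illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 256, size=784).astype(np.float64)  # flattened 28x28 image, values 0..255
W = rng.normal(0.0, 0.1, size=(256, 784))              # small random weights (assumed scale ~0.1)

z_raw = W @ x             # pre-activations from raw pixels
z_scaled = W @ (x / 255)  # same weights, inputs rescaled to [0, 1]

print(f"raw    pre-activation std: {z_raw.std():8.1f}")
print(f"scaled pre-activation std: {z_scaled.std():8.3f}")
```

The raw pre-activations land in the hundreds-to-thousands range; rescaling the input by 255 shrinks them by exactly that factor, with the same weights.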

Three input normalisation schemes

Three common ways to rescale a real-valued input feature x:

  Scheme                          Formula                    Output range
  Fixed scaling (8-bit images)    x / 255                    [0, 1]
  Min-max                         (x - min) / (max - min)    [0, 1]
  Z-score                         (x - μ) / σ                mean 0, std 1

Z-score normalisation is the most common in deep learning. The mean and standard deviation are computed from the training set and reused at test time. The result lands inputs near the “useful” zero region of common activation functions (ReLU’s elbow, sigmoid’s steep central section).

TIP — Compute statistics on training data only

μ and σ are properties of the training distribution. Recomputing them from validation or test data would leak information from those splits into the model’s preprocessing — a subtle form of train/test contamination. Always compute statistics once on training data and apply the same shift and scale at validation and test time.
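A minimal NumPy sketch of this discipline; the feature matrices are synthetic stand-ins for real data:

```python
import numpy as np

# Synthetic stand-ins for real train/test feature matrices.
rng = np.random.default_rng(1)
X_train = rng.normal(50.0, 12.0, size=(1000, 3))
X_test = rng.normal(50.0, 12.0, size=(200, 3))

# Statistics come from the TRAINING split only...
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# ...and the same shift and scale are reused at validation/test time.
X_train_norm = (X_train - mu) / sigma
X_test_norm = (X_test - mu) / sigma

print(X_train_norm.mean(axis=0).round(6))  # zeros by construction
print(X_test_norm.mean(axis=0).round(2))   # near zero, not exactly
```

The test-set means come out close to zero but not exactly: they were standardised with the training statistics, which is the point.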

Why input normalisation isn’t enough

Normalising the input fixes layer 1. But layer 2 receives layer 1’s outputs — and as layer 1 trains, its weights change, so its output distribution drifts. Layer 2 is now chasing a moving target: by the time it has adapted to one input distribution, layer 1 has shifted to a new one. This is internal covariate shift, and it gets worse with depth: layer 10 in a 20-layer network has nine layers of drift to track.

The cure is to normalise inside the network, after every layer’s pre-activations. That’s batch normalisation.

Batch normalisation

Place a normalisation operation between a layer’s linear part (z = Wx + b) and its activation function. For each mini-batch of B examples, compute the mean and variance of z across the batch:

  μ_B = (1/B) Σᵢ zᵢ        σ²_B = (1/B) Σᵢ (zᵢ - μ_B)²

Normalise each example using these batch statistics:

  ẑᵢ = (zᵢ - μ_B) / √(σ²_B + ε)

The ε (a small constant, typically around 1e-5) prevents division by zero when the batch happens to have constant pre-activations. After this step, ẑ has mean 0 and variance 1 across the batch.
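The per-batch normalisation step can be sketched in NumPy (the ε default is a typical choice):

```python
import numpy as np

def batchnorm_normalise(z, eps=1e-5):
    """Normalise pre-activations z (shape: batch x features) to zero mean,
    unit variance per feature, using this mini-batch's own statistics."""
    mu = z.mean(axis=0)   # per-feature batch mean
    var = z.var(axis=0)   # per-feature batch variance
    return (z - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
z = rng.normal(3.0, 7.0, size=(32, 8))  # a badly-scaled batch of pre-activations
z_hat = batchnorm_normalise(z)
print(z_hat.mean(axis=0).round(6), z_hat.std(axis=0).round(3))
```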

The learnable scale and shift

Forcing every layer’s outputs to be exactly mean 0 and variance 1 is too restrictive — sometimes the network really does want differently-scaled activations. So batch norm adds two learnable parameters per channel, a scale γ and a shift β:

  yᵢ = γ ẑᵢ + β

γ and β let the model re-scale and re-centre the normalised activations — even undoing normalisation entirely if helpful (set γ = σ_B, β = μ_B). The point isn’t to hand-tune the scale; γ and β are trained by gradient descent like any other weights.

TIP — Why γ and β are learnable

Without them, batch norm forces a hard zero-mean unit-variance constraint at every layer — that constraint can hurt the network if the optimal pre-activation distribution is genuinely different. Adding γ and β as learnable parameters means the network, not the human, decides whether to keep the normalisation, undo it, or end up somewhere in between. Crucially, even when the network learns to undo the normalisation, gradient descent now starts from a well-scaled state instead of an arbitrary one.
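A quick numerical check of the "undo" claim, with γ set to √(σ²_B + ε) (i.e. σ_B up to the ε term) so the inverse is exact:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(3.0, 7.0, size=(32, 8))  # synthetic pre-activations

eps = 1e-5
mu, var = z.mean(axis=0), z.var(axis=0)
z_hat = (z - mu) / np.sqrt(var + eps)   # batch-normalised: mean 0, variance ~1

# gamma = sqrt(var + eps) and beta = mu exactly invert the normalisation.
gamma, beta = np.sqrt(var + eps), mu
y = gamma * z_hat + beta
print(np.abs(y - z).max())
```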

Train time vs test time

At training time, μ_B and σ²_B come from the current mini-batch. At test time you may only have one example — there’s no batch to compute statistics from. The standard fix:

  1. During training, maintain a running average of μ_B and σ²_B across mini-batches.
  2. At test time, freeze and use these running estimates instead of computing fresh batch statistics.

So batch norm has two modes of operation: training mode (computes per-batch stats, updates running averages) and evaluation mode (uses frozen running averages). Forgetting to switch modes is one of the most common batch-norm bugs.
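The two modes can be sketched as a small class. The exponential-moving-average update and the momentum value 0.1 are common defaults assumed here, not specified in the text:

```python
import numpy as np

class BatchNorm1d:
    """Minimal batch-norm sketch: per-batch statistics in training mode,
    frozen running averages in evaluation mode."""

    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        self.gamma = np.ones(num_features)
        self.beta = np.zeros(num_features)
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.eps, self.momentum = eps, momentum
        self.training = True

    def __call__(self, z):
        if self.training:
            mu, var = z.mean(axis=0), z.var(axis=0)
            # Running averages are updated only in training mode.
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mu
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Evaluation mode: frozen running estimates, no batch needed.
            mu, var = self.running_mean, self.running_var
        z_hat = (z - mu) / np.sqrt(var + self.eps)
        return self.gamma * z_hat + self.beta

rng = np.random.default_rng(0)
bn = BatchNorm1d(4)
for _ in range(200):                        # "training": running stats converge
    bn(rng.normal(5.0, 2.0, size=(64, 4)))

bn.training = False                         # switch to evaluation mode
y = bn(rng.normal(5.0, 2.0, size=(1, 4)))   # a single test example works fine
print(y.round(2))
```

Note the single-example call in evaluation mode: it works precisely because no fresh batch statistics are computed.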

Batch norm for CNNs: per-channel statistics

A convolution layer’s output is a 4D tensor of shape (batch, channels, height, width). Because the same filter slid over every spatial position to produce a given channel, every spatial location of that channel is computed by the same weights — so the natural statistic to normalise is the entire channel pooled across batch and space. For each channel, compute μ and σ² over the batch dimension and the spatial dimensions, then learn one (γ, β) pair per channel.

The diagram shows the 4D tensor flattened to 3D for visualisation: each cube has axes for batch, channel, and spatial (H and W collapsed into one axis). The shaded slab in batch normalisation is “all batches × all spatial positions” of one channel — that’s the slice over which mean and variance are computed. The result is channel-wise normalisation: each filter’s outputs get their own statistics, but those statistics aggregate information from every example and every pixel that channel produced.
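Per-channel pooling in NumPy, assuming the (batch, channels, H, W) layout:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(2.0, 3.0, size=(8, 16, 14, 14))  # (batch, channels, H, W)

# One mean and variance per channel, pooled over the batch axis and both
# spatial axes -- the "slab" described above.
mu = x.mean(axis=(0, 2, 3), keepdims=True)   # shape (1, 16, 1, 1)
var = x.var(axis=(0, 2, 3), keepdims=True)
x_hat = (x - mu) / np.sqrt(var + 1e-5)

# gamma/beta are per-channel too: one pair for each of the 16 filters.
gamma = np.ones((1, 16, 1, 1))
beta = np.zeros((1, 16, 1, 1))
y = gamma * x_hat + beta
```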

Why batch norm helps

Empirically: faster convergence, less sensitivity to initialisation, mild regularisation effect (each example sees noisy statistics from its mini-batch peers, similar in spirit to dropout). The mechanism is gradient flow — well-scaled pre-activations keep activation derivatives in the useful regime, especially for ReLU.

Other types of normalisation

The “slab” picture generalises. Different choices of which dimensions to pool over give different normalisation schemes, useful when batch normalisation’s batch-level statistics are unreliable (small batches) or unwanted (style transfer).

  Scheme          Statistics computed over                               When it shines
  Batch norm      Batch + spatial (H, W); per channel                    Standard CNN training with reasonable batch sizes
  Layer norm      Channels + spatial; per sample                         Recurrent networks, transformers — independent of batch size
  Instance norm   Spatial only; per sample, per channel                  Style transfer (each image’s style is normalised independently)
  Group norm      Spatial + a manually chosen channel group; per sample  Small-batch CNN training, where batch norm’s statistics get noisy

All four use the same formula plus learnable γ and β. Only the shape of the slab over which μ and σ² are computed changes. Batch norm is the default for CNNs at typical batch sizes; the others exist for specific cases where batch norm’s reliance on batch statistics is a problem.
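The "shape of the slab" reduces to a choice of reduction axes. A sketch for an (N, C, H, W) tensor; the group count of 4 is an arbitrary choice for illustration:

```python
import numpy as np

x = np.random.default_rng(0).normal(size=(8, 16, 14, 14))  # (N, C, H, W)

batch_mu = x.mean(axis=(0, 2, 3))  # batch norm: one statistic per channel
layer_mu = x.mean(axis=(1, 2, 3))  # layer norm: one per sample
inst_mu = x.mean(axis=(2, 3))      # instance norm: per sample, per channel
# group norm: reshape channels into groups first (4 groups is arbitrary here)
group_mu = x.reshape(8, 4, 4, 14, 14).mean(axis=(2, 3, 4))

print(batch_mu.shape, layer_mu.shape, inst_mu.shape, group_mu.shape)
```

The resulting shapes, (16,), (8,), (8, 16), and (8, 4), are exactly the "per channel" / "per sample" / "per sample, per channel" / "per sample, per group" columns of the table.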

  • activation-functions — normalisation keeps inputs in the useful (non-saturating) regime of these
  • weight-initialization — addresses the same scale problem at initialisation; normalisation maintains it through training
  • gradient-descent-variants — mini-batch SGD is the context where “batch” in batch norm makes sense
  • backpropagation — gradients flow through normalisation as just another differentiable operation
  • dropout — another technique with regularisation as a side effect; the two are commonly combined

Active Recall