A cat is still a cat regardless of where it sits in the frame. CNNs handle this gracefully because their architecture has built-in symmetries: convolution layers are shift-equivariant (features move with the object), and pooling + FC layers eventually convert that into shift invariance (the final output is the same).

The two properties

Let $f$ be the function the network computes (input → output) and $T_\delta$ be the operator that shifts an image by a vector $\delta$.

Shift invariance: $f(T_\delta x) = f(x)$ — shifting the input doesn’t change the output. The network gives the same answer regardless of where in the frame the content sits.

Shift equivariance: $f(T_\delta x) = T_\delta f(x)$ — shifting the input shifts the output by the same amount. The output structure tracks the input’s spatial structure.

These are different. Invariance throws away spatial information; equivariance preserves it. Both are useful, and CNNs exhibit each in different parts of the architecture.
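
To make the two definitions concrete, here is a minimal NumPy sketch (illustrative, not from the source) that uses a circular shift as the operator $T_\delta$ and checks each property:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))                    # toy "image"
T = lambda a: np.roll(a, (1, 2), axis=(0, 1))  # circular shift operator

f_inv = lambda a: a.max()   # global max: shift-invariant (output has no spatial structure)
f_eq = lambda a: 2.0 * a    # element-wise map: shift-equivariant (output moves with input)

print(np.isclose(f_inv(T(x)), f_inv(x)))    # True: f(T x) == f(x)
print(np.allclose(f_eq(T(x)), T(f_eq(x))))  # True: f(T x) == T f(x)
```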

When to want which

  • Classification wants invariance. “Is this a cat?” should yield “yes” regardless of where the cat appears. The output (a class label) has no spatial structure, so the network should be insensitive to spatial shifts of the input.
  • Segmentation wants equivariance. “Which pixels are cat?” must move when the cat moves — the output is a spatial map and has to align with the input. Invariance would be wrong (you’d lose the cat’s position).
  • Object detection wants both: the class should be invariant (cat is cat anywhere), but the bounding box should be equivariant (the box moves with the object).

Convolution is shift-equivariant — by construction

The convolution operation uses the same kernel at every spatial position. If you shift the input, the kernel still finds the same patterns, just in different output positions. Formally:

$$\mathrm{conv}(T_\delta x) = T_\delta\, \mathrm{conv}(x)$$

This isn’t a learned property — it falls out of the architecture’s weight sharing. Every conv layer has it. Stacking conv layers preserves it: a stack of equivariant layers is itself equivariant.
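
A quick numerical check, as a sketch: with circular (“wrap”) padding the equivariance is exact; with zero padding it holds only away from the image borders.

```python
import numpy as np
from scipy.ndimage import convolve

rng = np.random.default_rng(1)
x = rng.normal(size=(16, 16))
k = rng.normal(size=(3, 3))                   # arbitrary kernel: the property is architectural

conv = lambda a: convolve(a, k, mode="wrap")  # circular padding keeps the symmetry exact
T = lambda a: np.roll(a, (3, 5), axis=(0, 1))

print(np.allclose(conv(T(x)), T(conv(x))))    # True: conv(T x) == T conv(x)
```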

This is the technical reason why CNNs handle shifted objects well. An MLP, by contrast, has no such symmetry: every (input pixel, hidden unit) pair has its own weight, so the network learns position-specific detectors. Move a digit “7” from the centre to a corner and the MLP sees a totally different input. The CNN just sees the same features in different locations.

Pooling adds shift invariance

If a feature shifts by 1 pixel within a pooling window, the max value is unchanged — the pool output is identical regardless of the small shift. After a max pool, the network is approximately invariant to shifts of 1 pixel in the original feature map.
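
A small sketch with a hand-rolled $2 \times 2$ max pool: a 1-pixel shift that stays inside a pooling window leaves the output unchanged, while a shift across a window boundary does not, which is why the invariance is only approximate.

```python
import numpy as np

def max_pool(a, k=2):
    """Non-overlapping k x k max pooling (assumes dims divisible by k)."""
    h, w = a.shape
    return a.reshape(h // k, k, w // k, k).max(axis=(1, 3))

x = np.zeros((8, 8))
x[2, 2] = 1.0                                # one strong feature activation

print(np.array_equal(max_pool(x), max_pool(np.roll(x, 1, axis=1))))  # True: shift stays in window

x[2, 2], x[2, 3] = 0.0, 1.0                  # same feature, now at a window edge
print(np.array_equal(max_pool(x), max_pool(np.roll(x, 1, axis=1))))  # False: shift crosses windows
```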

Stack several pool layers, and the invariant range grows: each pool layer absorbs shifts within its window. After 4 pool layers (each $2 \times 2$), the network is invariant to shifts of up to roughly $2^4 = 16$ pixels in the original input — comparable to the spatial reduction the pools have produced together.

Average pooling provides the same kind of invariance: averaging over a window is approximately unchanged by small shifts inside it.

The FC head completes the invariance

The flatten + fully connected layers at the end of a CNN have no notion of spatial position whatsoever. By the time the spatial feature representation has been pooled down to (say) $7 \times 7 \times 512$ and flattened to a 25,088-dimensional vector, the FC layer just sees a long vector of feature activations. Whatever spatial information remained at that stage is now mixed across all positions.
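
As a concrete sketch (assuming a VGG-style $7 \times 7 \times 512$ final feature map), the flatten-plus-FC step looks like this; note that every output unit mixes all spatial positions:

```python
import torch
import torch.nn as nn

feat = torch.randn(1, 512, 7, 7)   # pooled feature map: channels x height x width
vec = nn.Flatten()(feat)           # -> shape (1, 25088): spatial layout discarded
fc = nn.Linear(512 * 7 * 7, 1000)  # each class score mixes every position and channel
print(vec.shape, fc(vec).shape)    # torch.Size([1, 25088]) torch.Size([1, 1000])
```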

So the architecture’s overall flow is:

  1. Conv stack — shift-equivariant feature extraction. Spatial positions of features track input.
  2. Pool stack — adds local shift invariance step by step, while still keeping a coarse spatial map.
  3. FC head — discards spatial position entirely; produces a single class score.

The output is approximately shift-invariant. The intermediate feature maps remain (approximately) shift-equivariant.
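
Here is a minimal PyTorch sketch of the whole flow. To make the point cleanly, this toy model uses circular padding and global average pooling, so its output is exactly invariant to circular shifts; a real CNN with zero padding and an FC head is only approximately invariant.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1, padding_mode="circular"),  # equivariant feature extraction
    nn.ReLU(),
    nn.Conv2d(8, 8, 3, padding=1, padding_mode="circular"),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),  # global average pool: collapses spatial dims into invariance
    nn.Flatten(),
    nn.Linear(8, 10),         # class scores
)

x = torch.randn(1, 1, 32, 32)
x_shifted = torch.roll(x, shifts=(5, 7), dims=(2, 3))

with torch.no_grad():
    print(torch.allclose(model(x), model(x_shifted), atol=1e-5))  # True
```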

Why this matters: generalisation

A network without these symmetries has to learn shift-handling from scratch — by being shown the same object in many positions during training. This is data-hungry, and the learned invariance rarely extends cleanly to positions unseen in training.

A network with these symmetries by design generalises automatically: train on cats in the centre, and the network correctly classifies cats at the edge too, because the feature detectors aren’t position-specific. This is one of the reasons CNNs need much less data per class than MLPs would.

The general lesson is that architectural symmetries reduce the effective hypothesis space. The network can’t learn position-dependent classifiers because its weights are shared across positions — and that’s a good constraint, since position-dependence isn’t what we want anyway.

Beyond translation: other invariances

Translation (shift) is the cleanest invariance because it’s a direct consequence of weight sharing. But classification systems benefit from invariance to other transformations too:

| Transformation | Invariance ideal | How CNNs partially achieve it |
| --- | --- | --- |
| Translation | Cat is cat anywhere in frame | Convolution + pooling (architectural) |
| Scale | Cat is cat whether near or far | Multi-scale features (deeper layers see larger receptive fields); data augmentation (resize during training) |
| Rotation | Cat is cat right-side-up or tilted | Mostly learned via data augmentation (random rotations during training) — not architectural |
| Lighting / colour | Cat is cat in bright or shadow | Normalisation (batch norm, input normalisation); data augmentation |

Translation is the only invariance the architecture builds in for free. The others are achieved either by training-data tricks (augmentation) or by post-hoc normalisation. Some specialised architectures (group-equivariant CNNs, capsule networks) build other invariances directly into the weights, but they’re not standard.
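
For illustration, a typical torchvision augmentation pipeline (the parameter values are hypothetical, not from the source). Each transform trains in an invariance that the architecture does not provide; note that translation needs no augmentation here.

```python
import torchvision.transforms as T

augment = T.Compose([
    T.RandomResizedCrop(224, scale=(0.5, 1.0)),   # scale: see objects near and far
    T.RandomRotation(degrees=15),                 # rotation: see objects tilted
    T.ColorJitter(brightness=0.4, contrast=0.4),  # lighting/colour: see objects in varied light
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],       # input normalisation (ImageNet statistics)
                std=[0.229, 0.224, 0.225]),
])
```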

Which operations preserve, build, or break equivariance

Different layers play different roles in the equivariance vs invariance trade-off:

| Operation | Effect on shift equivariance |
| --- | --- |
| Convolution (stride 1) | Preserves equivariance exactly |
| ReLU / element-wise activations | Preserves equivariance (commutes with shift) |
| Batch normalisation | Preserves equivariance (per-channel, position-independent) |
| Strided convolution (stride > 1) | Approximately preserves; only exact for shifts that are multiples of the stride |
| Max pooling | Approximately invariant to small shifts within the window; breaks exact equivariance |
| Fully connected layer | Destroys spatial structure; produces invariance |
| Global average pooling | Fully invariant — collapses spatial dims away |

The architectural recipe is now visible: stack equivariance-preserving operations (conv + activation + maybe batch norm) to extract spatial features, then break equivariance with pool/FC/global-avg layers when the task no longer needs position information.

Designing for equivariance vs invariance

The distinction has practical implications when designing CNNs for non-classification tasks.

For segmentation, you want shift equivariance throughout. Strategies:

  • Use fully convolutional networks (FCNs) — replace FC layers with conv layers, so the output is a spatial map (a minimal sketch follows this list).
  • Avoid pooling, or pair pooling with upsampling (encoder-decoder structure like U-Net) so the output is at the same resolution as the input.
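
A minimal fully convolutional sketch with hypothetical layer sizes: with no flatten or FC layer, the output is a per-pixel score map that shifts with the input.

```python
import torch
import torch.nn as nn

fcn = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 2, 1),  # 1x1 conv: per-pixel class scores, same H x W as the input
)

out = fcn(torch.randn(1, 3, 64, 64))  # -> (1, 2, 64, 64): one score map per class
```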

For classification, you want invariance at the end and equivariance in the middle. Standard CNN with conv → pool → FC is fine.

For detection, the architecture has multiple heads: one for class (invariant), one for bounding box coordinates (equivariant — the box should move with the object). The shared backbone is equivariant; the heads diverge.
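
A sketch of that split, with hypothetical shapes: the shared backbone feeds an invariant class head (global pooling discards position) and an equivariant box head (a $1 \times 1$ conv whose output stays a spatial map).

```python
import torch
import torch.nn as nn

class TwoHeadNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(       # shared, shift-equivariant
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
        )
        self.class_head = nn.Sequential(     # invariant: global pool discards position
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10),
        )
        self.box_head = nn.Conv2d(16, 4, 1)  # equivariant: per-position box offsets

    def forward(self, x):
        feats = self.backbone(x)
        return self.class_head(feats), self.box_head(feats)

net = TwoHeadNet()
logits, boxes = net(torch.randn(1, 3, 32, 32))  # logits: (1, 10), boxes: (1, 4, 32, 32)
```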

A small caveat: discrete shifts

The textbook shift-invariance/equivariance definitions assume continuous translations. In practice, images are discrete, and convolutions with stride > 1 or pooling break exact equivariance — they only commute with shifts that are multiples of the stride/pool size. So real CNNs are approximately shift-equivariant, not exactly. The approximation is good enough that the basic intuition holds, but it’s worth knowing.
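
A quick sketch of the stride caveat, modelling a stride-2 convolution as convolve-then-subsample: a shift by a multiple of the stride commutes (the output simply shifts), while an odd shift produces an output that is not any shift of the original.

```python
import numpy as np
from scipy.ndimage import convolve

rng = np.random.default_rng(2)
x = rng.normal(size=(16, 16))
k = rng.normal(size=(3, 3))

conv_s2 = lambda a: convolve(a, k, mode="wrap")[::2, ::2]  # stride-2 conv as conv + subsample

even = conv_s2(np.roll(x, 2, axis=1))                      # input shifted by a stride multiple
print(np.allclose(even, np.roll(conv_s2(x), 1, axis=1)))   # True: output shifts by 2/2 = 1

odd = conv_s2(np.roll(x, 1, axis=1))                       # shift not a stride multiple
print(any(np.allclose(odd, np.roll(conv_s2(x), s, axis=1)) for s in range(8)))  # False
```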

Active Recall