The architecture that made dense prediction practical. Take an encoder that compresses the image into an abstract feature representation, mirror it with a decoder that grows back up to full resolution, and connect each encoder layer directly to its decoder counterpart. The skip connections give the decoder access to fine-grained spatial detail that pooling threw away. Originally designed for biomedical image segmentation, now the dominant template for any per-pixel vision task.

The problem U-Net solves

For classification, a CNN takes an image and produces a single class label. The network can pool and downsample aggressively, because the output is one number per image. For semantic segmentation the output is one class label per pixel — a full label map the same size as the input.

This creates a tension:

  • Pooling helps the network see context — global features, large-scale shapes, object identity — by compressing spatial detail into compact feature maps.
  • But segmentation needs fine spatial detail — pixel-precise edges of objects, exactly where one tissue ends and another begins.
  • Without pooling, the network can’t see the big picture — it stays at full resolution but every neuron’s receptive field is tiny, so it can’t recognise larger structures.

You can’t have it both ways with a single forward chain. U-Net resolves the tension by using both a path that pools (for context) and another path that preserves fine detail (for location), then combining them.

TIP — Forest and trees

Standard CNNs make you choose: pool aggressively and you see the forest (the object as a whole) but lose the trees (where its edges are); refuse to pool and you see the trees but never the forest. Segmentation needs both — to know it’s a tumour and to know exactly where its boundary is. U-Net’s encoder sees the forest; its skip connections preserve the trees; its decoder combines the two views into a per-pixel answer.

The U-shaped architecture

The architecture, drawn schematically, is a U:

  • Left arm (the encoder, “contracting path”): a standard CNN. Each level applies a couple of convolutions, then max pooling to halve spatial size. As resolution drops, the channel count grows: 64 → 128 → 256 → 512. By the bottom of the U, the feature map is small spatially but rich semantically — it knows what is in the image.

  • Bottom of the U (the bottleneck): the deepest, most abstract feature representation. Spatial size is small (say 32×32 from a 512×512 input, after four poolings), channel count is large (1024). This is where the network has the broadest view but the least spatial precision.

  • Right arm (the decoder, “expanding path”): mirror image of the encoder. Each level uses upsampling (or transposed convolution) to double spatial size, then a couple of convolutions. As resolution grows, channel count drops back down: 512 → 256 → 128 → 64. By the top, the spatial size matches the input (or close to it) and channel count is small.

  • Final layer: a 1×1 convolution to turn the final feature map into a per-pixel class score map of shape H × W × K, where K is the number of classes. An argmax across the channel dimension gives the predicted class for each pixel. (A code sketch of the whole architecture follows this list.)
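
A minimal sketch of this architecture in PyTorch (an assumption — the text names no framework). It uses same convolutions so no cropping is needed, and the channel widths follow the progression above; the class and method names (DoubleConv, UNet) are illustrative, not standard:

```python
import torch
import torch.nn as nn

class DoubleConv(nn.Module):
    """Two 3x3 convolutions with ReLU — the basic block at every U-Net level."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.block(x)

class UNet(nn.Module):
    def __init__(self, in_ch=3, num_classes=2):
        super().__init__()
        # Contracting path: channels grow 64 -> 128 -> 256 -> 512 as resolution halves.
        self.enc1 = DoubleConv(in_ch, 64)
        self.enc2 = DoubleConv(64, 128)
        self.enc3 = DoubleConv(128, 256)
        self.enc4 = DoubleConv(256, 512)
        self.pool = nn.MaxPool2d(2)
        # Bottleneck: smallest spatial size, largest channel count.
        self.bottleneck = DoubleConv(512, 1024)
        # Expanding path: transposed convs double resolution; channels shrink back.
        self.up4 = nn.ConvTranspose2d(1024, 512, 2, stride=2)
        self.dec4 = DoubleConv(1024, 512)   # 1024 in: 512 upsampled + 512 skipped
        self.up3 = nn.ConvTranspose2d(512, 256, 2, stride=2)
        self.dec3 = DoubleConv(512, 256)
        self.up2 = nn.ConvTranspose2d(256, 128, 2, stride=2)
        self.dec2 = DoubleConv(256, 128)
        self.up1 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec1 = DoubleConv(128, 64)
        # Final 1x1 convolution: per-pixel class scores.
        self.head = nn.Conv2d(64, num_classes, 1)

    def forward(self, x):
        s1 = self.enc1(x)                    # full resolution,  64 channels
        s2 = self.enc2(self.pool(s1))        # 1/2 resolution,  128 channels
        s3 = self.enc3(self.pool(s2))        # 1/4 resolution,  256 channels
        s4 = self.enc4(self.pool(s3))        # 1/8 resolution,  512 channels
        b  = self.bottleneck(self.pool(s4))  # 1/16 resolution, 1024 channels
        # Each decoder level: upsample, concatenate the skip, convolve to mix.
        d4 = self.dec4(torch.cat([self.up4(b), s4], dim=1))
        d3 = self.dec3(torch.cat([self.up3(d4), s3], dim=1))
        d2 = self.dec2(torch.cat([self.up2(d3), s2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), s1], dim=1))
        return self.head(d1)                 # shape (N, num_classes, H, W)

scores = UNet(in_ch=3, num_classes=5)(torch.randn(1, 3, 256, 256))
pred = scores.argmax(dim=1)                  # per-pixel labels, shape (1, 256, 256)
```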

The skip connections

The genius of U-Net is the skip connections from encoder to decoder, drawn as horizontal arrows across the U.

When the decoder upsamples a feature map, the result is blurry — it knows the object is roughly here, but the edges are smeared because pooling discarded fine spatial detail in the encoder. The fix: copy the corresponding high-resolution feature map from the encoder side and concatenate it to the upsampled decoder feature map (along the channel dimension), then apply a couple of convolutions to mix them.

Now the decoder has, at every resolution level:

  1. Deep, abstract features flowing up from the bottleneck — these tell it what the object is.
  2. High-resolution features flowing across from the corresponding encoder level — these tell it where exactly the boundaries are.

Subsequent convolutions learn to combine the two sources into a sharper output. The result is segmentation that’s both semantically correct (the network correctly identifies cells, tumours, road lanes) and spatially precise (the boundaries align with real edges in the image).
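
To make the “what plus where” combination concrete, a shape-level sketch of a single decoder step (assuming PyTorch; the tensor sizes are illustrative assumptions, not from the text):

```python
import torch

# One decoder level: the two information sources meet here.
upsampled = torch.randn(1, 512, 64, 64)  # deep features, upsampled from below ("what")
skipped   = torch.randn(1, 512, 64, 64)  # encoder features at this resolution ("where")

# Concatenate along the channel dimension; subsequent convs learn to mix them.
merged = torch.cat([upsampled, skipped], dim=1)
print(merged.shape)  # torch.Size([1, 1024, 64, 64]) — both sources, side by side
```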

ASIDE — Concatenation vs addition

ResNet’s skip connections add the input to the output: y = F(x) + x. U-Net’s skip connections concatenate the encoder feature map to the decoder feature map along the channel dimension: y = [x_enc ; x_dec]. The role is similar (preserve information that would otherwise be lost), but the mechanism is different. Concatenation doubles the channel count and lets subsequent convolutions decide how to mix the sources; addition keeps the channel count fixed and forces the layers to learn a residual correction. Both are called “skip connections” in casual conversation; the distinction matters when reading architecture diagrams or implementing one.
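
The channel-count difference in two lines (assumed PyTorch, illustrative sizes):

```python
import torch

x  = torch.randn(1, 64, 32, 32)
fx = torch.randn(1, 64, 32, 32)       # stand-in for a conv block's output F(x)

y_res  = x + fx                       # ResNet-style: channels unchanged (64)
y_unet = torch.cat([x, fx], dim=1)    # U-Net-style: channels doubled (128)
print(y_res.shape, y_unet.shape)      # (1, 64, 32, 32)  (1, 128, 32, 32)
```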

The encoder-decoder framing

U-Net is one example of a more general pattern: the encoder-decoder architecture.

  • The encoder transforms the input into some compact, abstract representation — encoded features.
  • The decoder transforms this representation back into a full-resolution output of the desired form — segmentation map, generated image, transcribed text.

This pattern recurs across deep learning:

  • Autoencoders for unsupervised representation learning (encoder + decoder, target = input).
  • Variational autoencoders for generative modelling.
  • Transformers for sequence-to-sequence tasks (encoder + decoder, target = output sequence).
  • Image-to-image translation networks for tasks like colourisation, style transfer, super-resolution.

In all cases the bottleneck is where the most abstract representation lives. What distinguishes U-Net from a plain encoder-decoder is its skip connections, which make the decoder’s job easier.

Original cropping and modern variants

The original U-Net (Ronneberger et al., 2015) used valid convolutions (no padding) throughout, which meant each convolution shrank the spatial size slightly. As a result, the output of the original U-Net was smaller than the input — only the central region of the prediction was used; the borders were discarded. The skip connections also required cropping the encoder feature maps to match the (slightly smaller) decoder spatial size before concatenation.

Modern U-Net variants typically use same convolutions (with padding) so that the output matches the input size exactly, and skip connections concatenate without cropping. The original cropping is a historical detail you’ll see in some diagrams.
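
For reading original-style diagrams, here is what that cropping amounted to — a sketch with a hypothetical center_crop helper (the sizes are illustrative of the valid-convolution setup, assuming PyTorch):

```python
import torch

def center_crop(enc_feat, target_hw):
    """Crop an encoder feature map to the decoder's smaller spatial size,
    as the original valid-convolution U-Net required before concatenation."""
    _, _, h, w = enc_feat.shape
    th, tw = target_hw
    top, left = (h - th) // 2, (w - tw) // 2
    return enc_feat[:, :, top:top + th, left:left + tw]

enc = torch.randn(1, 64, 136, 136)   # encoder map
dec = torch.randn(1, 64, 104, 104)   # decoder map, smaller after valid convs
merged = torch.cat([center_crop(enc, dec.shape[-2:]), dec], dim=1)
print(merged.shape)                  # torch.Size([1, 128, 104, 104])
```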

U-Net in practice

The architecture has been wildly successful in:

  • Medical imaging — its original domain. Cell nuclei segmentation, tumour boundaries, organ segmentation, retinal vessel extraction. Modest dataset sizes are typical here, and U-Net’s combination of context (via the encoder) and detail (via skip connections) is well-suited.
  • Self-driving perception — semantic segmentation of road, lane, vehicle, pedestrian classes per pixel.
  • Generative models — diffusion models for image generation use U-Net as the noise-prediction network.
  • Earth observation — segmenting satellite imagery into land-use categories.

Often paired with transfer learning: use an ImageNet-pretrained CNN (e.g., ResNet) as the encoder, train the decoder from scratch on the target segmentation task. This dramatically reduces the data requirement.
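
A sketch of that encoder swap, assuming PyTorch with torchvision ≥ 0.13 (the stage split and names are illustrative; the decoder wiring is elided):

```python
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

backbone = resnet18(weights=ResNet18_Weights.DEFAULT)  # ImageNet-pretrained

# Expose the intermediate feature maps that will feed the skip connections.
stem   = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu)  # 1/2 res,  64 ch
stage1 = nn.Sequential(backbone.maxpool, backbone.layer1)            # 1/4 res,  64 ch
stage2 = backbone.layer2                                             # 1/8 res, 128 ch
stage3 = backbone.layer3                                             # 1/16 res, 256 ch
stage4 = backbone.layer4                                             # 1/32 res, 512 ch

# A from-scratch decoder would upsample stage4's output and concatenate the
# stage3/2/1/stem maps at each level, exactly as in a plain U-Net.
```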

  • convolutional-neural-network — U-Net is a CNN architecture; the encoder is a standard CNN backbone
  • upsampling — the operation used in the decoder to grow feature maps back up
  • pooling — the operation used in the encoder to shrink feature maps
  • residual-connection — different flavour of skip connection (addition, not concatenation), often combined with U-Net
  • shift-invariance-equivariance — segmentation needs equivariance (output shifts with input), not just invariance, which is why fully convolutional designs work
  • transfer-learning — pre-trained encoders are commonly used to bootstrap U-Net training on small medical datasets

Active Recall