The opposite of pooling. Pooling shrinks a feature map; upsampling grows it. Needed whenever the output of a network must match the spatial size of its input — most importantly in segmentation, where every pixel needs a prediction. Three standard methods sit on a spectrum from cheap-and-fixed to learnable-and-expensive.

Why upsample at all

A standard CNN downsamples through pooling: H×W → H/2×W/2 → H/4×W/4 → …. Spatial resolution drops, channel count rises, and the final layers see a coarse, abstract representation of the input. Fine for classification (one label per image).

But for dense prediction — semantic segmentation, depth estimation, image-to-image translation — the network’s output must be the same size as its input: a per-pixel label map of shape H×W rather than a single class score. After downsampling to abstract features, the network has to climb back to full resolution.

That’s the upsampling step. It takes a small spatial map (say, H×W) and produces a larger one (say, 2H×2W or 4H×4W), filling in the new pixels by some chosen rule.
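
In PyTorch, the fixed-rule versions of this step are exposed through torch.nn.functional.interpolate. A minimal sketch (the tensor shapes below are illustrative, not taken from the text):

```python
import torch
import torch.nn.functional as F

# A small feature map: batch of 1, 64 channels, 7x7 spatial grid (sizes are illustrative).
x = torch.randn(1, 64, 7, 7)

# Grow it spatially; the rule used to fill in the new pixels is the "mode".
up2 = F.interpolate(x, scale_factor=2, mode="nearest")    # (1, 64, 14, 14)
up4 = F.interpolate(x, scale_factor=4, mode="bilinear",
                    align_corners=False)                   # (1, 64, 28, 28)

print(up2.shape, up4.shape)
```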

Nearest-neighbour upsampling

The simplest method: each pixel in the small map is copied into a k×k block in the larger map, where k is the upsampling factor.

For a 2×2 input map upsampled 2×:

Input:
1 2
3 4

Output:
1 1 2 2
1 1 2 2
3 3 4 4
3 3 4 4

Each input value v becomes a 2×2 block of v's in the output. No computation, no parameters — just memory rearrangement.

Effect: the output is blocky. Sharp edges and visible “Lego” pixelation, because every block has identical values. This is what you see when you naively zoom in on a small image in any image viewer.
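
A quick check of the block-copy behaviour, assuming PyTorch (the 2×2 example values are the same illustrative ones as above):

```python
import torch
import torch.nn.functional as F

# (N, C, H, W) layout; a 2x2 map with values 1..4.
x = torch.tensor([[1., 2.],
                  [3., 4.]]).reshape(1, 1, 2, 2)

# Built-in nearest-neighbour upsampling, factor 2.
up = F.interpolate(x, scale_factor=2, mode="nearest")

# The same thing by hand: repeat each pixel into a 2x2 block.
by_hand = x.repeat_interleave(2, dim=2).repeat_interleave(2, dim=3)

print(up.squeeze())                 # the blocky 4x4 map shown above
print(torch.equal(up, by_hand))     # True
```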

Bilinear interpolation

Smoother. The output values are computed by interpolating between the input values, weighted by distance.

1D case first: linear interpolation

Two known values: a at position 0 and b at position 1. To estimate the value at some intermediate position t, with 0 ≤ t ≤ 1:

v(t) = (1 − t)·a + t·b

where t is the distance from a and (1 − t) is the distance from b. The closer the new point is to a (small t), the more weight a gets. Note the swap: a is multiplied by (1 − t), the distance to the far value b, and vice versa. The intuition: if I'm right next to a, my value should be mostly a.

Concrete example: estimate a value 25% of the way from a = 60 to b = 100. Then t = 0.25, 1 − t = 0.75, and v = 0.75 · 60 + 0.25 · 100 = 70, closer to 60 than to 100, as expected.
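
As a sanity check, the 1D formula is a one-liner (plain Python; the name lerp and the argument names are my own):

```python
def lerp(a, b, t):
    """Value at fraction t of the way from a (position 0) to b (position 1)."""
    return (1 - t) * a + t * b

print(lerp(60, 100, 0.25))   # 70.0, closer to 60 than to 100
```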

2D case: bilinear

“Bilinear” = “linear in each of two dimensions”. Apply the 1D formula twice — once horizontally, then once vertically (or vice versa; the result is the same).

For each new pixel:

  1. Use linear interpolation along the top edge to get a value above the new pixel.
  2. Use linear interpolation along the bottom edge to get a value below it.
  3. Use linear interpolation between those two values to get the new pixel.

The result is a smooth gradient — no blocky patches, but a continuous transition from each input pixel’s value to its neighbours’.
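
A sketch of that three-step recipe for a single output pixel (plain Python; the helper name bilerp and the example values are invented for illustration):

```python
def bilerp(tl, tr, bl, br, tx, ty):
    """Bilinearly interpolate inside one cell of the input grid.

    tl, tr, bl, br are the four surrounding input values (top-left, top-right,
    bottom-left, bottom-right); tx, ty in [0, 1] are the horizontal and
    vertical offsets of the new pixel within the cell."""
    top = (1 - tx) * tl + tx * tr        # step 1: along the top edge
    bottom = (1 - tx) * bl + tx * br     # step 2: along the bottom edge
    return (1 - ty) * top + ty * bottom  # step 3: between those two values

# A point 25% across and 50% down, between 10, 20 (top) and 30, 40 (bottom):
print(bilerp(10, 20, 30, 40, tx=0.25, ty=0.5))   # 22.5
```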

Effect compared to nearest neighbour

Method              Output character               Cost                                  Learnable?
Nearest neighbour   Blocky, “pixelated”            None                                  No
Bilinear            Smooth, slightly blurred       Small (per-pixel weighted average)    No
Transposed conv     Whatever the network learned   Moderate (parameters and FLOPs)       Yes

Bilinear gives noticeably better results than nearest neighbour on natural images and is the default in most modern segmentation networks. It has no learnable parameters — the interpolation weights are fixed by geometry.

Transposed convolution: learnable upsampling

The third option treats upsampling as the transpose of convolution. Instead of sliding a kernel across an input to produce a smaller output (regular convolution), slide a kernel that expands each input pixel into multiple output pixels, and the output is larger than the input.

The kernel weights are learnable, just like in regular convolution. The network can therefore learn whatever upsampling pattern best serves the loss — possibly something cleverer than nearest-neighbour or bilinear.

Caveat: transposed convolutions are notorious for producing checkerboard artifacts (regular grid-like patterns in the output), because of how the kernel’s overlap aligns with the stride. Modern alternatives often pair bilinear upsampling with a regular convolution layer afterwards, getting learnability without the artefact.
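
A hedged PyTorch sketch of both routes, assuming a 64-channel feature map (the channel counts and kernel sizes are illustrative, not prescribed by the text):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 64, 16, 16)   # illustrative feature map

# Option A: learnable upsampling with a transposed convolution.
# kernel_size=2, stride=2 doubles the spatial size: 16 -> 32.
up_a = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)
print(up_a(x).shape)             # torch.Size([1, 32, 32, 32])

# Option B: fixed bilinear upsampling followed by a regular convolution,
# a common way to keep learnability while avoiding checkerboard artefacts.
conv = nn.Conv2d(64, 32, kernel_size=3, padding=1)
up_b = conv(F.interpolate(x, scale_factor=2, mode="bilinear",
                          align_corners=False))
print(up_b.shape)                # torch.Size([1, 32, 32, 32])
```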

This module focuses on bilinear and nearest-neighbour upsampling; transposed convolutions are mentioned for completeness but their full mechanics aren’t required.

Align-corners: a subtle option

Both bilinear and nearest-neighbour upsamplers in PyTorch take an align_corners argument that affects how the input grid is mapped to the output grid:

  • align_corners=False (default for neural networks). Treats pixels as cells with extent, aligning their edges. The output corners are not exact copies of the input corners — the upsampling is uniform across the field but doesn’t preserve corner values.
  • align_corners=True (default for image processing). Treats pixels as points, aligning the corner pixels exactly. Better for visualisation; worse for learning because gradients near the edge get distorted.

Worth knowing this option exists; the exact semantics matter for reproducing results across libraries but rarely for correctness in standalone projects.
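
A small PyTorch experiment that makes the difference visible, using a made-up 1×4 ramp as input:

```python
import torch
import torch.nn.functional as F

# A 1x4 "image" with the ramp 0, 1, 2, 3, stretched to 8 pixels wide.
x = torch.arange(4, dtype=torch.float32).reshape(1, 1, 1, 4)

for ac in (False, True):
    out = F.interpolate(x, size=(1, 8), mode="bilinear", align_corners=ac)
    print(ac, out.flatten().tolist())

# Both outputs still run from 0 to 3, but the two settings place the output
# samples at different input coordinates, so the interior values differ
# (e.g. second element: 0.25 with align_corners=False vs ~0.43 with True).
```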

Where upsampling appears

The dominant use case in this module: the decoder half of a U-Net or other encoder-decoder segmentation architecture. After the encoder downsamples through several pooling layers to a low-resolution / high-channel bottleneck, the decoder progressively upsamples back to full input resolution, while combining (via skip connections) high-resolution information from the corresponding encoder layers.
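
A minimal sketch of one such decoder step, assuming bilinear upsampling followed by a convolution (the class name DecoderBlock and all channel counts are hypothetical):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderBlock(nn.Module):
    """One hypothetical decoder step: upsample 2x, concatenate the skip
    connection from the matching encoder stage, then convolve."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        x = torch.cat([x, skip], dim=1)   # skip connection: reuse encoder detail
        return self.conv(x)

# Bottleneck features (14x14) plus the encoder's 28x28 skip tensor -> 28x28 output.
block = DecoderBlock(in_ch=256, skip_ch=128, out_ch=128)
out = block(torch.randn(1, 256, 14, 14), torch.randn(1, 128, 28, 28))
print(out.shape)   # torch.Size([1, 128, 28, 28])
```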

Outside segmentation, upsampling appears in:

  • Generative models — generators in GANs, decoders in autoencoders.
  • Super-resolution — taking a low-resolution input and producing a sharper, higher-resolution output.
  • Image-to-image translation — colourisation, style transfer, depth estimation.

Related

  • pooling — the downsampling operation that upsampling inverts
  • u-net — the canonical architecture using upsampling in its decoder
  • convolution — transposed convolution is a learnable alternative to fixed-rule upsampling
  • convolutional-neural-network — fully convolutional networks (FCNs) for segmentation are the architectural context

Active Recall