A CNN replaces the MLP’s fully connected hidden layers with stacked convolution layers, plus pooling for downsampling. The result is a network that scales to images by sharing weights across spatial positions, while still learning a hierarchy of features end-to-end.

What a CNN looks like

A canonical CNN has three kinds of layer in roughly this order:

  1. Convolution layers — many filters per layer, often paired with ReLU activation. These do the feature extraction. See convolution.
  2. Pooling layers — periodic downsampling (typically max pooling, stride 2). See pooling.
  3. Fully connected (FC) layers — at the end, after flattening the spatial structure. These produce the final class scores, with softmax on the output for multi-class classification.

A typical pattern: [CONV → ReLU → CONV → ReLU → POOL] × N → FLATTEN → [FC → ReLU] × M → FC → SOFTMAX. The conv-pool block repeats N times to extract increasingly abstract features at decreasing spatial resolution; the FC head turns the final feature representation into class scores.
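To make the pattern concrete, here is a minimal sketch in PyTorch; the framework and the specific sizes (a 32×32×3 input, 32 and 64 filters, 10 classes) are illustrative choices, not prescribed by the pattern itself.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    # [CONV -> ReLU -> CONV -> ReLU -> POOL] block 1
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                       # 32x32 -> 16x16
    # block 2
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                       # 16x16 -> 8x8
    # FC head
    nn.Flatten(),                          # 64 * 8 * 8 = 4096-dim vector
    nn.Linear(64 * 8 * 8, 128), nn.ReLU(),
    nn.Linear(128, 10),                    # class scores (softmax is applied in the loss)
)

scores = model(torch.randn(1, 3, 32, 32))  # shape (1, 10)
```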

At the level of feature volumes, the picture is a chain of decreasing-spatial / increasing-depth blocks: the input volume is convolved with a first bank of filters (plus a ReLU) to give a stack of activation maps; that stack is convolved with the next layer’s filters to give a deeper, smaller stack; and so on. Spatial dimensions shrink (by valid convolution or pooling); the channel dimension grows because each layer applies many filters in parallel.

A complete CNN pipeline makes the alternation visible: convolution layers (Conv1 … Conv4) interleaved with subsampling/pooling layers (Sub1 … Sub3), then a fully connected head (FC5, FC6) producing the final class scores. Spatial resolution drops step by step while channel depth rises from 1 → 80 → 96 → 128 → 160. The kernel size shrinks deeper into the network: large kernels early to cover lots of input pixels at once, smaller kernels later because pooling has already enlarged each neuron’s receptive field.

The convolution layer

A convolution layer is the operational unit of a CNN. It wraps the convolution operation in three additional ideas:

  1. Multiple filters in parallel. A single filter detects one type of feature. Real layers use K filters in parallel, each looking for a different pattern. Their outputs stack along the depth dimension to give a 3D output of depth K.
  2. Filter depth matches input depth. If the input has D channels, every filter has shape F × F × D. The dot product is computed across both spatial and depth dimensions, so each filter outputs a single 2D map.
  3. Activation function applied per output. After the dot product (and a per-filter bias), an activation like ReLU is applied. The result of the layer — conv → activation — is called an activation map (or feature map).

For an input of size W × H × D and K filters of size F × F × D with stride S and padding P, the output volume is W′ × H′ × K, where the magic formula gives W′ = (W − F + 2P)/S + 1 (and similarly for H′). The depth of the output equals the number of filters, set by the layer designer.
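A tiny helper makes the formula easy to check (a sketch; the function name is ours, and it assumes the division comes out whole):

```python
def conv_output_size(w, f, s, p):
    """Spatial output size from the 'magic formula' W' = (W - F + 2P)/S + 1."""
    return (w - f + 2 * p) // s + 1

# e.g. a 32-wide input, 5x5 filter, stride 1, padding 2 keeps the width at 32
assert conv_output_size(32, 5, 1, 2) == 32
```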

TIP — Two depth rules to keep straight

The depth dimension causes confusion. Two rules together fix it:

  1. The filter’s depth is not a choice — it equals the input’s depth. If the previous layer’s output has depth D, every filter in this layer has shape F × F × D. The depth dimension is consumed by the per-filter dot product.
  2. The output’s depth is a choice — it equals the number of filters K. You decide how many filters to apply; each one produces a 2D activation map; stacking them gives output depth K.

So the depth pattern is: input depth D (fixed by the previous layer) → output depth K (your choice this layer) → K becomes the next layer’s input depth, fixing its filter depth. The number of filters K is a hyperparameter at every layer; the filter depth is implied.
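A quick way to see both rules, assuming PyTorch: the filter depth shows up automatically in the weight tensor’s shape, while the number of filters is the argument you choose (the 64 → 32 sizes here are arbitrary).

```python
import torch.nn as nn

# Input depth 64 is fixed by the previous layer; 32 filters is our choice.
conv = nn.Conv2d(in_channels=64, out_channels=32, kernel_size=3)

print(conv.weight.shape)  # torch.Size([32, 64, 3, 3]) -> K filters, each 3x3x64
print(conv.bias.shape)    # torch.Size([32])           -> one bias per filter
```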

Counting parameters in a conv layer

Each filter has F × F × D weights (one per element of the local volume) plus 1 bias. With K filters:

  parameters = (F × F × D + 1) × K

The crucial observation: this number does not depend on W or H. Doubling the image size doesn’t double the parameter count of a conv layer — it just produces a larger output volume. That’s the weight-sharing payoff.
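A sanity check of the formula against an actual layer (a PyTorch sketch with arbitrary F, D, K):

```python
import torch.nn as nn

F, D, K = 3, 64, 32
conv = nn.Conv2d(D, K, kernel_size=F)

formula = (F * F * D + 1) * K                        # (FxFxD + 1) x K
actual = sum(p.numel() for p in conv.parameters())   # weights + biases
assert formula == actual == 18_464
```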

Why multiple filters

A single filter can detect a single type of pattern — say, vertical edges. Real images contain many feature types: vertical edges, horizontal edges, diagonal edges, dots, corners, textures. A layer needs multiple filters in parallel to extract all of them.

This is the only way the network can build a rich feature representation. With one filter, the next layer sees only “where vertical edges are”. With 64 filters, it sees “where vertical edges are, where horizontal edges are, where dots are, …” — a much richer 64-channel description that subsequent layers can compose into more complex features.

ASIDE — The extreme case: 1×1 convolutions

A 1×1 kernel sounds useless — it has no spatial extent, just a single weight per input channel. But its depth is still the full input depth D, so it performs a D-dimensional dot product per spatial position. In other words: a 1×1 convolution mixes information across channels at each pixel, without combining neighbouring pixels.

Concrete shape arithmetic: a W × H × 64 input convolved with 32 filters of size 1 × 1 × 64 gives a W × H × 32 output. Each filter performs a 64-dimensional dot product per pixel. Spatial dimensions are unchanged; the channel count goes from 64 → 32.

This is surprisingly useful. It lets a layer change the channel count cheaply (e.g., reduce 256 channels to 64 before an expensive spatial convolution — the “bottleneck” trick used in ResNet and Inception), and it lets the network learn channel-wise feature combinations without spatial pooling. 1×1 convolutions look trivial but are a workhorse in modern architectures.
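A minimal sketch of the bottleneck use in PyTorch (the 28×28 spatial size and the 256 → 64 reduction are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 28, 28)                  # a 28x28x256 feature map

bottleneck = nn.Conv2d(256, 64, kernel_size=1)   # 1x1 conv: mixes channels only
y = bottleneck(x)

print(y.shape)   # torch.Size([1, 64, 28, 28]) -> spatial size unchanged, 256 -> 64 channels
print(sum(p.numel() for p in bottleneck.parameters()))  # (1*1*256 + 1) * 64 = 16,448
```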

Stacking layers: the feature hierarchy

Putting many conv layers in a row produces a hierarchy of features with two key properties:

  • Increasing receptive field. Each successive layer’s neurons look at a wider patch of the original image. After many layers, a single neuron near the output can see the whole input.
  • Increasing abstraction. Layer 1 detects low-level features (edges, dots) directly from pixels. Layer 2 composes those into mid-level features (corners, eyes, wheel arcs). Layer 3+ composes those into objects (faces, cars).

TIP — What "receptive field" means

The receptive field of a neuron is the patch of the original input image that influences its value. It’s an answer to “if I changed a pixel of the input, would this neuron’s value change?” — the receptive field is the set of input pixels for which the answer is yes.

  • Layer 1 with 3×3 kernels: each output neuron sees a 3×3 patch of the input. Receptive field = 3×3.
  • Layer 2 with 3×3 kernels (no pooling): each layer-2 neuron sees a 3×3 patch of layer-1 activations, each of which came from a 3×3 patch of the input. Together, a layer-2 neuron sees a 5×5 patch of the original input. Receptive field = 5×5.
  • Layer 3: another 3×3 step adds 1 more pixel on each side (2 in total). Receptive field = 7×7.

Each pure conv layer with kernel size F adds F − 1 to the receptive field. Pooling enlarges it much faster: a 2×2 pool with stride 2 doubles the contribution of every subsequent layer (a 3×3 kernel applied after the pool sees a 3×3 patch of post-pool activations, which corresponds to a 6×6 patch of pre-pool activations). After several pool layers, deep neurons have receptive fields covering most or all of the input.

Why it matters: the receptive field is what the neuron “can see”. A neuron with a small receptive field can only detect edges; a neuron with a larger receptive field can detect parts of objects; a neuron whose receptive field covers the entire image can detect whole objects. The hierarchy of features is built by progressively enlarging the receptive field through depth and pooling.
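The two growth rules (each conv adds F − 1 times the current step size; each stride-2 layer doubles the step) fit in a few lines. This is a sketch, not a library routine:

```python
def receptive_field(layers):
    """Track receptive-field growth; each layer is (kernel_size, stride)."""
    rf, jump = 1, 1                    # jump = spacing of this layer's neurons in input pixels
    for k, s in layers:
        rf += (k - 1) * jump           # a kxk kernel adds (k-1)*jump input pixels
        jump *= s                      # striding/pooling multiplies the spacing
    return rf

print(receptive_field([(3, 1), (3, 1), (3, 1)]))           # 7  -> three stacked 3x3 convs
print(receptive_field([(3, 1), (2, 2), (3, 1), (3, 1)]))   # 12 -> pooling grows it faster
```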

This composition isn’t programmed — it emerges naturally from training on classification. Visualisations of trained CNNs consistently show:

  • Early layers: edges, oriented gradients, dark spots, simple textures.
  • Middle layers: eyes, ears, wheels, parts of objects.
  • Late layers: faces, recognisable objects, semantic concepts.

This matches the structure of the visual cortex in animals — V1 detects edges, V4 detects parts, IT detects objects. The CNN reproduces this hierarchy spontaneously because both systems are doing the same thing: solving a hard recognition problem by composing simple detectors.

Why pooling sits between conv blocks

Three reasons to interleave pooling:

  1. Compute and memory. Halving the spatial dimensions cuts the activation count by a factor of 4, making deeper networks feasible.
  2. Shift invariance. Within each pool window, small spatial shifts are absorbed by the max operation. After several pool layers and the FC head, the final classification is largely insensitive to where in the image the object sat. See shift-invariance-equivariance.
  3. Larger receptive fields. Each pool layer doubles the effective receptive field of subsequent conv layers — a 3×3 kernel after one 2× pool covers roughly a 6×6 patch of the pre-pool input, and roughly a 12×12 patch of the original input after a second pool. This lets later layers see the bigger picture.

There’s no rigid rule about where pools go. Some networks pool after every conv (early CNNs); some pool after every two or three convs (VGG); some skip pooling and use strided convolutions instead (ResNet).

The fully connected head

After the conv stack, the spatial feature representation has to become a class score. Two steps:

Flattening. The final W × H × D feature volume is reshaped into a single vector of length W · H · D. This loses spatial structure entirely (treating the whole feature volume as one long input vector to the FC layers).

Fully connected layers. One or more standard MLP-style layers, often with ReLU between them, ending in a layer with one neuron per class and softmax to produce probabilities.

The FC layers tend to dominate the parameter count in classical CNNs. In VGG16, the FC layers contain roughly 90% of the network’s 138M parameters even though they’re only 3 of 16 weight-bearing layers. This is the trade-off of the FC head: learnable but very parameter-heavy.

ASIDE — Replacing FC with global average pooling

Modern architectures (ResNet, EfficientNet) often use global average pooling before the final FC layer instead of flattening: average each channel of the final feature map across all spatial positions, producing a D-dimensional vector (one value per channel). Then a single FC layer projects to the class scores. This drastically cuts parameters and tends to generalise better. The basic CNN we study still uses flatten + FC for clarity; just know the alternative exists.
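A sketch of such a head in PyTorch, assuming a 7×7×512 final feature map and 1,000 classes:

```python
import torch
import torch.nn as nn

features = torch.randn(1, 512, 7, 7)   # final feature volume (VGG-like sizes)

gap_head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),   # average each channel over all 7x7 positions -> 1x1x512
    nn.Flatten(),              # 512-dim vector
    nn.Linear(512, 1000),      # single FC layer to class scores
)

print(gap_head(features).shape)                       # torch.Size([1, 1000])
print(sum(p.numel() for p in gap_head.parameters()))  # 513,000 vs ~103M for flatten + FC1
```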

Tensor size vs parameter count

Two different quantities people often confuse — and you’ll be asked about both in exams. They measure different things.

Quantity        | What it counts                                        | Where it lives
Tensor size     | How many activation values exist at a layer’s output  | In memory, recomputed every forward pass
Parameter count | How many learnable weights the layer has              | In the model file, fixed across all forward passes

Intuitive picture: tensor size is “how many numbers come out of this layer per image”; parameter count is “how many numbers does the network remember about this layer”. Tensor size scales with input size (a bigger image produces bigger feature maps); parameter count does not (the kernel is the same size whatever image you feed it).

Counting tensor size

Tensor size = W′ × H′ × D′ — width × height × depth of the output volume. Just multiply.

For a conv layer with input W × H × D, applying K filters of size F × F × D with stride S and padding P:

  W′ = (W − F + 2P)/S + 1,   H′ = (H − F + 2P)/S + 1,   D′ = K

so the tensor size is W′ × H′ × K. For pool layers there is no filter count: depth is preserved, only the spatial dimensions shrink.

Counting parameters

The intuition is “count the numbers inside the kernel-block, multiply by how many kernels”:

Layer type             | Parameters                | Intuition
Conv                   | (F × F × D + 1) × K       | One filter has F × F × D weights (shape of the local volume) plus 1 bias; K such filters.
Pool                   | 0                         | Pooling has no learnable weights — it’s a fixed function.
FC                     | (#inputs + 1) × #outputs  | Every input connects to every output (one weight per pair) plus one bias per output neuron.
Activation (ReLU etc.) | 0                         | Element-wise function, no weights.

Notice what’s not in any of the conv-layer formulas: W and H. A conv layer’s parameter count doesn’t depend on the input image’s spatial dimensions — that’s the weight-sharing payoff. Only the kernel size F, the input depth D, and the number of filters K matter.

For FC layers, in contrast, the parameter count does depend on input size — every input value connects to every output, so doubling the input doubles the parameters. This is why the FC head dominates parameter counts: by the time the conv stack flattens, the input vector is huge.

Worked example: a single conv layer

Take a conv layer with input 32×32×3, applying 16 filters of size 5×5 with stride 1 and padding 2.

Tensor size: W′ = (32 − 5 + 2·2)/1 + 1 = 32, so the output volume is 32×32×16 = 16,384 activations per image.

Parameter count: each filter has 5×5×3 + 1 = 76 parameters. With 16 filters: 76 × 16 = 1,216 parameters total.

Two very different numbers despite both being “about” this one layer. The activations grow with image size; the parameters don’t.
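The same two numbers drop out of an actual layer (PyTorch sketch of exactly this configuration):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 16, kernel_size=5, stride=1, padding=2)

out = conv(torch.randn(1, 3, 32, 32))
print(out.shape)                                  # torch.Size([1, 16, 32, 32]) -> 16,384 activations
print(sum(p.numel() for p in conv.parameters()))  # (5*5*3 + 1) * 16 = 1,216 parameters
```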

Worked example: a single FC layer

Take an FC layer mapping a 25,088-dimensional input vector to 4,096 output units (this is VGG16’s first FC layer, after flattening the final 7×7×512 feature volume).

Tensor size: just the output dimension — 4,096 activations per image.

Parameter count: (25,088 + 1) × 4,096 = 102,764,544 ≈ 103M parameters in this single layer.

Compare: VGG16’s thirteen conv layers together total ~15M parameters; this one FC layer alone is ~103M.
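The FC count, checked the same way (PyTorch sketch):

```python
import torch.nn as nn

fc1 = nn.Linear(25_088, 4_096)                   # VGG16's first FC layer: 7*7*512 -> 4096
print(sum(p.numel() for p in fc1.parameters()))  # (25,088 + 1) * 4,096 = 102,764,544
```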

TIP — Why FC layers explode parameter counts

Every FC neuron has its own private weight to every input value. There’s no sharing. So when the input vector is huge (as it is right after flattening a deep feature volume), the parameter count is essentially “input size × output size” — which scales linearly in both. Conv layers escape this by reusing the same kernel everywhere; FC layers cannot, because they have no spatial structure to share over. This is why modern architectures (ResNet etc.) use global average pooling before the FC head — it collapses the spatial dimensions so the FC layer’s input is tiny.

Two case studies

AlexNet (2012)

The breakthrough that put CNNs on the map. Won ImageNet 2012 by a huge margin (16.4% error vs the previous year’s 25.8%), kicking off the modern deep-learning era.

  • Input: 227×227×3 colour images.
  • 5 convolution layers with 96, 256, 384, 384, 256 filters (kernel sizes 11, 5, 3, 3, 3).
  • 3 max-pooling layers (after conv1, conv2, conv5).
  • 3 fully connected layers: 4096, 4096, 1000.
  • Total: ~60M parameters.

Two things AlexNet did that we’ve already absorbed: ReLU activations everywhere (instead of sigmoid/tanh) and training on GPUs.

VGG16 (2014)

Two years later, a much deeper and more uniform design. Won ImageNet 2014 with 7.3% error.

  • Input: 224×224×3.
  • 13 convolution layers, all 3×3 with padding 1.
  • 5 max-pooling layers (after every block of 2-3 convs), each halving spatial size.
  • Filter counts double per block: 64 → 128 → 256 → 512 → 512.
  • Spatial size halves per pool: 224 → 112 → 56 → 28 → 14 → 7.
  • 3 fully connected layers: 4096, 4096, 1000.
  • Total: 138M parameters (roughly 2× AlexNet despite being deeper).

VGG’s lesson: stacking small kernels deep is more parameter-efficient than using large kernels at any single layer. Two stacked 3×3 convolutions cover a 5×5 receptive field with fewer parameters than a single 5×5 convolution, and add an extra non-linearity in between.
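The parameter arithmetic behind that claim, checked for an illustrative channel count of 256 (kept constant through the layers):

```python
import torch.nn as nn

D = 256

one_5x5 = nn.Conv2d(D, D, kernel_size=5, padding=2)
two_3x3 = nn.Sequential(nn.Conv2d(D, D, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(D, D, 3, padding=1))

def count(m):
    return sum(p.numel() for p in m.parameters())

print(count(one_5x5))   # (5*5*256 + 1) * 256     = 1,638,656
print(count(two_3x3))   # 2 * (3*3*256 + 1) * 256 = 1,180,160 -- same 5x5 receptive field
```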

Walking through VGG16’s layer-by-layer parameter count

Layer             | Output shape  | Parameters
Input             | 224×224×3     | 0
conv1: 64 × 3×3   | 224×224×64    | (3×3×3 + 1) × 64 = 1,792
conv2: 64 × 3×3   | 224×224×64    | (3×3×64 + 1) × 64 = 36,928
pool: 2×2/2       | 112×112×64    | 0
conv3: 128 × 3×3  | 112×112×128   | (3×3×64 + 1) × 128 = 73,856
conv4: 128 × 3×3  | 112×112×128   | (3×3×128 + 1) × 128 = 147,584
pool: 2×2/2       | 56×56×128     | 0
… (the remaining conv blocks and pools follow the same pattern) …
pool (final)      | 7×7×512       | 0
FC1               | 4096          | (25,088 + 1) × 4,096 ≈ 103M
FC2               | 4096          | (4,096 + 1) × 4,096 ≈ 16.8M
FC3               | 1000          | (4,096 + 1) × 1,000 ≈ 4.1M

The FC1 layer alone has ~103M of the 138M total parameters — by far the biggest chunk. The 13 conv layers together account for only ~15M.

TIP — Where parameters live

Convolution layers do most of the work but have few parameters per layer (weight sharing). FC layers have many parameters per layer (no sharing). The “where do the parameters live” answer for almost any classical CNN: in the FC head. This is precisely why modern designs reduce or eliminate the FC head — it’s the highest-parameter, lowest-leverage part of the architecture.

ResNet (2016) — when stacking layers stops working

VGG showed that deeper helped, up to a point. Going much beyond ~20 layers, plain CNNs run into the degradation problem: training error gets worse as depth increases, despite the deeper network being strictly more expressive in theory. The fix is a small structural change — residual connections that add the input of each layer block back to its output. ResNet-34 (34 layers, ~21M parameters) matches or beats VGG19 (~143M parameters) on ImageNet, and the same architectural primitive scales to 50, 100, even 1000 layers. See residual-connection for the why.
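A minimal sketch of the residual idea in PyTorch (real ResNet blocks also include batch normalisation and a projection when shapes change; this shows only the same-shape case):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: output = F(x) + x, same-shape case only."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)    # the skip connection: add the block's input back

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)    # torch.Size([1, 64, 56, 56])
```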

Training a CNN

Same as any other neural network: forward pass to compute predictions and loss, backward pass to compute gradients, gradient descent to update weights. See backpropagation and gradient descent.

Two CNN-specific considerations:

  • Backprop through convolution. A weight is used at every spatial position (weight sharing), so its gradient is the sum of contributions from every position — exactly the multivariate chain rule from week 3, applied automatically. Implementation-wise, the gradient of a convolution turns out to be another convolution (with a flipped kernel and rearranged inputs), which is why GPU-optimised convolution kernels are central to CNN training.
  • Backprop through pooling. No parameters, but gradient routing matters. Max pooling routes the gradient only to the position that contained the max; average pooling spreads the gradient equally. See pooling.
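A tiny autograd experiment makes that routing visible (PyTorch sketch on a single 2×2 window; the tensor values are arbitrary):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[[[1., 2.], [3., 4.]]]], requires_grad=True)  # shape (1, 1, 2, 2)

# Max pooling routes the gradient only to the winning position (the 4).
F.max_pool2d(x, 2).sum().backward()
print(x.grad)   # tensor([[[[0., 0.], [0., 1.]]]])

x.grad = None
# Average pooling spreads the gradient equally over the window.
F.avg_pool2d(x, 2).sum().backward()
print(x.grad)   # tensor([[[[0.25, 0.25], [0.25, 0.25]]]])
```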

CNNs aren’t only for classification

The conv layers extract a feature representation; what you do with it depends on the task.

  • Classification: flatten + FC + softmax (the canonical pipeline).
  • Classification + localisation: two FC heads on the feature representation — one for class scores (softmax loss), one for box coordinates (L2 loss). Add the losses.

The same conv backbone produces a feature vector; one FC head predicts the class label (cat) using softmax + cross-entropy, another predicts the bounding box as a regression problem with L2 loss. Localisation is just regression dressed up — the loss is the squared distance to the ground-truth box.

  • Object detection: more complex (Faster R-CNN, YOLO), but the CNN backbone is the same. See shift-invariance-equivariance for why a backbone trained for classification transfers to detection.
  • Semantic segmentation: replace the FC head with more conv layers, often with upsampling, to produce a per-pixel output. Fully convolutional networks (FCNs) are the natural design; the U-Net is the canonical encoder-decoder elaboration with skip connections from encoder to decoder.

In an FCN, there are no FC layers at all — the input is convolved end-to-end into a score volume (one channel per class), and an argmax across channels gives a per-pixel class prediction. The cross-entropy loss is computed pixel-wise against a ground-truth segmentation mask. The same conv operation that classified whole images now classifies every pixel.
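A minimal fully convolutional sketch (the two-layer backbone and the 21-class count are illustrative; real FCNs and U-Nets downsample and then upsample):

```python
import torch
import torch.nn as nn

num_classes = 21
backbone = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
score_head = nn.Conv2d(64, num_classes, 1)     # 1x1 conv -> one score channel per class

scores = score_head(backbone(torch.randn(1, 3, 128, 128)))   # (1, 21, 128, 128)
pred = scores.argmax(dim=1)                                   # per-pixel class prediction
print(pred.shape)   # torch.Size([1, 128, 128])
```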

The CNN architecture is a feature extractor — once trained, it’s task-agnostic to a meaningful degree. This is why “CNN backbone” is a standard term: the same conv layers, with different heads, solve many vision tasks.

Active Recall