TARGET DECK NeuralComputation::Week-04
Image representation
How is a colour (RGB) image represented as a tensor?
An RGB image is a tensor of shape $H \times W \times 3$ — three stacked matrices, one per channel (red, green, blue). Each channel-pixel is typically 8 bits, giving $256^3 \approx 16.7$ million possible colours per pixel.
Why is the bit depth of an image significant when feeding it to a neural network?
Bit depth determines the range of intensity values: 8-bit gives $[0, 255]$, 16-bit gives $[0, 65535]$. Large raw magnitudes propagate through the network and can cause activations to explode or sigmoids to saturate. Inputs are usually rescaled (e.g. divide by 255 to get $[0, 1]$, or Z-score normalise) before training.
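A minimal NumPy sketch of both rescaling options (the image shape here is illustrative):

```python
import numpy as np

# Illustrative 8-bit RGB image: shape (H, W, 3), values in [0, 255]
img = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)

# Option 1: rescale to [0, 1]
x = img.astype(np.float32) / 255.0

# Option 2: Z-score normalise per channel (zero mean, unit variance)
mean = x.mean(axis=(0, 1), keepdims=True)
std = x.std(axis=(0, 1), keepdims=True)
x_norm = (x - mean) / (std + 1e-8)   # epsilon guards against zero variance
```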
Why MLPs fail on images
Why is a fully connected MLP a bad architecture for images?
Two reasons:
- Parameter explosion. A $224 \times 224 \times 3$ image fed into a fully connected hidden layer of 4096 units needs $150528 \times 4096 \approx 6 \times 10^8$ weights in just the first layer — far beyond memory, compute, and statistical feasibility (see the sketch after this list).
- No translation structure. The MLP treats nearby pixels as independent and would have to learn every spatial location separately. A “7” in the corner is a completely different input from a “7” in the centre.
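A back-of-envelope check of the parameter-explosion bullet, using the $224 \times 224 \times 3$ image and 4096-unit hidden layer above:

```python
# First-layer weight count for a fully connected net on a colour image
inputs = 224 * 224 * 3         # flattened image: 150,528 values
hidden = 4096                  # hidden units
print(f"{inputs * hidden:,}")  # 616,562,688 weights, roughly 6e8
```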
Convolution
What is the convolution operation in a CNN?
Slide a small kernel across the image and at each position compute the dot product between the kernel and the local image patch. The output map records the dot product at every spatial location. The same kernel weights are reused at every position — this is weight sharing.
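A minimal NumPy sketch of this sliding dot product (single channel, stride 1, no padding; explicit loops for clarity, not speed — strictly this is cross-correlation, which is what CNN libraries actually compute):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide `kernel` over `image`, recording a dot product at each position."""
    H, W = image.shape
    K = kernel.shape[0]                         # assume a square kernel
    out = np.zeros((H - K + 1, W - K + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + K, j:j + K]     # local image patch
            out[i, j] = np.sum(patch * kernel)  # dot product = one output value
    return out
```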
Why can convolution be read as a similarity / feature-detection operation?
Convolution is a sliding dot product, and the dot product measures similarity between two vectors. A high value of the output map at position $(i, j)$ means the local image patch at $(i, j)$ looks like the kernel pattern. So the kernel acts as a learnable template for a feature (edge, corner, texture), and the output map records where in the image that feature appears.
Why do CNNs need so many fewer parameters than MLPs for images?
Weight sharing. A convolution layer uses the same kernel at every spatial position, so the parameter count scales with kernel size $\times$ depth $\times$ number of filters — not with image size. A $3 \times 3$ filter on a 3-channel input has $3 \times 3 \times 3 = 27$ weights regardless of whether the image is $28 \times 28$ or $224 \times 224$.
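A quick PyTorch illustration: the parameter count is fixed by the kernel, not the image (the input sizes are illustrative):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=3, bias=False)
print(sum(p.numel() for p in conv.parameters()))  # 27 = 3*3*3, always

# The same 27 weights process small and large images alike
small = conv(torch.randn(1, 3, 28, 28))    # output: (1, 1, 26, 26)
large = conv(torch.randn(1, 3, 224, 224))  # output: (1, 1, 222, 222)
```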
Output size, padding, stride
State the convolution output-size formula.
$O = \dfrac{W - K + 2P}{S} + 1$, where $W$ is the input width, $K$ is the kernel size, $P$ is the padding (zeros added at each border), and $S$ is the stride. The same formula applies to height. Output depth equals the number of filters, regardless of input depth.
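The formula as a small helper function, with sanity checks matching the cards below:

```python
def conv_output_size(W, K, P=0, S=1):
    """O = (W - K + 2P) / S + 1, raising if the result is not an integer."""
    num = W - K + 2 * P
    if num % S != 0:
        raise ValueError("K, P, S do not tile the input cleanly")
    return num // S + 1

print(conv_output_size(32, 5))        # 28: 5x5 kernel, no padding
print(conv_output_size(224, 3, P=1))  # 224: 'same' convolution, VGG style
```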
A $32 \times 32$ image is convolved with a $5 \times 5$ filter, stride 1, no padding. What size is the output?
$(32 - 5)/1 + 1 = 28$. Output is $28 \times 28$. Two pixels of border are lost on each side because the kernel cannot centre on boundary pixels.
For an odd kernel size $K$, what padding $P$ gives "same" convolution (output spatial size = input)?
$P = \dfrac{K - 1}{2}$. So $3 \times 3$ needs $P = 1$, and $5 \times 5$ needs $P = 2$. This is why VGG uses $3 \times 3$ kernels with $P = 1$ everywhere.
What goes wrong if the convolution output-size formula yields a non-integer?
The chosen $W, K, P, S$ combination does not tile the image cleanly — the kernel runs off the edge with leftover pixels that have no place to go. You either lose information at the borders or have to handle it specially. Always pick $K$, $P$, and $S$ so the formula yields an integer.
Convolution layer
Why must a convolution filter's depth match the input's depth?
The filter performs a dot product across the entire local volume — both spatial and depth dimensions. If the input has $D$ channels, each spatial position holds a $D$-dimensional vector, so a $K \times K$ filter needs $K \times K \times D$ weights, one per (row offset, column offset, channel) triple. The spatial size $K$ is a design choice; the depth $D$ is fixed by the input.
VGG16's first conv layer takes a $224 \times 224 \times 3$ input and applies 64 filters of size $3 \times 3 \times 3$. How many learnable parameters?
Each filter has $3 \times 3 \times 3 = 27$ weights plus 1 bias $= 28$ parameters. With 64 filters: $28 \times 64 = 1792$. Compare to a fully connected layer mapping the same flattened input to a 4096-unit hidden layer, which would need $150528 \times 4096 \approx 6 \times 10^8$ weights — five orders of magnitude more. That gap is the weight-sharing payoff.
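The same count, verified with PyTorch's standard Conv2d layer (assuming VGG's usual padding of 1):

```python
import torch.nn as nn

conv1 = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
print(sum(p.numel() for p in conv1.parameters()))  # 1792 = 64 * (27 + 1)
```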
Activations
What is ReLU, and why does it dominate CNN hidden layers?
ReLU is $f(x) = \max(0, x)$. It dominates because it is non-saturating for $x > 0$ (gradient is 1, not vanishing), computationally trivial, and converges much faster than sigmoid (~6× in some benchmarks). Its only weakness is “dying ReLU” — neurons that always output 0 stop learning — which leaky ReLU avoids by keeping a small slope on the negative side.
Compare sigmoid, tanh, ReLU, and leaky ReLU in one table.
| Activation | Range | Saturates? |
| --- | --- | --- |
| Sigmoid | $(0, 1)$ | Both ends |
| Tanh | $(-1, 1)$ | Both ends |
| ReLU | $[0, \infty)$ | Only at 0 (can “die”) |
| Leaky ReLU | $(-\infty, \infty)$ | Never |
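Minimal NumPy sketches of the four activations (the leaky slope $\alpha = 0.01$ is a common default, not mandated by the table):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))       # range (0, 1), saturates at both ends

def tanh(x):
    return np.tanh(x)                     # range (-1, 1), saturates at both ends

def relu(x):
    return np.maximum(0.0, x)             # range [0, inf), zero gradient for x < 0

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)  # small negative slope, never saturates
```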
Pooling
What is max pooling, and what is its default configuration?
Slide an $F \times F$ window over the input and emit the maximum value within the window at each position. Default: $F = 2$, stride $S = 2$ — halves spatial dimensions in both axes. Pooling has no learnable parameters and operates independently per channel, so depth is unchanged.
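A loop-based NumPy sketch of max pooling with the default configuration:

```python
import numpy as np

def max_pool(x, F=2, S=2):
    """Slide an F x F window with stride S, keeping the max in each window."""
    H, W = x.shape
    out = np.zeros(((H - F) // S + 1, (W - F) // S + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = x[i * S:i * S + F, j * S:j * S + F].max()
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool(x))  # 4x4 -> 2x2: [[ 5.  7.] [13. 15.]]
```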
What two roles does pooling play in a CNN?
- Compute / memory savings. Smaller spatial dimensions mean fewer ops in later layers and far fewer parameters in any subsequent FC layers.
- Shift invariance. Small input translations leave the pooled output approximately unchanged, since the max over a window is robust to a feature shifting by a pixel or two.
Shift equivariance vs invariance
What is the difference between shift equivariance and shift invariance, and which CNN component provides each?
- Shift equivariance (convolution): if the input shifts by $(\Delta x, \Delta y)$, the output of a convolutional layer shifts by $(\Delta x, \Delta y)$ too — the features are detected the same way, just at moved positions.
- Shift invariance (pooling + FC): even when a feature appears at a different location, the final classification stays the same.
Convolution tracks where a feature is; pooling and FC eventually disregard where and ask only what.
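A small NumPy/SciPy demonstration of both properties (the feature position and shift amount are illustrative):

```python
import numpy as np
from scipy.signal import correlate2d  # the sliding dot product used by CNNs

img = np.zeros((8, 8))
img[3, 3] = 1.0                       # a single "feature"
kernel = np.random.randn(3, 3)

out = correlate2d(img, kernel, mode="valid")
out_shifted = correlate2d(np.roll(img, 1, axis=1), kernel, mode="valid")

# Equivariance: shifting the input shifts the feature map by the same amount
print(np.allclose(np.roll(out, 1, axis=1), out_shifted))  # True

# Invariance: a max over the map reports *what* fired, not *where*
print(np.isclose(out.max(), out_shifted.max()))           # True
```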
CNN architecture and parameter count intuition
Why do most VGG / modern CNN parameters live in the fully connected layers, not the convolutional ones?
Convolution layers share weights across spatial positions, so each layer only stores kernel weights. The first FC layer must connect every element of the final feature map to every hidden unit — for VGG16 that is $7 \times 7 \times 512 = 25088$ inputs into 4096 units, $\approx 10^8$ weights in one layer. Small $3 \times 3$ kernels in the conv stack do most of the work; the FC layers carry most of the parameters.
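A quick check with torchvision's VGG16 (architecture only; in torchvision's layout, `features` holds the conv stack and `classifier` the FC layers):

```python
from torchvision.models import vgg16

m = vgg16(weights=None)  # untrained: we only need the architecture
conv = sum(p.numel() for p in m.features.parameters())
fc = sum(p.numel() for p in m.classifier.parameters())
print(f"conv stack: {conv:,}")  # 14,714,688  (~15M)
print(f"FC layers:  {fc:,}")    # 123,642,856 (~124M, the bulk of ~138M total)
```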
Hierarchically, what kinds of features do early, middle, and final conv layers tend to learn?
- Early: low-level — edges, corners, simple textures.
- Middle: mid-level — eyes, wheels, leaves (compositions of edges).
- Final: high-level — faces, cars, whole objects (compositions of parts).
This hierarchy emerges naturally from training and matches what neuroscience reports about the visual cortex.