An image is just a multi-dimensional array of numbers. The shape and the bit depth determine how much memory it occupies — and how badly an MLP scales when you try to feed it one.

Pixels are integers (mostly)

A digital image is a grid of pixels. Each pixel stores an intensity value as a fixed number of bits. The number of bits is called the bit depth, and it determines how many distinct intensities the pixel can represent.

| Bit depth | Bytes per pixel | Distinct values | Typical use |
|---|---|---|---|
| 8 bit | 1 | 256 | Standard photos, web images |
| 16 bit | 2 | 65,536 | Scientific imaging, microscopy |
| 32 bit float | 4 | (continuous) | Computations, HDR, scientific |

For 8-bit images, intensities are integers in [0, 255] (2^8 = 256 levels) — the canonical encoding for standard photographs. 16-bit gives 65,536 levels — much finer gradations, which matters in scientific imaging where small differences in intensity carry meaning. 32-bit floating point stores approximate real values using IEEE 754 — useful when computations push values outside the integer range or need fractional precision.
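The table can be checked directly with NumPy dtypes — a quick sketch, where each dtype's itemsize is the "bytes per pixel" column:

```python
import numpy as np

# itemsize gives bytes per pixel; iinfo gives the integer value range.
for dtype in (np.uint8, np.uint16, np.float32):
    if np.issubdtype(dtype, np.integer):
        info = np.iinfo(dtype)
        levels = info.max - info.min + 1
    else:
        levels = "continuous"
    print(dtype.__name__, np.dtype(dtype).itemsize, levels)

# uint8 1 256
# uint16 2 65536
# float32 4 continuous
```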

ASIDE — Why 8 bits is enough for photos

Human visual perception can only distinguish about 100–200 intensity levels in a single image (depending on lighting and contrast), so 256 levels comfortably exceeds what the eye notices. Scientific imaging uses 16-bit because instruments like microscope cameras can detect signal differences far smaller than the eye can — and you don’t want to throw that information away during digitisation.

Greyscale: a 2D matrix

A greyscale image of width W and height H is a 2D matrix of shape (H, W). Each entry is one pixel’s intensity. Memory cost = H × W × (bytes per pixel).

A 16-bit 512 × 512 greyscale image (a typical microscopy frame) is 512 × 512 × 2 = 524,288 bytes ≈ 512 KB.
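NumPy reports this memory cost directly via nbytes — a minimal check of the arithmetic above:

```python
import numpy as np

# A 512 x 512 frame stored as 16-bit unsigned integers (2 bytes per pixel).
frame = np.zeros((512, 512), dtype=np.uint16)

print(frame.shape)          # (512, 512)
print(frame.nbytes)         # 524288 = 512 * 512 * 2 bytes
print(frame.nbytes / 1024)  # 512.0 KB
```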

Colour: stack greyscale images

Colour images use multiple channels. The standard is RGB — three channels for red, green, blue — stacked into a 3D tensor of shape (H, W, 3). Each channel is essentially a greyscale image showing how much of that primary colour is at each position. Each channel-pixel is typically 8 bits, giving a “24-bit colour” image with 2^24 ≈ 16.7 million possible colours per pixel.

ASIDE — Why RGB? Trichromatic vision

The choice of three channels isn’t arbitrary — it matches the human eye, which has three types of cone cell sensitive to short (S, blue), medium (M, green), and long (L, red/yellow) wavelengths. Any visible colour can be approximated by mixing red, green, and blue light in the right proportions, which is why RGB monitors work. Other colour spaces exist (HSV, LAB, CMYK), but RGB is the default in computer graphics because it maps directly to display hardware.

A few useful RGB tuples:

| Colour | RGB | Notes |
|---|---|---|
| Black | (0, 0, 0) | No light |
| White | (255, 255, 255) | Maximum of every channel |
| Red | (255, 0, 0) | Pure red |
| Yellow | (255, 255, 0) | Red + green = yellow |
| Cyan | (0, 255, 255) | Green + blue = cyan |
| Grey | (128, 128, 128) | Equal mid-intensity in all channels |
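These tuples can be assembled into a tiny (H, W, 3) image — a sketch using NumPy, with the pixel layout chosen arbitrarily for illustration:

```python
import numpy as np

# A 2 x 3 RGB image built from the tuples in the table (channel-last layout).
colours = [
    (0, 0, 0),        # black
    (255, 255, 255),  # white
    (255, 0, 0),      # red
    (255, 255, 0),    # yellow
    (0, 255, 255),    # cyan
    (128, 128, 128),  # grey
]
img = np.array(colours, dtype=np.uint8).reshape(2, 3, 3)

print(img.shape)  # (2, 3, 3): height 2, width 3, 3 channels
print(img[0, 2])  # [255   0   0] -- the red pixel
```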

Why this matters for neural networks

The input to a neural network is the image flattened into a vector — or, in CNNs, the image kept as a 3D tensor. Either way, the number of input components is what matters for the first layer’s parameter count.

For a small 28 × 28 greyscale digit (MNIST), a fully connected first layer with even just 1000 hidden units already needs 784 × 1000 = 784,000 weights — large but tractable.

Now scale up:

| Image size | Channels | Inputs | FC layer to 1M hidden units |
|---|---|---|---|
| 28 × 28 | 1 | 784 | 784 million weights |
| 224 × 224 | 3 (RGB) | 150,528 | ≈ 150 billion weights |
| 1000 × 1000 | 3 | 3,000,000 | 3 trillion weights |

The trillion-weight numbers aren’t trainable — they exceed the memory of every modern accelerator and there isn’t enough labelled data on the planet to fit them. A fully connected MLP simply does not scale to image inputs.
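The weight counts are just products — a one-line sketch makes the blow-up easy to verify:

```python
# First-layer weight count for a fully connected layer: inputs x hidden units.
def fc_weights(height, width, channels, hidden):
    return height * width * channels * hidden

print(fc_weights(28, 28, 1, 1000))           # 784000 -- MNIST, tractable
print(fc_weights(28, 28, 1, 1_000_000))      # 784000000 -- 784 million
print(fc_weights(224, 224, 3, 1_000_000))    # 150528000000 -- ~150 billion
print(fc_weights(1000, 1000, 3, 1_000_000))  # 3000000000000 -- 3 trillion
```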

This is the bottleneck that motivates the entire CNN architecture: instead of one weight per (pixel, hidden unit) pair, share a small kernel of weights across every spatial location. See convolution for the operation, convolutional-neural-network for the resulting architecture.

Image tensors in code

In PyTorch, images are typically stored as tensors of shape (C, H, W) or (N, C, H, W) for batches:

  • N — batch size (number of images)
  • C — channels (1 for greyscale, 3 for RGB)
  • H — height
  • W — width

Note the channel-first convention. Some libraries (TensorFlow by default, and NumPy/PIL in most image contexts) use channel-last, (N, H, W, C). When porting code, check which convention each library expects.
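Converting between the two conventions is a single axis transpose — sketched here with NumPy (in PyTorch the equivalent is tensor.permute(1, 2, 0)):

```python
import numpy as np

# Channel-first (C, H, W) <-> channel-last (H, W, C) via a transpose.
chw = np.zeros((3, 480, 640), dtype=np.uint8)  # PyTorch-style layout

hwc = chw.transpose(1, 2, 0)   # -> (480, 640, 3), channel-last
back = hwc.transpose(2, 0, 1)  # -> (3, 480, 640), channel-first again

print(hwc.shape, back.shape)
```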

Pixel intensities for 8-bit images are integers in [0, 255] when loaded, but neural networks usually want floats normalised to [0, 1] (divide by 255) or standardised to mean 0, std 1 (subtract per-channel mean, divide by per-channel std). Normalisation is part of standard preprocessing.
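Both preprocessing steps are a few lines of NumPy — a sketch on a random fake image; note that in real pipelines the per-channel mean and std come from the whole training set, not from a single image:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(3, 8, 8), dtype=np.uint8)  # fake 8-bit RGB, (C, H, W)

# Scale to [0, 1]:
scaled = img.astype(np.float32) / 255.0

# Standardise each channel to mean 0, std 1 (per-image stats, for illustration):
mean = scaled.mean(axis=(1, 2), keepdims=True)
std = scaled.std(axis=(1, 2), keepdims=True)
standardised = (scaled - mean) / std

print(scaled.min(), scaled.max())      # both within [0.0, 1.0]
print(standardised.mean(axis=(1, 2)))  # ~[0, 0, 0]
```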

Related
  • convolution — the operation that lets networks process images without exploding parameter counts
  • convolutional-neural-network — the architecture built on top of convolution
  • multi-layer-perceptron — the architecture that cannot handle images at scale
  • perceptron — the underlying neuron, whose dot product gets reused at every spatial position in a convolution

Active Recall