Convolutional Neural Networks

2026-02-06

This is part of a series on ML for generalists; you can find the start here.

The core ideas behind Convolutional Neural Networks (CNNs) go back to the late 1980s.

Yann LeCun published the foundational paper in 1989. This is the same Yann LeCun who was Meta's chief AI scientist from 2013 to 2025, leaving because he believes LLMs are a dead end on the path towards "superintelligent" models.

Have fun, Yann.

There were some production uses, but it wasn't until 2012, when AlexNet won the ImageNet competition (by a huge margin), that CNNs came back around.

Don't feel bad if you haven't heard of them; this incarnation is newer than the music of One Direction.

The first layer in our model looks like:

nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, padding=1)

That kernel_size=3 parameter configures the layer to scan our entire image one 3x3 pixel grid at a time. It creates a 3x3 matrix of weights, multiplies those element-wise with each 3x3 pixel grid in our image, and sums the result into a single output value.

For example, here's a kernel that might detect horizontal edges:

[image: a horizontal edge detection kernel]
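
To make the multiply-and-sum concrete, here's a tiny sketch in plain Python of one kernel position, using an assumed set of edge-detecting weights (the exact values in the image above may differ):

# An assumed horizontal edge detection kernel: negative weights on top,
# positive weights on the bottom, zeros in the middle.
kernel = [
    [-1, -1, -1],
    [ 0,  0,  0],
    [ 1,  1,  1],
]

# A 3x3 patch of pixels straddling a horizontal edge: 0s above, 255s below.
patch = [
    [0,     0,   0],
    [0,     0,   0],
    [255, 255, 255],
]

# Element-wise multiply and sum: one output value for this kernel position.
output = sum(
    patch[row][col] * kernel[row][col]
    for row in range(3)
    for col in range(3)
)
print(output)  # 765: a strong response, because this patch contains an edge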

Previous versions of CNNs were actually hand-designed this way, where kernel values were set to identify particular features. Luckily, we don't have to do that. I'll cover how that particular magic works when we get to training.

Our kernel doesn't have to be 3x3, it could be 2x2 or 5x5 or whatever we'd like, but it's usually a square.

That kernel slides over the image and, at every position, produces an output.

[image: convolution step diagram, first kernel applied]

Then we move one position over:

[image: convolution step diagram, second kernel applied]

Repeating the process until we've covered the entire image:

[animation: convolution process]

If we look at the full output from sliding the kernel across our image, we can start to see the pattern in the data for ourselves:

[image: convolution steps completed]

The horizontal edge appears as two rows of 765. I've chosen very obvious numbers for our source image, 0 and 255, to make the contrast easy to see. But even with more realistic pixels, we'd still get that "edge" of higher values in the middle of our feature map.
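
If you want to reproduce that feature map yourself, here's a minimal PyTorch sketch. The 6x6 test image and the kernel values are assumptions based on the description above, not the exact ones from the diagrams:

import torch
import torch.nn.functional as F

# A small test image with a horizontal edge: top three rows are 0,
# bottom three rows are 255.
image = torch.zeros(1, 1, 6, 6)
image[:, :, 3:, :] = 255.0

# The same assumed horizontal edge detection kernel as before.
kernel = torch.tensor([[-1.0, -1.0, -1.0],
                       [ 0.0,  0.0,  0.0],
                       [ 1.0,  1.0,  1.0]]).reshape(1, 1, 3, 3)

# Slide the kernel over the whole image (no padding yet, that comes next).
feature_map = F.conv2d(image, kernel)
print(feature_map[0, 0])
# tensor([[  0.,   0.,   0.,   0.],
#         [765., 765., 765., 765.],
#         [765., 765., 765., 765.],
#         [  0.,   0.,   0.,   0.]])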

That's all it's doing, that's the core "convolution" mechanism.

Our first layer also has padding=1 -- what does this mean? Our kernel is a 3x3 matrix of weights, so to calculate the output value for each pixel it needs to "see" the full grid of 8 other pixels surrounding it. Without padding, we'll miss the edges of our image, because we don't have a full grid around those pixels.

padding=1 lets the kernel assume our image is surrounded by a border of 0 values. This snazzy animation should make it clearer:

[animation: convolution process with 1 pixel of padding]

That means we calculate an output for every pixel in our input: a one-to-one mapping. The output from our kernel will contain 230,400 values (480x480).

For a kernel size of 3, the convention is a padding of 1; for a kernel size of 5, a padding of 2. Generally: padding = kernel_size // 2
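
Here's a quick sketch of the effect padding has on the output shape, using our 480x480 image size:

import torch
import torch.nn as nn

image = torch.randn(1, 1, 480, 480)  # batch, channels, height, width

with_padding = nn.Conv2d(1, 32, kernel_size=3, padding=1)
without_padding = nn.Conv2d(1, 32, kernel_size=3, padding=0)

print(with_padding(image).shape)     # torch.Size([1, 32, 480, 480])
print(without_padding(image).shape)  # torch.Size([1, 32, 478, 478])

# The convention that keeps the output the same size as the input:
for kernel_size in (3, 5, 7):
    conv = nn.Conv2d(1, 32, kernel_size=kernel_size, padding=kernel_size // 2)
    print(conv(image).shape)  # torch.Size([1, 32, 480, 480]) every time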

in_channels=1 describes our image: it's grayscale, so it has a single channel. If we had an RGB colour image, we'd have in_channels=3

out_channels is the number of kernels we'll run over our image. So far, I've talked about running a single kernel over the image. In practice, we want each layer to detect multiple features.

In our case we've set out_channels=32, so we'll run 32 kernels over the entire image and produce 7,372,800 output values (480x480x32).
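
Both parameters show up in the layer's weight shape and its output shape; here's a quick sketch:

import torch
import torch.nn as nn

layer = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, padding=1)

# One 3x3 kernel for each (output channel, input channel) pair.
print(layer.weight.shape)  # torch.Size([32, 1, 3, 3])

image = torch.randn(1, 1, 480, 480)  # batch, channels, height, width
output = layer(image)
print(output.shape)    # torch.Size([1, 32, 480, 480])
print(output.numel())  # 7372800 (32 x 480 x 480)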

So how do these kernels learn to detect features? We don't hand-design them anymore; we use backpropagation and gradient descent to train our entire network. I'll cover how that works later, but here's the quick version:

Each kernel starts with random weights. During training, the network makes predictions, compares them to the correct answers, and calculates how wrong it was. This is called the loss.

Backpropagation works backwards through the network figuring out how much each weight contributed to the error. Then gradient descent nudges each weight a tiny bit in whichever direction reduces the loss. Scored too high? Nudge it down a little. Scored too low? Bump it up a bit.

Over thousands of iterations, kernels that happen to detect useful features get reinforced. A kernel that randomly started with values vaguely resembling an edge detector gets pushed further in that direction, because detecting edges helps the network make better predictions.
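
Here's roughly what a single one of those iterations looks like in PyTorch. It's a minimal sketch with a stand-in model, made-up data, and an assumed loss function, not the real training setup for our model:

import torch
import torch.nn as nn

# A minimal sketch of one training iteration. The stand-in model, the
# made-up data, and the mean-squared-error loss are assumptions for
# illustration only.
model = nn.Conv2d(1, 32, kernel_size=3, padding=1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

images = torch.randn(8, 1, 64, 64)    # a small fake batch of images
targets = torch.randn(8, 32, 64, 64)  # fake "correct answers"

predictions = model(images)
loss = loss_fn(predictions, targets)  # how wrong were we?

optimizer.zero_grad()
loss.backward()   # backpropagation: how much did each weight contribute?
optimizer.step()  # gradient descent: nudge each weight to reduce the loss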

Why don't all 32 kernels end up identical? Because they start with different random values.

Each kernel begins at a different starting point, so gradient descent pulls them toward different features. One kernel might specialise in horizontal edges, another in vertical edges, another in corners. The randomness at initialisation leads to differences in what each kernel learns to detect.

This is why we run multiple kernels per layer. We're betting that the random starting points, combined with gradient descent, will produce a diverse set of feature detectors. And it usually works!
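
You can see those different starting points by peeking at a freshly created, untrained layer:

import torch.nn as nn

# A fresh, untrained layer: each of the 32 kernels gets its own random
# 3x3 matrix of weights, so no two start from the same place.
layer = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, padding=1)
print(layer.weight[0, 0])  # the first kernel's starting weights
print(layer.weight[1, 0])  # the second kernel's: different random values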

Our initial model has 5 convolutional layers, with the output of one acting as the input for the next. You can see that the out_channels of each layer matches the in_channels of the next:

nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, padding=1),
nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, padding=1),
nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1),
nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, padding=1),
nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, padding=1),

🤓
Well, actually... There are other lines in our model that reduce the output size in practice; we'll cover them a little later.
Our 480x480 grayscale image goes in, and 128 (channels) x 480 (height) x 480 (width) values come out of the last layer. That's 29 million values, up from our original 230,400 pixels.
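
To check that arithmetic, here's a sketch that stacks just those five layers (none of the extra lines from the aside) and pushes a fake 480x480 image through:

import torch
import torch.nn as nn

# Just the five convolutional layers, without the other lines that
# reduce the output size in practice.
layers = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, padding=1),
    nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, padding=1),
    nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1),
    nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, padding=1),
    nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, padding=1),
)

image = torch.randn(1, 1, 480, 480)  # our 480x480 grayscale image
output = layers(image)
print(output.shape)    # torch.Size([1, 128, 480, 480])
print(output.numel())  # 29491200 (roughly 29 million values)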

Each layer is building on what the previous layer found. The first layer might find edges. The second layer, looking at those edge maps, might identify corners or curves. By the third or fourth layer, the network can start recognising things like horizons, walls, or the tops of heads -- features that hint at which way is up.

Five layers might be overkill for our task, but it's our starting point and working too well is my favourite kind of problem.

After training, our convolutional layers should have picked out features that hint at our image orientation. How do we get from those 29 million values to an answer? We need to get from "here are some features" to "rotate this image 73°".

That's where fully connected layers come in.