Backpropagation
2026-02-15
This is part of a series on ML for generalists; you can find the start here.
Backprop answers one question for every weight in your network: "how would the loss change if I tweaked this weight slightly?"
The answer is called the gradient. Once you have it, training is simple: nudge each weight in the direction that reduces the loss.
That's it! The rest is just working out the maths efficiently.
What Gets Updated
Only the weights and biases. Nothing else.
A Conv2d layer with a 3x3 kernel has 9 weights and 1 bias. Backprop works out a gradient for each of those 10 numbers. The optimiser updates them. Repeat until we're done.
The input data doesn't change. The architecture doesn't change. The activation functions don't change. Training is just adjusting those weights and biases.
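To see this concretely, you can count the trainable parameters PyTorch exposes for a Conv2d layer like the one above (a quick check, assuming PyTorch is installed):

```python
import torch.nn as nn

# One input channel, one output channel, a 3x3 kernel.
layer = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3)

# Everything backprop can update lives in .parameters(): 9 weights + 1 bias.
total = sum(p.numel() for p in layer.parameters())
print(total)  # 10
```

Nothing else about the layer (kernel size, stride, activation) appears in that list, because nothing else gets updated.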
The Chain Rule
Say you have two functions chained together:
z = f(x)
y = g(z)
You want to know: if I nudge x a tiny amount, how much does y change?
You can't skip straight from x to y because they're connected through z. But you can break the problem into two smaller questions:
- How much does z change when I nudge x?
- How much does y change when z changes?
Multiply those two answers together and you're done.
That's the chain rule.
To find the effect of some input on the final output, multiply the effects along each step of the chain.
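Here's the chain rule with concrete numbers, using made-up functions f(x) = x² and g(z) = 3z as a sketch. The two local answers multiply together, and a finite-difference nudge confirms the result:

```python
def f(x):          # z = f(x) = x squared
    return x * x

def g(z):          # y = g(z) = 3z
    return 3 * z

x = 2.0

# Step 1: how much does z change when I nudge x?  dz/dx = 2x = 4
# Step 2: how much does y change when z changes?  dy/dz = 3
# Chain rule: multiply the two local effects.
chain_rule_gradient = (2 * x) * 3   # 12.0

# Check it numerically: nudge x a tiny amount and watch y move.
eps = 1e-6
numerical_gradient = (g(f(x + eps)) - g(f(x))) / eps
print(chain_rule_gradient)           # 12.0
print(round(numerical_gradient, 3))  # 12.0
```

The multiply-the-local-effects rule is exactly what backprop applies at every layer.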
Neural networks are long chains of operations. The loss depends on layer 5's output, which depends on layer 4's output, which depends on layer 3's output, and so on back through the weights. Each link in the chain is a function, just like f and g above.
Backprop applies the chain rule repeatedly, passing gradient signals from layer to layer to get the gradient for parameters all through the network.
Each layer only needs to know two things:
- How much does my output change when my input changes?
- How much does the loss care about my output?
Multiply those together, pass the result towards the input side, and the next layer does the same. That's the whole algorithm.
In most diagrams, data flows left to right: input on the left, loss on the right. Backprop works in the opposite direction, from the loss (how wrong the answer was) back towards the input. You'll sometimes hear people say gradients flow "backwards" or "upstream." They all mean the same thing: from the loss side towards the input side.
Worked Example: One Conv2d Training Step
Let's work through an example with real numbers. Here's a complete training step for a tiny convolution layer.
Setup
- input is a 4x4 greyscale image
- our layer has a 2x2 kernel (4 weights), no padding, no bias
- output is 3x3 (the kernel fits in 9 positions)
- our target is all zeros (the "correct answer" we're training our model to give)
- our loss function is the sum of squared outputs (similar to the MSE loss function we used before, just a bit simpler)
Forward Pass
This should look familiar:

Our 4x4 input:
10 20 30 40
30 40 50 60
20 30 40 50
10 20 30 40
The kernel starts with random weights:
0.5 0.3
0.2 0.1
At position (0,0), the kernel covers the top-left 2x2 patch:
10 20
30 40
Output at (0,0) = 10×0.5 + 20×0.3 + 30×0.2 + 40×0.1 = 5 + 6 + 6 + 4 = 21
The kernel slides to all 9 positions, producing a 3x3 output:

21 32 43
34 45 56
23 34 45
Loss
Each training example comes with a target, the correct answer the model should produce. For this image, to keep the maths nice and easy, the target is all zeros. Any non-zero output contributes to the loss.
Loss = 21² + 32² + 43² + ... + 45² = 13,341
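The whole forward pass and loss fit in a few lines of plain Python, which you can use to check the numbers above:

```python
inp = [[10, 20, 30, 40],
       [30, 40, 50, 60],
       [20, 30, 40, 50],
       [10, 20, 30, 40]]
kernel = [[0.5, 0.3],
          [0.2, 0.1]]

# Slide the 2x2 kernel over all 9 positions of the 4x4 input.
out = [[sum(inp[i + a][j + b] * kernel[a][b]
            for a in range(2) for b in range(2))
        for j in range(3)]
       for i in range(3)]
print([[round(v) for v in row] for row in out])
# [[21, 32, 43], [34, 45, 56], [23, 34, 45]]

# Loss = sum of squared outputs.
loss = sum(v * v for row in out for v in row)
print(round(loss))  # 13341
```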
Backward Pass
Now we work backwards to figure out how each weight in the kernel contributed to that loss.
Step 1: Loss gradient with respect to outputs
For squared loss, the gradient at each output position is 2 × the output value. This comes from calculus: the derivative of x² is 2x. If you haven't seen that before, just trust me for now. Doubling each output gives a 3x3 grid of gradients:

42 64 86
68 90 112
46 68 90

Think of them as "blame scores." The output at position (1,2) was 56, the largest value, so it gets the highest blame score of 112. The output at position (0,0) was only 21, so it gets a lower blame score of 42. These scores tell backprop where to focus its attention.
Step 2: Gradient for each kernel weight
Our 2x2 kernel looks like:
0.5 0.3
0.2 0.1
Take the top-left kernel weight (0.5). During the forward pass, it was multiplied by the top-left pixel of every patch the kernel visited. That means this one weight touched all 9 output positions. To find its gradient, we need to add up its contribution to each one.

At position (0,0), the top-left pixel was 10 and the blame score was 42:
contribution = 10 × 42 = 420
At position (0,1), the top-left pixel was 20 and the blame score was 64:
contribution = 20 × 64 = 1,280
The total gradient is the sum across all 9 positions:
gradient = sum(pixel value × blame score for each position)
= 10×42 + 20×64 + 30×86
+ 30×68 + 40×90 + 50×112
+ 20×46 + 30×68 + 40×90
= 22,080
That's a big number. It's saying: "this weight had a large, positive effect on the loss. If you want the loss to go down, make this weight smaller."
Each of the 4 kernel weights gets its own gradient, computed the same way using whichever pixel position that weight multiplied during the forward pass.
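That sum can be written as a short loop. At output position (i, j), kernel weight (a, b) multiplied pixel (i + a, j + b), so one pass over the 9 positions yields all 4 gradients (plain Python, reusing the numbers above):

```python
inp = [[10, 20, 30, 40],
       [30, 40, 50, 60],
       [20, 30, 40, 50],
       [10, 20, 30, 40]]

# Blame scores from Step 1: 2 x each output value.
blame = [[42, 64, 86],
         [68, 90, 112],
         [46, 68, 90]]

# grad[a][b] accumulates pixel x blame over all 9 output positions.
grad = [[0, 0], [0, 0]]
for i in range(3):
    for j in range(3):
        for a in range(2):
            for b in range(2):
                grad[a][b] += inp[i + a][j + b] * blame[i][j]

print(grad)  # [[22080, 28740], [21180, 27840]]
```

The top-left entry is the 22,080 worked out by hand above; the other three are the gradients for the remaining kernel weights.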
Update using Learning Rate
With a learning rate of 0.00001:
new_weight = old_weight - learning_rate × gradient
new_weight = 0.5 - 0.00001 × 22,080 = 0.279
Why subtract? Because the gradient points in the direction that increases the loss. We want the loss to go down, so we step in the opposite direction. The learning rate controls how big that step is.
The weight decreased from 0.5 to 0.279. The large positive gradient meant "this weight is pushing outputs up, but you want them down." Some weights shift more dramatically. The bottom-right weight flipped sign entirely, going from 0.1 to -0.178.
All 4 weights update simultaneously. The next forward pass produces smaller outputs, and the loss drops from 13,341 to 130.
Repeat thousands of times with many images.
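Putting the pieces together, one full training step (forward, backward, update) can run in a loop. A sketch in plain Python; the loss should fall from 13,341 to roughly 130 after the first update, then keep shrinking:

```python
inp = [[10, 20, 30, 40],
       [30, 40, 50, 60],
       [20, 30, 40, 50],
       [10, 20, 30, 40]]
kernel = [[0.5, 0.3], [0.2, 0.1]]
lr = 0.00001
losses = []

for step in range(3):
    # Forward pass: slide the 2x2 kernel over all 9 positions.
    out = [[sum(inp[i + a][j + b] * kernel[a][b]
                for a in range(2) for b in range(2))
            for j in range(3)]
           for i in range(3)]
    losses.append(sum(v * v for row in out for v in row))

    # Backward pass: blame score is 2 x output; sum pixel x blame per weight.
    grad = [[0.0, 0.0], [0.0, 0.0]]
    for i in range(3):
        for j in range(3):
            for a in range(2):
                for b in range(2):
                    grad[a][b] += inp[i + a][j + b] * 2 * out[i][j]

    # Update: step each weight against its gradient.
    for a in range(2):
        for b in range(2):
            kernel[a][b] -= lr * grad[a][b]

# First entry is 13,341; the second is about 130; the rest keep falling.
print([round(l, 1) for l in losses])
```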
Why It Works
The gradient tells the optimiser which direction to go. Large gradient means the weight has a big effect on the loss. Small gradient means the weight barely matters.
Over many training steps, a kernel weight that consistently activates on edges in the input will produce large outputs when edges are present. If those large outputs help reduce the loss (because edges are a useful feature for the model), the gradients will nudge that weight to activate even more strongly on edges. The weight gets reinforced.
Meanwhile, a weight that responds to random noise will sometimes increase the loss and sometimes decrease it. The gradients will push it in contradictory directions across different training examples, and it'll tend to shrink towards zero. The weight gets suppressed.
This is how the network learns without anyone telling it what to look for. Early on, every kernel is random and detects nothing meaningful. The gradients create a feedback loop: weights that happen to pick up on useful patterns get strengthened, weights that don't get weakened. Over thousands of iterations, random initial weights evolve into meaningful feature detectors.
Nobody designs the features. The gradients shape them.
PyTorch Tracks Operations
When we create a tensor with requires_grad=True, PyTorch records every operation we perform on it.
Each operation creates a node in a computation graph. Each node records:
- What operation was done
- What the inputs were
- How to compute the gradient for that operation
When we call loss.backward(), PyTorch walks this graph in reverse. Each node applies its own backward function ("I was a multiply, here's how gradients flow through me"). The chain rule falls out from this.
This system is called autograd. If you build a custom layer using basic PyTorch operations like indexing, multiplication, and addition, autograd tracks it automatically. You get correct gradients for free, no manual calculus required!
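You can check the whole worked example against autograd. The snippet below (assuming PyTorch is installed) builds the same convolution, calls loss.backward(), and should reproduce the hand-computed gradients, e.g. 22,080 for the top-left weight:

```python
import torch
import torch.nn.functional as F

# Same 4x4 input and 2x2 kernel as the worked example.
inp = torch.tensor([[10., 20., 30., 40.],
                    [30., 40., 50., 60.],
                    [20., 30., 40., 50.],
                    [10., 20., 30., 40.]]).view(1, 1, 4, 4)
w = torch.tensor([[0.5, 0.3],
                  [0.2, 0.1]], requires_grad=True)

out = F.conv2d(inp, w.view(1, 1, 2, 2))  # forward pass: 3x3 output
loss = (out ** 2).sum()                  # sum-of-squared-outputs loss
loss.backward()                          # backward pass: autograd fills w.grad

print(round(loss.item()))  # 13341
print(w.grad)              # matches the hand-computed gradients:
                           # [[22080, 28740], [21180, 27840]]
```

No manual chain rule anywhere: autograd recorded the view, the convolution, the square, and the sum, then walked them in reverse.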
Backpropping Faster
The final post in this series is next, moving this from your CPU to your GPU.