ReLU Activation

2026-02-08

This is part of a series on ML for generalists; you can find the start here.

ReLU, or Rectified Linear Unit, keeps the positive values from a layer's output and replaces the negative values with zeros. So if one of our convolutional layers had -0.5 somewhere in its output, the ReLU step would turn that into 0. If it output 2.5, that stays 2.5.
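If you'd like to see it in action, here's a tiny PyTorch sketch (purely illustrative, not part of our model):

import torch
import torch.nn as nn

relu = nn.ReLU()
x = torch.tensor([-0.5, 2.5, 0.0, -3.0])
print(relu(x))  # tensor([0.0000, 2.5000, 0.0000, 0.0000])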

It's a simple step but fundamental to making our model actually work. The diagrams on the ReLU Wikipedia page are so intimidating that I waited until now to link it, but the principle really is that simple.

Why does it matter? We need non-linearity for our model to learn.

If we didn't use ReLU, stacking multiple layers inside our model would be pointless. No matter how many layers we added, the result would be mathematically equivalent to a single layer. Our five nn.Conv2d layers would be mathematically equivalent to one nn.Conv2d (the same is true for our nn.Linear layers).

Say we have two rules:

1. Multiply by 4
2. Multiply by 3

If we apply both in sequence, we get:

1. Multiply by 12

You could always have had just one rule. There are no conditionals: any chain of multiplying and adding can be collapsed into a single multiply-and-add.

A neural network without ReLU is essentially doing the same thing:

1. Multiply by weights
2. Add the results

🤓
Well, actually... mathematically this isn't the complete definition of linearity (the bias term makes it affine), but it's called nn.Linear, not nn.Affine.
When you stack two of those layers, the maths works out so that you could always find a single set of weights that would get the same result in one step. Same for three layers, same for a thousand layers.
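You don't have to take my word for it. Here's a small sketch (the layer sizes are made up, nothing to do with our model) that collapses two nn.Linear layers into one and checks the outputs match:

import torch
import torch.nn as nn

torch.manual_seed(0)
layer1 = nn.Linear(4, 8)
layer2 = nn.Linear(8, 3)

# A single layer with weight W2 @ W1 and bias W2 @ b1 + b2 does the same job
combined = nn.Linear(4, 3)
with torch.no_grad():
    combined.weight.copy_(layer2.weight @ layer1.weight)
    combined.bias.copy_(layer2.weight @ layer1.bias + layer2.bias)

x = torch.randn(5, 4)
print(torch.allclose(layer2(layer1(x)), combined(x), atol=1e-6))  # True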

With ReLU in between we break the chain; we get to add a conditional:

if negative, make zero. 

Now the two layers either side of our ReLU can't be collapsed together.
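Continuing the sketch above, slot a ReLU between the two layers and the collapsed layer no longer gives the same answer:

# Same layers as before, but with ReLU in between
print(torch.allclose(layer2(torch.relu(layer1(x))), combined(x)))  # False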

In ML terms, ReLU is our activation function. The term comes from neuroscience. The neurons in our heads don't fire continuously; they collect input signals, and only when those signals go above a threshold do they fire, or activate. Those activations act as signals to downstream neurons.

ReLU kind of mimics this: if the sum of inputs is below zero, ReLU keeps it quiet. When it's above, ReLU activates and passes the signal along.

ReLU isn't the only activation function available, and it wasn't even the first, but it's computationally cheap and works well in practice.

So where does ReLU go? Generally, after any trainable layer.

("Layer" usually means something with trainable parameters, but I have seen "ReLU layer" in some documentation, along with "ReLU step" and, what I think is most precise, "ReLU activation".)
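In practice, "after any trainable layer" looks something like this for a convolutional block (the channel counts are just for illustration, not our actual model):

import torch.nn as nn

conv_block = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3),
    nn.ReLU(),  # activation after the first trainable layer
    nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3),
    nn.ReLU(),  # and after the second
)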

The only trainable layer you might want to skip ReLU on is the final or output layer:

self.fc = nn.Sequential(
    nn.LazyLinear(out_features=64),
    nn.ReLU(),  # activation after the hidden layer
    nn.Linear(in_features=64, out_features=2),  # output layer: no ReLU after this
)

In our model, the output of the final layer may be negative (sine or cosine) and we want to keep those values. We don't want to "rectify" negative cosine or sine to zero. That's why there's no final nn.ReLU() call.

ReLU is a conventional choice for our Convolutional Neural Network (CNN). There are alternatives though.

Leaky ReLU doesn't zero out negatives; it multiplies them by a small value, like 0.01, so -5 becomes -0.05. You might have neurons in your network that always output negative values regardless of input. Once that happens with ReLU, that neuron's output will always be zero and it won't learn. These are called "dead neurons."
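In PyTorch that's nn.LeakyReLU, where negative_slope is the small multiplier (it defaults to 0.01):

import torch
import torch.nn as nn

leaky = nn.LeakyReLU(negative_slope=0.01)
x = torch.tensor([-5.0, 2.5])
print(leaky(x))  # tensor([-0.0500,  2.5000])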

GELU has a smooth, gradual transition rather than a hard cut-off at zero. It's popular for transformer layers (like you'd find inside LLMs).
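Here's a quick look at that softer behaviour, using PyTorch's nn.GELU:

import torch
import torch.nn as nn

gelu = nn.GELU()
x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
print(gelu(x))
# Small negative inputs come out slightly negative rather than exactly zero,
# and large positive inputs pass through almost unchanged, like ReLU.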

We've covered a lot. Here's a quick recap of everything so far, and now it's training time!