Overfitting

2026-02-12

This is part of a series on ML for generalists; you can find the start here.

Our loss dropped, but our test accuracy didn't improve. What's going on?

Our model might be memorising the right answers for our training images rather than learning to predict answers from general features that would apply to any image. This is overfitting.

There are a few approaches we can take. But first, let's confirm our suspicion and check that overfitting really is the problem.

If overfitting is happening, our model is getting really good at predictions on our training images but less good with images outside of that dataset. We should see high accuracy scores for our training dataset compared to our test dataset.

So far, we've only output accuracy stats for our test dataset. Let's quickly hack in a pass over the training dataset too:

    model.eval()
    within_5 = 0
    within_10 = 0
    within_20 = 0
    total = 0
    train_within_5 = 0
    train_within_10 = 0
    train_within_20 = 0
    train_total = 0
    with torch.no_grad():
        # Pass over the test set, as before.
        for images, labels in test_loader:
            outputs = model(images)

            predicted_angles = sincos_to_angles(outputs[:, 0], outputs[:, 1])
            true_angles = sincos_to_angles(labels[:, 0], labels[:, 1])
            # Wrap around the circle: 359 degrees and 1 degree are only 2 degrees apart.
            angle_diff = torch.abs(predicted_angles - true_angles)
            angle_diff = torch.min(angle_diff, 360 - angle_diff)

            within_5 += (angle_diff < 5).sum().item()
            within_10 += (angle_diff < 10).sum().item()
            within_20 += (angle_diff < 20).sum().item()
            total += labels.size(0)

        # Same again, but over the training set.
        for images, labels in train_loader:
            outputs = model(images)

            predicted_angles = sincos_to_angles(outputs[:, 0], outputs[:, 1])
            true_angles = sincos_to_angles(labels[:, 0], labels[:, 1])
            angle_diff = torch.abs(predicted_angles - true_angles)
            angle_diff = torch.min(angle_diff, 360 - angle_diff)

            train_within_5 += (angle_diff < 5).sum().item()
            train_within_10 += (angle_diff < 10).sum().item()
            train_within_20 += (angle_diff < 20).sum().item()
            train_total += labels.size(0)

    seconds_taken = time.monotonic() - start_time
    print(f"[Epoch {epoch + 1}/{num_epochs}] after {seconds_taken:.1f} seconds:")
    print(f"              loss: {avg_loss:.4f}")
    print(f"        -- TEST --")
    print(f"  within 5 degrees: {within_5 / total * 100:.2f}%")
    print(f" within 10 degrees: {within_10 / total * 100:.2f}%")
    print(f" within 20 degrees: {within_20 / total * 100:.2f}%")
    print(f"        -- TRAIN --")
    print(f"  within 5 degrees: {train_within_5 / train_total * 100:.2f}%")
    print(f" within 10 degrees: {train_within_10 / train_total * 100:.2f}%")
    print(f" within 20 degrees: {train_within_20 / train_total * 100:.2f}%")
    print()
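The two loops are identical apart from which loader they read and which counters they bump. If we wanted to tidy the hack up, a small helper would do it; here's a sketch reusing the model and sincos_to_angles from above:

    def accuracy_stats(loader):
        # Count predictions within 5/10/20 degrees over one pass of a loader.
        counts = {5: 0, 10: 0, 20: 0}
        total = 0
        with torch.no_grad():
            for images, labels in loader:
                outputs = model(images)
                predicted_angles = sincos_to_angles(outputs[:, 0], outputs[:, 1])
                true_angles = sincos_to_angles(labels[:, 0], labels[:, 1])
                angle_diff = torch.abs(predicted_angles - true_angles)
                angle_diff = torch.min(angle_diff, 360 - angle_diff)
                for threshold in counts:
                    counts[threshold] += (angle_diff < threshold).sum().item()
                total += labels.size(0)
        return counts, total

Each pass then collapses to a call like counts, n = accuracy_stats(test_loader).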

Here are the last two epochs of a run with this extra output:

    [Epoch 19/20] after 59.2 seconds:
                  loss: 0.0015
            -- TEST --
      within 5 degrees: 54.00%
     within 10 degrees: 85.50%
     within 20 degrees: 98.00%
            -- TRAIN --
      within 5 degrees: 99.00%
     within 10 degrees: 100.00%
     within 20 degrees: 100.00%

    [Epoch 20/20] after 65.1 seconds:
                  loss: 0.0010
            -- TEST --
      within 5 degrees: 55.00%
     within 10 degrees: 86.00%
     within 20 degrees: 99.50%
            -- TRAIN --
      within 5 degrees: 100.00%
     within 10 degrees: 100.00%
     within 20 degrees: 100.00%

There's a big difference between the model's performance on the data it was trained on and data it hasn't seen before: within 5 degrees, training accuracy is at 100% while test accuracy is sitting at 55%.

That's overfitting.

One blunt fix would be to shrink the model, giving it less capacity to memorise the training set with. But instead of permanently removing capacity, we can use dropout during the training process to randomly "turn off" some neurons:

    self.fc = nn.Sequential(
        nn.LazyLinear(out_features=64),
        nn.ReLU(),
        nn.Dropout(0.25),
        nn.Linear(in_features=64, out_features=2),
    )

During training, dropout randomly sets some neuron outputs to zero. In this case, every neuron in our first fully connected layer has a 25% chance of being zeroed (nn.Dropout(0.25)). The model has to generalise and identify patterns, rather than relying on the outputs of any particular set of neurons.

It's a bit like taking good notes in a lecture. If you write down every word the instructor says, you'll have a perfect transcription but probably little understanding. Instead, writing more general notes helps you actually figure out what's going on.

nn.Dropout is enabled by our model.train() call and disabled by model.eval().
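We can see both behaviours in a quick standalone check (a minimal sketch, separate from our model):

    import torch
    import torch.nn as nn

    drop = nn.Dropout(0.25)
    x = torch.ones(8)

    # Training mode: each element has a 25% chance of being zeroed, and
    # survivors are scaled by 1 / (1 - 0.25) so the expected output is unchanged.
    drop.train()
    print(drop(x))  # e.g. tensor([1.3333, 0.0000, 1.3333, 1.3333, 1.3333, 0.0000, 1.3333, 1.3333])

    # Eval mode: dropout is a no-op.
    drop.eval()
    print(drop(x))  # tensor([1., 1., 1., 1., 1., 1., 1., 1.])

That automatic rescaling during training is why nothing needs to change at inference time.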

This gives:

    [Epoch 19/20] after 76.3 seconds:
                  loss: 0.0281
            -- TEST --
      within 5 degrees: 55.00%
     within 10 degrees: 88.50%
     within 20 degrees: 99.00%
            -- TRAIN --
      within 5 degrees: 75.50%
     within 10 degrees: 98.50%
     within 20 degrees: 100.00%

    [Epoch 20/20] after 63.0 seconds:
                  loss: 0.0231
            -- TEST --
      within 5 degrees: 61.50%
     within 10 degrees: 82.50%
     within 20 degrees: 99.00%
            -- TRAIN --
      within 5 degrees: 79.00%
     within 10 degrees: 98.00%
     within 20 degrees: 100.00%

Better. The gap between train and test accuracy is smaller, but still present: within 5 degrees, we've gone from 100% train versus 55% test to 79% versus 61.5%.

We've got a working approach to reduce overfitting, and there are other architectural changes we could try. But while dropout is reducing our overfitting and improving our model's performance, we know more training data makes this problem go away.

Let's look at ways of getting more from the data we have.