Overfitting

2026-02-12

This is part of a series on ML for generalists; you can find the start here.

Our loss dropped, but our test accuracy didn't improve. What's going on?

Our model might be memorising the right answers for our training images rather than learning to predict answers from general features that would apply to any image. This is overfitting.

There are a few approaches we can take. But first, let's confirm our suspicion and check that overfitting really is the problem.

If overfitting is happening, our model is getting really good at predictions on our training images but less good with images outside of that dataset. We should see high accuracy scores for our training dataset compared to our test dataset.

So far, we've only output accuracy stats for our test dataset. Let's quickly hack in a pass over the training dataset too:

    model.eval()
    within_5 = 0
    within_10 = 0
    within_20 = 0
    total = 0
    train_within_5 = 0
    train_within_10 = 0
    train_within_20 = 0
    train_total = 0
    with torch.no_grad():
        # Pass over the test set, as before.
        for images, labels in test_loader:
            outputs = model(images)

            predicted_angles = sincos_to_angles(outputs[:, 0], outputs[:, 1])
            true_angles = sincos_to_angles(labels[:, 0], labels[:, 1])
            # Wrap around the circle: 359 degrees and 1 degree are only 2 degrees apart.
            angle_diff = torch.abs(predicted_angles - true_angles)
            angle_diff = torch.min(angle_diff, 360 - angle_diff)

            within_5 += (angle_diff < 5).sum().item()
            within_10 += (angle_diff < 10).sum().item()
            within_20 += (angle_diff < 20).sum().item()
            total += labels.size(0)

        # Same again, but over the training set.
        for images, labels in train_loader:
            outputs = model(images)

            predicted_angles = sincos_to_angles(outputs[:, 0], outputs[:, 1])
            true_angles = sincos_to_angles(labels[:, 0], labels[:, 1])
            angle_diff = torch.abs(predicted_angles - true_angles)
            angle_diff = torch.min(angle_diff, 360 - angle_diff)

            train_within_5 += (angle_diff < 5).sum().item()
            train_within_10 += (angle_diff < 10).sum().item()
            train_within_20 += (angle_diff < 20).sum().item()
            train_total += labels.size(0)

    seconds_taken = time.monotonic() - start_time
    print(f"[Epoch {epoch + 1}/{num_epochs}] after {seconds_taken:.1f} seconds:")
    print(f"              loss: {avg_loss:.4f}")
    print(f"        -- TEST --")
    print(f"  within 5 degrees: {within_5 / total * 100:.2f}%")
    print(f" within 10 degrees: {within_10 / total * 100:.2f}%")
    print(f" within 20 degrees: {within_20 / total * 100:.2f}%")
    print(f"        -- TRAIN --")
    print(f"  within 5 degrees: {train_within_5 / train_total * 100:.2f}%")
    print(f" within 10 degrees: {train_within_10 / train_total * 100:.2f}%")
    print(f" within 20 degrees: {train_within_20 / train_total * 100:.2f}%")
    print()
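The two loops are identical apart from which loader they read and which counters they bump. If we wanted to tidy the hack up, a small helper would do it; here's a sketch reusing the model and sincos_to_angles from above:

    def accuracy_stats(loader):
        # Count predictions within 5/10/20 degrees over one pass of a loader.
        counts = {5: 0, 10: 0, 20: 0}
        total = 0
        with torch.no_grad():
            for images, labels in loader:
                outputs = model(images)
                predicted_angles = sincos_to_angles(outputs[:, 0], outputs[:, 1])
                true_angles = sincos_to_angles(labels[:, 0], labels[:, 1])
                angle_diff = torch.abs(predicted_angles - true_angles)
                angle_diff = torch.min(angle_diff, 360 - angle_diff)
                for threshold in counts:
                    counts[threshold] += (angle_diff < threshold).sum().item()
                total += labels.size(0)
        return counts, total

Each pass then collapses to a call like counts, n = accuracy_stats(test_loader).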

Here are the last two epochs of a run with this extra output:

    [Epoch 19/20] after 59.2 seconds:
                  loss: 0.0015
            -- TEST --
      within 5 degrees: 54.00%
     within 10 degrees: 85.50%
     within 20 degrees: 98.00%
            -- TRAIN --
      within 5 degrees: 99.00%
     within 10 degrees: 100.00%
     within 20 degrees: 100.00%

    [Epoch 20/20] after 65.1 seconds:
                  loss: 0.0010
            -- TEST --
      within 5 degrees: 55.00%
     within 10 degrees: 86.00%
     within 20 degrees: 99.50%
            -- TRAIN --
      within 5 degrees: 100.00%
     within 10 degrees: 100.00%
     within 20 degrees: 100.00%

There's a big difference between the model's performance on the data it was trained on and data it hasn't seen before: within 5 degrees, training accuracy is at 100% while test accuracy is sitting at 55%.

That's overfitting.

One blunt fix would be to shrink the model, giving it less capacity to memorise the training set with. But instead of permanently removing capacity, we can use dropout during the training process to randomly "turn off" some neurons:

    self.fc = nn.Sequential(
        nn.LazyLinear(out_features=64),
        nn.ReLU(),
        nn.Dropout(0.25),
        nn.Linear(in_features=64, out_features=2),
    )

During training, dropout randomly sets some neuron outputs to zero. In this case, every neuron in our first fully connected layer has a 25% chance of being zeroed (nn.Dropout(0.25)). The model has to generalise and identify patterns, rather than relying on the outputs of any particular set of neurons.

It's a bit like taking good notes in a lecture. If you write down every word the instructor says, you'll have a perfect transcription but probably little understanding. Instead, writing more general notes helps you actually figure out what's going on.

nn.Dropout is enabled by our model.train() call and disabled by model.eval().
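We can see both behaviours in a quick standalone check (a minimal sketch, separate from our model):

    import torch
    import torch.nn as nn

    drop = nn.Dropout(0.25)
    x = torch.ones(8)

    # Training mode: each element has a 25% chance of being zeroed, and
    # survivors are scaled by 1 / (1 - 0.25) so the expected output is unchanged.
    drop.train()
    print(drop(x))  # e.g. tensor([1.3333, 0.0000, 1.3333, 1.3333, 1.3333, 0.0000, 1.3333, 1.3333])

    # Eval mode: dropout is a no-op.
    drop.eval()
    print(drop(x))  # tensor([1., 1., 1., 1., 1., 1., 1., 1.])

That automatic rescaling during training is why nothing needs to change at inference time.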

This gives:

    [Epoch 19/20] after 76.3 seconds:
                  loss: 0.0281
            -- TEST --
      within 5 degrees: 55.00%
     within 10 degrees: 88.50%
     within 20 degrees: 99.00%
            -- TRAIN --
      within 5 degrees: 75.50%
     within 10 degrees: 98.50%
     within 20 degrees: 100.00%

    [Epoch 20/20] after 63.0 seconds:
                  loss: 0.0231
            -- TEST --
      within 5 degrees: 61.50%
     within 10 degrees: 82.50%
     within 20 degrees: 99.00%
            -- TRAIN --
      within 5 degrees: 79.00%
     within 10 degrees: 98.00%
     within 20 degrees: 100.00%

Better. The gap between train and test accuracy is smaller, but still present: within 5 degrees, we've gone from 100% train versus 55% test to 79% versus 61.5%.

We've got a working approach to reduce overfitting, and there are other architectural changes we could try. But while dropout is reducing our overfitting and improving our model's performance, we know more training data makes this problem go away.

Let's look at ways of getting more from the data we have.