Training Our Model
2026-02-09
This is part of a series on ML for generalists; you can find the start here.
This is the exciting bit: we're going to train our model using the dataset we generated earlier.
There's a lot of code, but don't be intimidated: about half of it is output so we can watch the training process a little more closely. This is pretty common when you're designing and building (and troubleshooting) models.
If you've made it this far, you've understood the most difficult concepts. This is a bit verbose, that's all.
Code
We'll keep everything in train.py for now. If imports scattered throughout the code are hurting your brain, you can download the complete train.py here.
from torch import optim
from torch.utils.data import DataLoader
import time
train_dataset = OrientationDataset(Path("data-train/answersheet.json"))
train_loader = DataLoader(
train_dataset,
batch_size=32,
shuffle=True,
num_workers=0,
)
test_dataset = OrientationDataset(Path("data-test/answersheet.json"))
test_loader = DataLoader(
test_dataset,
batch_size=32,
shuffle=False,
num_workers=0,
)
num_epochs = 10
learning_rate = 0.001
model = OrientationModel()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
criterion = nn.MSELoss()
def sincos_to_angles(sines: torch.Tensor, cosines: torch.Tensor) -> torch.Tensor:
return torch.atan2(sines, cosines) * 180 / math.pi
for epoch in range(num_epochs):
start_time = time.monotonic()
running_loss = 0.0
model.train()
for images, labels in train_loader:
optimizer.zero_grad()
outputs = model(images)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
running_loss += loss.item()
avg_loss = running_loss / len(train_loader)
model.eval()
within_5 = 0
within_10 = 0
within_20 = 0
total = 0
with torch.no_grad():
for images, labels in test_loader:
outputs = model(images)
predicted_angles = sincos_to_angles(outputs[:, 0], outputs[:, 1])
true_angles = sincos_to_angles(labels[:, 0], labels[:, 1])
angle_diff = torch.abs(predicted_angles - true_angles)
angle_diff = torch.min(angle_diff, 360 - angle_diff)
within_5 += (angle_diff < 5).sum().item()
within_10 += (angle_diff < 10).sum().item()
within_20 += (angle_diff < 20).sum().item()
total += labels.size(0)
seconds_taken = time.monotonic() - start_time
print(f"[Epoch {epoch + 1}/{num_epochs}] after {seconds_taken:.1f} seconds:")
print(f" loss: {avg_loss:.4f}")
print(f" within 5 degrees: {within_5 / total * 100:.2f}%")
print(f" within 10 degrees: {within_10 / total * 100:.2f}%")
print(f" within 20 degrees: {within_20 / total * 100:.2f}%")
print()
torch.save(model.state_dict(), "orientation-model.pth")
Run the code and see what happens.
Output
On my machine (an aging Intel MacBook Pro), my CPU fan spins up in about 30 seconds and the entire training run takes about 10 minutes.
Here's what my first minute of output looks like:
% python train.py
/Users/brian/src/whichway/.venv/lib/python3.12/site-packages/torch/nn/modules/lazy.py:181: UserWarning: Lazy modules are a new feature under heavy development so changes to the API or functionality can happen at any moment.
warnings.warn('Lazy modules are a new feature under heavy development '
[Epoch 1/10] after 56.8 seconds:
loss: 0.4883
within 5 degrees: 20.40%
within 10 degrees: 37.20%
within 20 degrees: 67.20%
The warning comes from our use of nn.LazyLinear; I'm pinned to earlier versions of PyTorch because of my CPU architecture.
It's Just Different, Not Scary
Even if you're a seasoned Python programmer, some of this code might look a bit odd. Don't worry, there are some conventions in PyTorch (and other data science libraries) that take a little time to get your head around, but they make sense and let you take full advantage of the hardware in your machine.
We're running everything on our CPU right now, but it will take about 4 lines of code to move this to your GPU.
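To give you a taste, here's a minimal sketch of roughly what that change looks like, assuming a CUDA-capable GPU (the device name varies by machine, and none of this is in our script yet):

# Create a device handle, falling back to the CPU if no GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = OrientationModel().to(device)

# ...and inside both the training and evaluation loops:
images, labels = images.to(device), labels.to(device)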
Let's step through each section of code, piece by piece.
DataLoaders
We load both our training and test datasets using the OrientationDataset class we built earlier. The DataLoader wraps a dataset and makes it consumable by the rest of PyTorch.
train_dataset = OrientationDataset(Path("data-train/answersheet.json"))
train_loader = DataLoader(
train_dataset,
batch_size=32,
shuffle=True,
num_workers=0,
)
test_dataset = OrientationDataset(Path("data-test/answersheet.json"))
test_loader = DataLoader(
test_dataset,
batch_size=32,
shuffle=False,
num_workers=0,
)
Why two datasets? We'll use train_dataset to actively train the model and we'll use test_dataset to evaluate how accurate our model is after each run through the training data.
When we're training a model, we might overfit. That's when the model has learned or memorised the training data, rather than learning a general solution to our problem. So the model gets amazingly high scores on our training data, but when we give it something new, it performs terribly.
We can make sure the model is generalising by asking it to predict the answers for our test dataset after each training run is finished. This gives us a more accurate indication of how well the model is learning.
What about the other parameters?
batch_size=32: We feed the model batches of 32 images at once, rather than one at a time. It's more computationally efficient and makes training more stable.
We're going to run 32 images through the model before we go back and adjust its weights. If we ran a single image at a time, we might make changes that are specific to that one image rather than generally applicable. A single outlier image might skew our weights in a way that doesn't make sense. Running 32 images through before making updates helps average out our adjustments (you can peek at one of these batches in the snippet below).
shuffle=True for training: We randomise the order of our samples every time we run through them. This stops the model from learning "patterns" in the data based on the order it arrives in, because we won't have patterns like that in our real-world data. We set it to False for our test dataset because shuffling doesn't matter there.
num_workers=0: PyTorch lets us spin up multiple workers for large datasets; I've picked zero to keep things simple here.
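Before we move on, it's worth peeking at a single batch to see these shapes in action. This is a throwaway snippet you can run yourself; the image dimensions shown are illustrative and depend on how your OrientationDataset produces images:

# Pull one batch from the loader and inspect its shapes
images, labels = next(iter(train_loader))
print(images.shape)  # e.g. torch.Size([32, 1, 64, 64]): 32 images per batch
print(labels.shape)  # torch.Size([32, 2]): one (sine, cosine) pair per image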
Epochs and Learning Rate
num_epochs = 10
learning_rate = 0.001
An epoch is one complete pass through the training data: 1 epoch shows the model each image in our training dataset once; 100 epochs would show each image 100 times.
Why do we need more than one epoch?
We make adjustments to the model each time we pass a batch of images through it. The more passes, the more it gets to learn.
Each time we make an update, it tends to be relatively small. This is governed by our learning rate, and it's why learning rates tend to be very small numbers. We want stable training, not big swings in the model's behaviour. More passes allow those gradual (hopefully) improvements to compound.
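To make that concrete, here's the basic shape of a single update in plain gradient descent (the numbers are made up for illustration; Adam layers per-parameter adaptation on top of this idea):

weight = 0.8      # a single model weight
gradient = 0.5    # how much this weight contributed to the error
learning_rate = 0.001
weight = weight - learning_rate * gradient  # a small nudge: 0.8 -> 0.7995

With a learning rate of 0.001, even a fairly large gradient only nudges each weight a little, which is why we need many passes.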
We do hit a limit though, where more epochs with the same dataset don't teach the model anything new. I'll show you how to spot that soon.
Optimiser and Loss
model = OrientationModel()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
criterion = nn.MSELoss()
We need an instance of our model to train, so we create one.
Here we set up the core training components:
optim.Adam is the optimisation algorithm we're using. It adjusts our model's weights to steer it closer to the correct answer. We configure Adam with all the trainable parameters in our model (this is figured out by PyTorch for us) and it will adapt the learning rate for each one individually.
Another popular optimiser you might see in example code is SGD. Adam builds on SGD and is a safe choice to start with. If you want to go deep, here's the full Adam paper.
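If you'd like to compare the two, swapping in SGD is a one-line change (momentum=0.9 is a common starting point, not a value from this series):

# Plain stochastic gradient descent with momentum, in place of Adam
optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9)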
nn.MSELoss is Mean Squared Error loss, the function that scores how good an answer from our model was, based on the correct answer (label) in our dataset.
If we used equals as a loss function (return 1.0 for correct or 0.0 for incorrect):
🤓
loss = float(model_prediction == sample_label)
...we wouldn't know how much to adjust our model. "How wrong?" is the question the loss function answers.
Why MSE? It penalises a model prediction based on how far it is from the correct answer. Bigger errors get penalised more because it squares the difference:
loss = (sample_label - model_prediction) ** 2
This means we fix the worst predictions first, rather than fine-tuning answers that are already mostly right.
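You can check this at the REPL with some made-up predictions; nn.MSELoss really is just the mean of the squared differences:

>>> preds = torch.tensor([[0.5, 0.1], [0.9, -0.2]])
>>> targets = torch.tensor([[0.45, 0.12], [1.0, -0.3]])
>>> nn.MSELoss()(preds, targets)
tensor(0.0057)
>>> ((targets - preds) ** 2).mean()
tensor(0.0057)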
Convert to Angle
Just a helper function for the trigonometry we covered earlier.
def sincos_to_angles(sines: torch.Tensor, cosines: torch.Tensor) -> torch.Tensor:
return torch.atan2(sines, cosines) * 180 / math.pi
Both sines and cosines are 1-dimensional tensors and we return a 1-dimensional tensor of angles.
We could loop through these with a for-loop or list comprehension, but using torch.atan2 is faster and won't need to change if we run this code on a GPU.
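A quick REPL check with some easy, made-up values shows it operating on whole tensors at once:

>>> sines = torch.tensor([0.0, 1.0, 0.0, -1.0])
>>> cosines = torch.tensor([1.0, 0.0, -1.0, 0.0])
>>> sincos_to_angles(sines, cosines)
tensor([  0.,  90., 180., -90.])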
Outer Training Loop
Our outer loop sets up two variables we'll use to calculate some training stats: start_time and running_loss. They're only used for output, so we can see how well our training is going; they're not part of the training process itself.
for epoch in range(num_epochs):
start_time = time.monotonic()
running_loss = 0.0
model.train()
The interesting line here is model.train(), which puts the model in training mode. Some layers in models behave differently during training compared to evaluation.
Our model doesn't have any layers like this (yet) but it's a good idea to include this line in your training code.
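For a peek at what a mode-sensitive layer looks like, here's nn.Dropout at the REPL (we don't use dropout in this model, and the zeroed positions are random, so your output will differ):

>>> drop = nn.Dropout(p=0.5)
>>> x = torch.ones(4)
>>> drop.train()(x)  # training mode: roughly half zeroed, the rest scaled up
tensor([2., 0., 2., 0.])
>>> drop.eval()(x)  # evaluation mode: values pass through unchanged
tensor([1., 1., 1., 1.])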
Inner Training Loop
This is where the training actually happens. The processes at work here are backpropagation and gradient descent. We'll step through both of those in more detail soon, but here's the high-level version.
for images, labels in train_loader:
optimizer.zero_grad()
outputs = model(images)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
running_loss += loss.item()
avg_loss = running_loss / len(train_loader)
We run through all the images and labels (correct answers) in our training dataset, in batches of 32 images at a time (see batch_size in the DataLoaders section above).
optimizer.zero_grad() resets our optimiser's training state from the previous batch. The grad refers to the gradients we'll cover shortly, and zero_grad zeroes them out.
outputs = model(images) runs our model over this batch of images and collects the answers as outputs.
loss = criterion(outputs, labels) runs our loss function (MSELoss explained above) on those outputs, to score how correct our model's answers were.
Both loss and outputs are tensors:
- outputs is a 2D tensor of answers for each image, e.g. [[0.45, 0.12], [-0.9, 0.4], ... [0.6, -0.3]], with a pair of sine, cosine predictions for each image in the batch
- loss is a scalar tensor; it has one value: the mean squared error across this batch
loss.backward() computes our gradients: how much each weight in our model contributed to any error.
Finally, optimizer.step() tells our optimiser to update our model's weights based on those gradients; that is, to apply changes based on our calculations from the last step.
To print some stats about our training run, we record the "wrongness" of our model for every batch and then calculate an average. If we see our loss (wrongness score) plateau, it means our model has stopped learning.
Evaluation Setup
After each complete run through the training dataset (epoch), we evaluate our model's performance. Similar to running_loss and start_time above, this isn't strictly necessary to train our model. It does give us an idea of how a time- (and power-) consuming process is going, though. We might be able to stop training early, or we might spot a problem before the full run is complete.
model.eval()
within_5 = 0
within_10 = 0
within_20 = 0
total = 0
with torch.no_grad():
We're using our test_loader dataset here, so we can check the model's performance on data it hasn't seen.
model.eval() switches to evaluation mode, the opposite of model.train().
torch.no_grad() tells PyTorch we don't need to keep track of gradients here; we're not going to perform any backpropagation. PyTorch does some behind-the-scenes magic to track every change to a tensor. This makes our training code much easier to write, but we can turn it off for evaluation.
You can see this for yourself at the Python REPL:
>>> import torch
>>> a = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
>>> b = a * 2
>>> b
tensor([2., 4., 6.], grad_fn=<MulBackward0>)
PyTorch has tracked how we created b from a by multiplication using the MulBackward0 function.
We can also create c from addition:
>>> c = a + 2
>>> c
tensor([3., 4., 5.], grad_fn=<AddBackward0>)
and we get AddBackward0.
If we use torch.no_grad(), that's disabled, saving memory and compute time:
>>> with torch.no_grad(): d = a * 4
...
>>> d
tensor([ 4., 8., 12.])
This isn't something we need to worry about, just a look behind the scenes at what PyTorch is doing for us.
Evaluation
for images, labels in test_loader:
outputs = model(images)
predicted_angles = sincos_to_angles(outputs[:, 0], outputs[:, 1])
true_angles = sincos_to_angles(labels[:, 0], labels[:, 1])
angle_diff = torch.abs(predicted_angles - true_angles)
angle_diff = torch.min(angle_diff, 360 - angle_diff)
within_5 += (angle_diff < 5).sum().item()
within_10 += (angle_diff < 10).sum().item()
within_20 += (angle_diff < 20).sum().item()
total += labels.size(0)
For each test batch we:
- Get answers (predictions) from the model
- Convert both predictions and labels (model's answers and the correct answers) from sine and cosine back to angles
- Calculate the absolute difference
- Handle wraparound: if the difference is greater than 180°, the shorter path is 360 - diff. For example, predicting 5° when the answer is 355° is only 10° off, not 350°
- Count how many predictions fall within 5°, 10°, and 20° of the true angle
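Here's the wraparound fix at the REPL, with a couple of made-up angles:

>>> predicted = torch.tensor([5.0, 170.0])
>>> true = torch.tensor([355.0, 190.0])
>>> diff = torch.abs(predicted - true)
>>> diff
tensor([350.,  20.])
>>> torch.min(diff, 360 - diff)
tensor([10., 20.])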
We could look at the overall loss here, the combined wrongness of our model's predictions, but that can be hard to interpret. Instead, we have three buckets of answers, so we can see gradual improvements within our model more clearly. For some applications, being within 10° is fine; for others, our model might need to beat 2°.
This Code Looks Funny
If you're not used to PyTorch or other Python data science libraries, this looks odd:
within_20 += (angle_diff < 20).sum().item()
Let's break it down.
Through the magic of operator overloading, angle_diff < 20 compares each element in the tensor to 20 and returns a tensor of booleans:
>>> angle_diff = torch.tensor([3.2, 25.1, 8.7, 42.0])
>>> angle_diff < 20
tensor([ True, False, True, False])
.sum() treats True as 1 and False as 0, giving us a count:
>>> (angle_diff < 20).sum()
tensor(2)
.item() extracts the value from a single-element tensor as a plain Python number:
>>> (angle_diff < 20).sum().item()
2
We need .item() because we're storing our result into a regular Python integer (within_20), not another tensor.
Explain This
total += labels.size(0)
labels is a 2-dimensional tensor like: [[0.45, 0.12], [-0.9, 0.4], ... [0.6, -0.3]]. A list of sine and cosine pairs, where each pair is a single correct answer.
labels.size(0) returns the size of the first dimension of our labels tensor, which is the number of answers it contains. In our case that's usually 32, because we're batching 32 images at a time in our test_loader DataLoader (the final batch can be smaller if the dataset size isn't a multiple of 32).
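At the REPL, with a tiny made-up labels tensor:

>>> labels = torch.tensor([[0.45, 0.12], [-0.9, 0.4], [0.6, -0.3]])
>>> labels.size()
torch.Size([3, 2])
>>> labels.size(0)
3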
Showing Progress
Nearly there! Our next block just prints the stats we've collected at the end of each epoch.
seconds_taken = time.monotonic() - start_time
print(f"[Epoch {epoch + 1}/{num_epochs}] after {seconds_taken:.1f} seconds:")
print(f" loss: {avg_loss:.4f}")
print(f" within 5 degrees: {within_5 / total * 100:.2f}%")
print(f" within 10 degrees: {within_10 / total * 100:.2f}%")
print(f" within 20 degrees: {within_20 / total * 100:.2f}%")
print()
If everything's going well, we should see loss decreasing and accuracy improving.
Save and Quit
Finally, we save the trained model weights to disk.
torch.save(model.state_dict(), "orientation-model.pth")
state_dict() gives us a dict of all the trainable parameters in our model. When we want to use our model to rotate images, we'll load the parameters from this file.
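For reference, loading those weights back into a fresh model looks roughly like this (a sketch of that later loading step):

# Recreate the model, then load the saved weights into it
model = OrientationModel()
model.load_state_dict(torch.load("orientation-model.pth"))
model.eval()  # switch to evaluation mode before making predictions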
Run and See
Run your training script if you haven't already. Experiment with some of the parameters: the number of epochs or the learning rate.
In the next post, I'll show you how to interpret the results.