Make a Model

2026-02-05

This is part of a series on ML for generalists; you can find the start here.

We're at the fun bit, where we decide on our model architecture and build out the layers.

The first surprise is just how little code this takes.

# train.py
import torch
import torch.nn as nn

class OrientationModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
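            # Five convolutional blocks: each detects patterns, then max pooling halves the spatial size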
            nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),

            nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),

            nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),

            nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),

            nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )

        self.flatten = nn.Flatten()
        
        self.fc = nn.Sequential(
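            # Two fully connected layers: combine the detected patterns into two outputs (sine and cosine of the angle)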
            nn.LazyLinear(out_features=64),
            nn.ReLU(),
            nn.Linear(in_features=64, out_features=2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.conv(x)
        x = self.flatten(x)
        x = self.fc(x)
        return x

It's OK if the code looks strange at first; it will make sense. Let's cover the important bits first.

There are 5 convolutional layers:

nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, padding=1),
nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, padding=1),
nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1),
nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, padding=1),
nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, padding=1),

and 2 fully connected layers:

nn.LazyLinear(out_features=64),
nn.Linear(in_features=64, out_features=2),

The short version:

The 2-dimensional convolutional layers identify patterns in 2D space by looking at small squares of the image (kernel_size=3 gives us a 3x3 pixel square). It's a bit like an artist squinting at a landscape to block out rough shapes and features.
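If you want to see what one of those blocks does to the data, here's a minimal sketch of a single conv + pool block on a fake image. The 128x128 input size is just an assumption for illustration; nothing in the model fixes it.

import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
)

x = torch.randn(1, 1, 128, 128)  # (batch, channels, height, width)
print(block(x).shape)            # torch.Size([1, 32, 64, 64])

padding=1 keeps the height and width the same through the convolution, and the max pool then halves them, so each block trades spatial resolution for more channels of detected patterns.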

The fully connected layers take all those detected patterns and combine them to predict our final answer: the sine and cosine of the angle (out_features=2).
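As a rough sketch of how those two outputs get used, assuming the same 128x128 grayscale input as above and treating the first output as sine and the second as cosine (that ordering is just a convention we'll fix when we train):

import torch

model = OrientationModel()          # untrained, so the numbers are meaningless
x = torch.randn(1, 1, 128, 128)     # one fake grayscale image
sin_cos = model(x)                  # shape: (1, 2)

# atan2(sin, cos) recovers the angle in radians from the pair of outputs
angle = torch.atan2(sin_cos[:, 0], sin_cos[:, 1])
print(torch.rad2deg(angle))

That first forward pass is also what fixes the LazyLinear layer's input size; until it has seen data, it doesn't know how many features come out of the convolutional stack.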

Even though our model is made up of both convolutional layers and fully connected layers, the convention is to call the entire model a Convolutional Neural Network (CNN).

In ML-speak, these are the hidden layers of our model: the layers that sit between the input and the output, hidden away inside the model (a bit like encapsulation).

If you look at your favourite model on HuggingFace, you'll see the number of hidden layers it has buried in the config.json.
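For example, using the transformers library (not something our project needs, and the exact field name varies between architectures):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("bert-base-uncased")
print(config.num_hidden_layers)  # 12 for this particular model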

So is this a good model? That depends on what "good" means. The typical tradeoff in machine learning is speed vs. accuracy, and where you land on that spectrum depends on the problem you're solving. A model that needs to respond in milliseconds will look different to one that can mull on a batch job overnight.

For now, I'm not worried about either. This is just our starting point. Over the next few posts I'll break down each component in the model individually, then we'll wire them together and start training.

Once we have end-to-end training and evaluation working, that's when the real experimentation begins. We can start changing the shape of the model and see if those changes help.
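One cheap thing to check while experimenting is how big each variant is. A minimal sketch, again assuming a 128x128 input (the LazyLinear layer needs one forward pass before it has any parameters to count):

import torch

model = OrientationModel()
model(torch.randn(1, 1, 128, 128))  # materialise the LazyLinear layer

total = sum(p.numel() for p in model.parameters())
print(f"{total:,} parameters")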

We're going to take a quick look at each of the concepts in our model: the convolutional layers and the fully connected layers that make up its neural network.

Then we'll get to the training bit, I promise.

First up: convolutional layers.