Quick Recap before Training

2026-02-08

This is part of a series on ML for generalists; you can find the start here.

A quick recap of the components in our model.

import torch
import torch.nn as nn

class OrientationModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Feature extractor: five conv blocks, each halving height and width.
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),

            nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),

            nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),

            nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),

            nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )

        # Flatten (channels, height, width) into one vector per sample.
        self.flatten = nn.Flatten()
        
        # Classifier head: LazyLinear infers its in_features on the first forward pass.
        self.fc = nn.Sequential(
            nn.LazyLinear(out_features=64),
            nn.ReLU(),
            nn.Linear(in_features=64, out_features=2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.conv(x)
        x = self.flatten(x)
        x = self.fc(x)
        return x
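
A quick way to sanity-check the wiring is to push a dummy batch through the model. This is just a sketch -- I'm assuming 128x128 grayscale inputs here; substitute whatever size your dataset actually uses:

model = OrientationModel()
dummy = torch.randn(8, 1, 128, 128)  # batch of 8 single-channel 128x128 images
out = model(dummy)                   # the first call also initialises the LazyLinear
print(out.shape)                     # torch.Size([8, 2]) -- one score per class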

nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, padding=1)

Gives us 32 sliding 3x3 kernels, with 1 pixel of padding added at the edges of our image:

convolution process animation with 1 pixel of padding

Use Conv2d when you have data that can be represented in two dimensions (there are also Conv1d and Conv3d).
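
A minimal sketch of the shape-preserving effect of that padding (the 28x28 input size is just for illustration):

conv = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, padding=1)
x = torch.randn(1, 1, 28, 28)  # one single-channel 28x28 image
print(conv(x).shape)           # torch.Size([1, 32, 28, 28]) -- still 28x28, now 32 channels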

A layer can only learn new features from the output of the previous layer if there's non-linearity between them. nn.ReLU() provides that non-linearity by converting negative values to zeros.

Without it, all our layers effectively collapse into a single linear layer.

Use after every trainable layer, but probably not your output/last layer.
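
Here it is on a handful of made-up values:

relu = nn.ReLU()
t = torch.tensor([-2.0, -0.5, 0.0, 1.5])
print(relu(t))  # tensor([0.0000, 0.0000, 0.0000, 1.5000]) -- negatives become zero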

For our problem, it doesn't matter if a feature appears a pixel to the left or a pixel lower in our image. We want some "translation invariance" -- we really only care about the significant features.

nn.MaxPool2d(kernel_size=2, stride=2) slides a 2x2 window (kernel_size) across each out_channel of our convolution layer, stepping by 2 (stride) each time and outputs only the largest value in each window.

maxpool2d diagram

Use between convolutional layers to reduce data volume and generalise.
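
Here's a tiny worked example -- a 4x4 input pooled down to 2x2, keeping only the largest value in each 2x2 window (the numbers are arbitrary):

pool = nn.MaxPool2d(kernel_size=2, stride=2)
t = torch.tensor([[[[ 1.,  2.,  5.,  6.],
                    [ 3.,  4.,  7.,  8.],
                    [ 9., 10., 13., 14.],
                    [11., 12., 15., 16.]]]])
print(pool(t))
# tensor([[[[ 4.,  8.],
#           [12., 16.]]]])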

nn.Flatten() takes the 3-dimensional tensor from our convolution layers and flattens it into the 1-dimensional tensor that fully connected layers expect.

Use between your last convolutional or pooling (our MaxPool2d) layer and your first fully connected layer.
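
A sketch of the shape change (the 128x4x4 feature shape is illustrative, not necessarily what our network produces):

flatten = nn.Flatten()
x = torch.randn(8, 128, 4, 4)  # batch of 8: 128 channels of 4x4 features
print(flatten(x).shape)        # torch.Size([8, 2048]) -- 128 * 4 * 4 per sample

Note that nn.Flatten() keeps the batch dimension by default and only flattens the rest.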

Every input connects to every output in a fully connected layer. Each layer has the same number of perceptrons as it has outputs. Our first fully connected layer is an nn.LazyLinear -- it behaves like nn.Linear but infers its in_features from the first batch it sees, which saves us working out the flattened size by hand.

nn.Linear(in_features=64, out_features=2)

We have 64 inputs and 2 outputs in this layer. That's 2 perceptrons, each with 64 inputs (and 64 weights), so this layer of the network will have 2 outputs (1 for each perceptron).

Each perceptron also has a bias -- an extra parameter learned during training that lets a perceptron shift its output up or down.
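
We can see both sets of parameters directly on the layer:

fc = nn.Linear(in_features=64, out_features=2)
print(fc.weight.shape)  # torch.Size([2, 64]) -- 64 weights per perceptron
print(fc.bias.shape)    # torch.Size([2])     -- 1 bias per perceptron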

Here's a single perceptron with 3 inputs:

perceptron diagram
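
And the same perceptron as plain arithmetic -- a weighted sum plus a bias (the numbers are made up):

x = torch.tensor([0.5, -1.0, 2.0])  # 3 inputs
w = torch.tensor([0.1, 0.4, -0.2])  # 3 weights, 1 per input
b = 0.3                             # the bias

y = (x * w).sum() + b               # weighted sum, shifted by the bias
print(y)  # tensor(-0.4500)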

Use to make decisions based on features extracted by earlier layers. Typically at the end of your network.

We're ready to train our model.