Quick Recap before Training

2026-02-08

This is part of a series on ML for generalists; you can find the start here.

A quick recap of the components in our model.

import torch
import torch.nn as nn

class OrientationModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Feature extractor: five conv blocks, each halving height and width.
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),

            nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),

            nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),

            nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),

            nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )

        # Flatten (channels, height, width) into one vector per sample.
        self.flatten = nn.Flatten()
        
        # Classifier head: LazyLinear infers its in_features on the first forward pass.
        self.fc = nn.Sequential(
            nn.LazyLinear(out_features=64),
            nn.ReLU(),
            nn.Linear(in_features=64, out_features=2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.conv(x)
        x = self.flatten(x)
        x = self.fc(x)
        return x
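
A quick way to sanity-check the wiring is to push a dummy batch through the model. This is just a sketch -- I'm assuming 128x128 grayscale inputs here; substitute whatever size your dataset actually uses:

model = OrientationModel()
dummy = torch.randn(8, 1, 128, 128)  # batch of 8 single-channel 128x128 images
out = model(dummy)                   # the first call also initialises the LazyLinear
print(out.shape)                     # torch.Size([8, 2]) -- one score per class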

nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, padding=1)

Gives us 32 sliding 3x3 kernels, with 1 pixel of padding added at the edges of our image:

convolution process animation with 1 pixel of padding

Use Conv2d when you have data that can be represented in two dimensions (there are also Conv1d and Conv3d).
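
A minimal sketch of the shape-preserving effect of that padding (the 28x28 input size is just for illustration):

conv = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, padding=1)
x = torch.randn(1, 1, 28, 28)  # one single-channel 28x28 image
print(conv(x).shape)           # torch.Size([1, 32, 28, 28]) -- still 28x28, now 32 channels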

A layer can only learn new features from the output of the previous layer if there's non-linearity between them. nn.ReLU() provides that non-linearity by converting negative values to zeros.

Without it, all our layers effectively collapse into a single linear layer.

Use after every trainable layer, but probably not your output/last layer.
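
Here it is on a handful of made-up values:

relu = nn.ReLU()
t = torch.tensor([-2.0, -0.5, 0.0, 1.5])
print(relu(t))  # tensor([0.0000, 0.0000, 0.0000, 1.5000]) -- negatives become zero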

For our problem, it doesn't matter if a feature appears a pixel to the left or a pixel lower in our image. We want some "translation invariance" -- we really only care about the significant features.

nn.MaxPool2d(kernel_size=2, stride=2) slides a 2x2 window (kernel_size) across each out_channel of our convolution layer, stepping by 2 (stride) each time and outputs only the largest value in each window.

maxpool2d diagram

Use between convolutional layers to reduce data volume and generalise.
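
Here's a tiny worked example -- a 4x4 input pooled down to 2x2, keeping only the largest value in each 2x2 window (the numbers are arbitrary):

pool = nn.MaxPool2d(kernel_size=2, stride=2)
t = torch.tensor([[[[ 1.,  2.,  5.,  6.],
                    [ 3.,  4.,  7.,  8.],
                    [ 9., 10., 13., 14.],
                    [11., 12., 15., 16.]]]])
print(pool(t))
# tensor([[[[ 4.,  8.],
#           [12., 16.]]]])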

nn.Flatten() takes the 3-dimensional tensor from our convolution layers and flattens it into the 1-dimensional tensor that fully connected layers expect.

Use between your last convolutional or pooling (our MaxPool2d) layer and your first fully connected layer.
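
A sketch of the shape change (the 128x4x4 feature shape is illustrative, not necessarily what our network produces):

flatten = nn.Flatten()
x = torch.randn(8, 128, 4, 4)  # batch of 8: 128 channels of 4x4 features
print(flatten(x).shape)        # torch.Size([8, 2048]) -- 128 * 4 * 4 per sample

Note that nn.Flatten() keeps the batch dimension by default and only flattens the rest.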

Every input connects to every output in a fully connected layer. Each layer has the same number of perceptrons as it has outputs. Our first fully connected layer is an nn.LazyLinear -- it behaves like nn.Linear but infers its in_features from the first batch it sees, which saves us working out the flattened size by hand.

nn.Linear(in_features=64, out_features=2)

We have 64 inputs and 2 outputs in this layer. That's 2 perceptrons, each with 64 inputs (and 64 weights), so this layer of the network will have 2 outputs (1 for each perceptron).

Each perceptron also has a bias -- an extra parameter learned during training that lets a perceptron shift its output up or down.
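
We can see both sets of parameters directly on the layer:

fc = nn.Linear(in_features=64, out_features=2)
print(fc.weight.shape)  # torch.Size([2, 64]) -- 64 weights per perceptron
print(fc.bias.shape)    # torch.Size([2])     -- 1 bias per perceptron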

Here's a single perceptron with 3 inputs:

perceptron diagram
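
And the same perceptron as plain arithmetic -- a weighted sum plus a bias (the numbers are made up):

x = torch.tensor([0.5, -1.0, 2.0])  # 3 inputs
w = torch.tensor([0.1, 0.4, -0.2])  # 3 weights, 1 per input
b = 0.3                             # the bias

y = (x * w).sum() + b               # weighted sum, shifted by the bias
print(y)  # tensor(-0.4500)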

Use to make decisions based on features extracted by earlier layers. Typically at the end of your network.

We're ready to train our model.