Loading Datasets for PyTorch
2026-02-04
This is part of a series on ML for generalists; you can find the start here.
We'll put everything in train.py for now, to keep things simple.
PyTorch expects our data to live inside a Dataset, a list-like container where it can access individual training samples. We'll wrap our answersheet.json and the images it points to in a custom OrientationDataset class.
It looks like a lot at first, but we'll work through it bit by bit.
# train.py
import json
import math
from pathlib import Path

import torch
from torch.utils.data import Dataset
from PIL import Image
from torchvision import transforms


class OrientationDataset(Dataset):
    def __init__(self, answersheet_path: Path):
        self.sample_path = answersheet_path.parent
        self.samples = json.loads(answersheet_path.read_text())
        self.to_tensor = transforms.ToTensor()

    def __getitem__(self, index: int) -> tuple[torch.Tensor, torch.Tensor]:
        sample = self.samples[index]
        filename = self.sample_path / sample["filename"]
        image = Image.open(filename).convert("L")
        radians = sample["degrees"] * math.pi / 180
        answer = torch.tensor([math.sin(radians), math.cos(radians)], dtype=torch.float32)
        return (self.to_tensor(image), answer)

    def __len__(self) -> int:
        return len(self.samples)
What's a Tensor?
A tensor is just a multi-dimensional array. That's it. The fancy name comes from mathematics, but for our purposes you can think of it as "numpy array, but GPU-friendly."
Why not regular arrays? Tensors are more efficient on GPUs and carry some extra machinery, like automatic gradient tracking, that makes the training step of a model easier, but we'll get to that later.
For now, let me convince you a tensor is a multi-dimensional array.
A single number is a 0-dimensional tensor (a scalar):
┌───┐
│ 1 │
└───┘
A list of numbers is a 1-dimensional tensor (a vector):
┌───┬───┬───┬───┬───┐
│ 1 │ 2 │ 3 │ 4 │ 5 │
└───┴───┴───┴───┴───┘
A grid of numbers is a 2-dimensional tensor (a matrix):
┌───┬───┬───┬───┬───┐
│ 1 │ 2 │ 3 │ 4 │ 5 │
├───┼───┼───┼───┼───┤
│ 6 │ 7 │ 8 │ 9 │ 10│
├───┼───┼───┼───┼───┤
│ 11│ 12│ 13│ 14│ 15│
└───┴───┴───┴───┴───┘
Stack a bunch of matrices and you've got a 3-dimensional tensor.
Our greyscale image is a 2D grid of pixel values, so it's a 2D tensor: height × width
A colour image has three channels (red, green, blue), so it's a 3D tensor: 3 × height × width
A batch of 32 colour images? That's a 4D tensor: 32 × 3 × height × width
PyTorch uses tensors for everything because they can be shipped to a GPU and processed in parallel. When we run an image through transforms.ToTensor(), it converts those pixel values into a tensor that PyTorch knows how to work with.
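If you want to see those dimensions for yourself, a quick REPL session does the trick; .ndim counts the dimensions and .shape tells you the size of each one:

>>> import torch
>>> torch.tensor(1).ndim                                      # a scalar: 0 dimensions
0
>>> torch.tensor([1, 2, 3, 4, 5]).shape                       # a vector: 1 dimension of size 5
torch.Size([5])
>>> torch.tensor([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]).shape   # a matrix: 2 dimensions
torch.Size([2, 5])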
Why math.pi?
This is trigonometry, and I promise it's the most maths in the whole tutorial. The model will predict the sine and cosine of the rotation angle rather than the angle itself.
Why not just the angle, why not predict 42°? Because angles wrap around. If we asked the model to predict a number between 0 and 360, it would think 2° and 359° are far apart when they're not: they're 3° apart:
Circles, eh?
Sine and cosine don't have this problem. They represent the angle as a position on the edge of a circle, so nearby angles always have nearby values.
The model learns more easily when similar inputs produce similar outputs. Maths folks would call these continuous values.
Yes, I had to remind myself how right-angle triangles work to write this.
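If you want to convince yourself, here's a throwaway REPL check (not part of train.py) showing that 2° and 359° encode to nearly the same point on the circle:

>>> import math
>>> def encode(degrees):
...     radians = degrees * math.pi / 180
...     return (math.sin(radians), math.cos(radians))
...
>>> [round(v, 4) for v in encode(2)]
[0.0349, 0.9994]
>>> [round(v, 4) for v in encode(359)]
[-0.0175, 0.9998]

The two encodings sit almost on top of each other, just like the angles do on the circle, even though 2 and 359 look far apart as raw numbers.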
The return (tensor, answer) Convention
This tuple format is what PyTorch expects from a Dataset: here's the data, here's the correct answer. In our case, both are tensors. The answer could also be an int for classification problems, but since we're predicting a smooth range of values (sin and cos can be anything from -1 to 1), we use a tensor.
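For contrast, here's a hypothetical sketch of what __getitem__ might look like in a classification dataset; the "label" field is made up and not part of our answersheet:

# hypothetical: a classification dataset returns a class index, not a tensor
def __getitem__(self, index: int) -> tuple[torch.Tensor, int]:
    sample = self.samples[index]
    image = Image.open(self.sample_path / sample["filename"]).convert("L")
    return (self.to_tensor(image), sample["label"])  # e.g. 0 for "cat", 1 for "dog"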
Import Code
I learn faster when I can play with these objects directly. The built-in Python debugger is fine, but I treat myself to a full REPL by adding:
import code; code.interact(local=locals())
Add it anywhere that looks interesting and run the script; when Python reaches that line, you'll drop into an interactive prompt with every local variable in scope.
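For example, here's the same __getitem__ from earlier with one extra line added. Every time a sample is fetched, you get a prompt with sample, image, and answer ready to poke at:

def __getitem__(self, index: int) -> tuple[torch.Tensor, torch.Tensor]:
    sample = self.samples[index]
    filename = self.sample_path / sample["filename"]
    image = Image.open(filename).convert("L")
    radians = sample["degrees"] * math.pi / 180
    answer = torch.tensor([math.sin(radians), math.cos(radians)], dtype=torch.float32)
    import code; code.interact(local=locals())  # REPL opens here, Ctrl-D to continue
    return (self.to_tensor(image), answer)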
What's In Our Data
Let's temporarily add these two lines to the bottom of train.py:
train_dataset = OrientationDataset(Path("data-train/answersheet.json"))
print(train_dataset[0])
Here's what I get:
(tensor([[[1., 1., 1., ..., 1., 1., 1.],
          [1., 1., 1., ..., 1., 1., 1.],
          [1., 1., 1., ..., 1., 1., 1.],
          ...,
          [1., 1., 1., ..., 1., 1., 1.],
          [1., 1., 1., ..., 1., 1., 1.],
          [1., 1., 1., ..., 1., 1., 1.]]]), tensor([-0.2756, 0.9613]))
Your values will be different because we have different training data, but the structure will be the same.
We have two tensor objects. The first is our image, represented as a 3-dimensional array. Two of those dimensions make sense (height × width), but where does the third come from? Convention and consistency: PyTorch keeps a colour dimension in the tensor so that the model always has a colour channel.
With our greyscale images we have 1 × height × width; if we switched to colour images we'd have 3 × height × width.
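You can check this shape directly. The 1 is the colour channel; the height and width will be whatever size your generator produced (I'm assuming 64 × 64 below, substitute yours):

>>> image, answer = train_dataset[0]
>>> image.shape
torch.Size([1, 64, 64])
>>> answer.shape
torch.Size([2])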
You want to train a model and here I am telling you PyTorch trivia. Well, this is the data our model will consume and transform as it passes through its internal layers: we won't get far if we don't understand its shape.
Don't feel intimidated by the strange-looking output though. It's different from regular Python lists, but you can still count the opening square brackets [[[ to see the 3 dimensions.
What's In Our Answer
The second tensor object is my answer: tensor([-0.2756, 0.9613])
From our code, we know this is the sine and cosine of the angle our image is rotated to. We've encoded our answer angle:
radians = sample["degrees"] * math.pi / 180
answer = torch.tensor([math.sin(radians), math.cos(radians)], dtype=torch.float32)
Don't worry about the trigonometry here, but this is the maths incantation to turn those sine and cosine values back to an angle:
>>> import math
>>> math.atan2(-0.2756, 0.9613) * 180 / math.pi
-15.99733772216544
>>> _ + 360
344.0026622778346
We get -15.997; adding 360 gives us the positive angle, which comes out to 344.
Looking at my answersheet.json we can see that's correct for my first sample:
{
  "filename": "sample-0000.png",
  "degrees": 344
}
Our Dataset implementation is working nicely.
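Since we'll eventually want to turn the model's sin/cos predictions back into angles too, it's worth wrapping that incantation in a small helper. This is just a sketch for later; train.py doesn't need it yet. The % 360 handles the negative-angle case for us:

def decode_angle(sin_value: float, cos_value: float) -> float:
    """Turn a (sin, cos) pair back into an angle in [0, 360) degrees."""
    return (math.atan2(sin_value, cos_value) * 180 / math.pi) % 360

Calling decode_angle(-0.2756, 0.9613) gives back roughly 344, matching the sample above.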
Everything Else
You've probably figured out what __len__ does. The convert("L") after the image load is where the greyscale conversion happens; colour isn't important for our model.
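As one last sanity check, __len__ is what lets len() work on the dataset; the number should match however many samples your generator created (1000 is just my batch, yours may differ):

>>> len(train_dataset)
1000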
Boring bit done. Let's make a model!