Data Augmentation

2026-02-13

This is part of a series on ML for generalists; you can find the start here.

We know the data is the important part. It can take a lot of work to compile an accurately labelled dataset. How do we make sure we get the most out of what we have?

We can create variations of our labelled images and, as long as we don't mess with their rotations, they'll still be correctly labelled.

An easy change we can make is to reduce the quality of the image. We'll convert it to a low-quality JPEG in memory, then load it back in.

import io

from PIL import Image


def jpeg_compress(image: Image.Image) -> Image.Image:
    # Round-trip through an in-memory JPEG to bake in compression artefacts
    buffer = io.BytesIO()
    image.save(buffer, format="JPEG", quality=10)
    buffer.seek(0)
    return Image.open(buffer).convert("L")

Here's what that looks like, before:

[image: original sample]

and after our jpeg_compress:

[image: after jpeg_compress]

This is quite an aggressive drop in quality (quality=10), but our images don't have a lot of detail to begin with.

Another easy variation is to flip our image colours with PIL's ImageOps.invert():

[image: inverted sample]

We can also combine both, for inverted and compressed:

[image: inverted and compressed sample]
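
In code, both variants are one-liners on top of the jpeg_compress above; the combined version compresses first, then inverts, exactly as it appears in the dataset implementation below (the sample path here is hypothetical):

from PIL import Image, ImageOps

image = Image.open("sample.png").convert("L")  # hypothetical path
inverted = ImageOps.invert(image)  # flip black and white
compressed_and_inverted = ImageOps.invert(jpeg_compress(image))  # compress, then invert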

Here's our updated dataset implementation:

import io
import json
import math
from pathlib import Path

import torch
from PIL import Image, ImageOps
from torch.utils.data import Dataset
from torchvision import transforms


def no_change(image: Image.Image) -> Image.Image:
    return image


def jpeg_compress(image: Image.Image) -> Image.Image:
    # Round-trip through an in-memory JPEG to bake in compression artefacts
    buffer = io.BytesIO()
    image.save(buffer, format="JPEG", quality=10)
    buffer.seek(0)
    return Image.open(buffer).convert("L")


def compress_and_invert(image: Image.Image) -> Image.Image:
    return ImageOps.invert(jpeg_compress(image))


class OrientationDataset(Dataset):
    def __init__(self, answersheet_path: Path, augment: bool):
        self.sample_path = answersheet_path.parent

        augmentations = [no_change]
        if augment:
            augmentations.extend([
                jpeg_compress,
                ImageOps.invert,
                compress_and_invert,
            ])

        # Each labelled sample appears once per augmentation, so with
        # augment=True the dataset is four times its original size
        self.samples = []
        for sample in json.loads(answersheet_path.read_text()):
            for augmentation in augmentations:
                self.samples.append({
                    **sample,
                    "augmentation": augmentation,
                })
        self.to_tensor = transforms.ToTensor()

    def __getitem__(self, index: int) -> tuple[torch.Tensor, torch.Tensor]:
        sample = self.samples[index]
        filename = self.sample_path / sample["filename"]
        image = Image.open(filename).convert("L")
        image = sample["augmentation"](image)
        # Encode the target angle as (sin, cos) so it's continuous at 0/360
        radians = sample["degrees"] * math.pi / 180
        answer = torch.tensor([math.sin(radians), math.cos(radians)], dtype=torch.float32)
        return (self.to_tensor(image), answer)

    def __len__(self) -> int:
        return len(self.samples)

We also need to update where we load our train and test datasets; we don't need to augment our test set:

train_dataset = OrientationDataset(
    Path("data-train/answersheet.json"),
    augment=True
)

test_dataset = OrientationDataset(
    Path("data-test/answersheet.json"),
    augment=False
)
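
Nothing about the DataLoaders needs to change. For reference, a minimal sketch; the batch size here is a placeholder, keep whatever you were already using:

from torch.utils.data import DataLoader

# shuffle so each batch mixes originals and augmented variants
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64)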

You can grab this complete train.py here.

Now our 200 training samples have become 800, as each image has 4 variants.
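
A quick sanity check confirms the expansion:

print(len(train_dataset))  # 800 = 200 labelled images x 4 augmentation variants
print(len(test_dataset))   # unchanged, since augment=False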

Let's see what that does for training. We'll keep the nn.Dropout(0.25) between our fully connected layers.
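
As a reminder of where that sits, here's a rough sketch of the classifier head. The layer sizes are placeholders, not the exact model from earlier in the series, but the output is the (sin, cos) pair our dataset produces:

import torch.nn as nn

# Sketch only: sizes are assumed, the point is where the dropout goes
head = nn.Sequential(
    nn.Linear(512, 128),  # first fully connected layer (size assumed)
    nn.ReLU(),
    nn.Dropout(0.25),     # the dropout we're keeping between the FC layers
    nn.Linear(128, 2),    # outputs sin and cos of the rotation angle
)

With that in place, here are the last two epochs: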

[Epoch 19/20] after 187.6 seconds:
              loss: 0.0176
        -- TEST --
  within 5 degrees: 65.50%
 within 10 degrees: 90.00%
 within 20 degrees: 99.00%
        -- TRAIN --
  within 5 degrees: 94.12%
 within 10 degrees: 100.00%
 within 20 degrees: 100.00%

[Epoch 20/20] after 187.4 seconds:
              loss: 0.0172
        -- TEST --
  within 5 degrees: 62.00%
 within 10 degrees: 88.00%
 within 20 degrees: 99.00%
        -- TRAIN --
  within 5 degrees: 99.75%
 within 10 degrees: 100.00%
 within 20 degrees: 100.00%

Still not as good as having a bigger dataset, but the gap between train and test is a little smaller than with dropout alone.

We've trained our model; how do we use it?