Data Augmentation
2026-02-13
This is part of a series on ML for generalists; you can find the start here.
We know the data is the important part. It can take a lot of work to compile an accurately labelled dataset. How do we make sure we get the most out of what we have?
We can create variations of our labelled images and, as long as we don't mess with their rotations, they'll still be correctly labelled.
JPEG Compression
An easy change we can make is to reduce the quality of the image. We'll convert it to a low-quality JPEG in memory, then load it back again.
import io

from PIL import Image

def jpeg_compress(image: Image.Image) -> Image.Image:
    buffer = io.BytesIO()
    image.save(buffer, format="JPEG", quality=10)
    buffer.seek(0)
    return Image.open(buffer).convert("L")
Here's what that looks like, before:

and after our jpeg_compress:

This is quite an aggressive drop in quality (quality=10), but our images don't have a lot of detail to begin with.
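If you want to try this on a single image, here's a minimal sketch; sample.png and sample-compressed.png are stand-in filenames, not files from the dataset:

from PIL import Image

# Stand-in filenames for illustration only.
original = Image.open("sample.png").convert("L")
degraded = jpeg_compress(original)
degraded.save("sample-compressed.png")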
Inverted Colours
Another easy variation is to flip our image colours with PIL's ImageOps.invert():

We can also combine both, for inverted and compressed:

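In code, both variations are plain function calls, so combining them is simple composition; a minimal sketch, reusing the stand-in sample.png from above:

from PIL import Image, ImageOps

original = Image.open("sample.png").convert("L")  # stand-in filename
inverted = ImageOps.invert(original)
inverted_and_compressed = ImageOps.invert(jpeg_compress(original))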
Augmented Dataset
Here's our updated dataset implementation:
import io
import json
import math
from pathlib import Path

import torch
from PIL import Image, ImageOps
from torch.utils.data import Dataset
from torchvision import transforms

def no_change(image: Image.Image) -> Image.Image:
    return image

def jpeg_compress(image: Image.Image) -> Image.Image:
    buffer = io.BytesIO()
    image.save(buffer, format="JPEG", quality=10)
    buffer.seek(0)
    return Image.open(buffer).convert("L")

def compress_and_invert(image: Image.Image) -> Image.Image:
    return ImageOps.invert(jpeg_compress(image))

class OrientationDataset(Dataset):
    def __init__(self, answersheet_path: Path, augment: bool):
        self.sample_path = answersheet_path.parent

        # Every image always gets an unmodified copy; the extra variants
        # are only added when augment=True (i.e. for the training set).
        augmentations = [no_change]
        if augment:
            augmentations.extend([
                jpeg_compress,
                ImageOps.invert,
                compress_and_invert,
            ])

        self.samples = []
        for sample in json.loads(answersheet_path.read_text()):
            for augmentation in augmentations:
                self.samples.append({
                    **sample,
                    "augmentation": augmentation,
                })

        self.to_tensor = transforms.ToTensor()

    def __getitem__(self, index: int) -> tuple[torch.Tensor, torch.Tensor]:
        sample = self.samples[index]
        filename = self.sample_path / sample["filename"]
        image = Image.open(filename).convert("L")
        # Apply this sample's augmentation on the fly.
        image = sample["augmentation"](image)
        # Encode the rotation label as a (sin, cos) target.
        radians = sample["degrees"] * math.pi / 180
        answer = torch.tensor([math.sin(radians), math.cos(radians)], dtype=torch.float32)
        return (self.to_tensor(image), answer)

    def __len__(self) -> int:
        return len(self.samples)
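Note that none of the augmented copies are written to disk: each labelled image simply appears once per augmentation in self.samples, and the chosen function runs on the fly inside __getitem__. Here's a minimal sketch of inspecting that, using the training answersheet path from just below:

dataset = OrientationDataset(Path("data-train/answersheet.json"), augment=True)

# With augmentation on, the first four entries are the same image file
# paired with each of the four augmentation functions.
for sample in dataset.samples[:4]:
    print(sample["filename"], sample["augmentation"].__name__)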
We also need to update where we load our train and test datasets; we don't need to augment our test set:
train_dataset = OrientationDataset(
    Path("data-train/answersheet.json"),
    augment=True,
)

test_dataset = OrientationDataset(
    Path("data-test/answersheet.json"),
    augment=False,
)
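A quick length check confirms that only the training set is expanded:

print(len(train_dataset))  # 800: our 200 training images x 4 augmentations
print(len(test_dataset))   # unchanged: one sample per test image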
You can grab this complete train.py here.
Training Run
Now our 200 training samples have become 800, as each image has 4 variants:
- unmodified original
- JPEG compression
- inverted colours
- JPEG compression with inverted colours
Let's see what that does for training. We'll keep the nn.Dropout(0.25) between our fully connected layers:
[Epoch 19/20] after 187.6 seconds:
loss: 0.0176
-- TEST --
within 5 degrees: 65.50%
within 10 degrees: 90.00%
within 20 degrees: 99.00%
-- TRAIN --
within 5 degrees: 94.12%
within 10 degrees: 100.00%
within 20 degrees: 100.00%
[Epoch 20/20] after 187.4 seconds:
loss: 0.0172
-- TEST --
within 5 degrees: 62.00%
within 10 degrees: 88.00%
within 20 degrees: 99.00%
-- TRAIN --
within 5 degrees: 99.75%
within 10 degrees: 100.00%
within 20 degrees: 100.00%
Still not as good as having a bigger dataset, but the gap between test and train accuracy is a little smaller than with dropout alone.
Using Our Model
We've trained our model; how do we use it?