Generate Test and Training Data

2026-02-03

This is part of a series on ML for generalists; you can find the start here.

We need images to train our model, so let's create some synthetic data. We'll need images rotated by varying amounts, and we'll also need our ground truth, the correct answer: how many degrees we rotated each image by.

Without the ground truth, we can't tell our model how wrong its predictions are, so we can't train it and improve its answers.
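To make "how wrong" concrete, here is a minimal sketch of one way to score a predicted rotation against the ground truth. The `angular_error` function is a hypothetical helper for illustration, not code from this series, and the training code may well measure error differently. The only point is that angles wrap around: a prediction of 350° for a true rotation of 10° is off by 20°, not 340°.

```python
def angular_error(predicted_degrees, true_degrees):
    """Smallest angle, in degrees, between a prediction and the ground truth."""
    difference = abs(predicted_degrees - true_degrees) % 360
    # Going the other way around the circle may be shorter.
    return min(difference, 360 - difference)


print(angular_error(350, 10))  # 20
print(angular_error(85, 90))   # 5
```

Without the true rotation for each image there would be nothing to pass as the second argument, which is why the generator has to record it.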

You can let your eyes glaze over for this code. It's not important; it just gets us a dataset we can work with. The script:

- Creates an output directory
- Generates the requested number of images, where each image:
  - is a 480x480 square
  - contains random lines of text, which may be in different sizes
  - is rotated clockwise by a random whole number of degrees from 0 to 359
- Writes an answersheet.json with the ground truth (correct answer) rotation for each image

The text comes from the WORDS constant. Change it if you want different text in your images.

# generate.py

import argparse
import json
import random
from pathlib import Path
from PIL import Image, ImageDraw, ImageFont


WORDS = """
     Call me Ishmael.  Some years ago--never mind how
long precisely --having little or no money in my purse, and nothing particular
to interest me on shore, I thought I would sail about a little and see the
watery part of the world.  It is a way I have of driving off the spleen, and
regulating the circulation.  Whenever I find myself growing grim about the
mouth; whenever it is a damp, drizzly November in my soul; whenever I find
myself involuntarily pausing before coffin warehouses, and bringing up the
rear of every funeral I meet; and especially whenever my hypos get such an
upper hand of me, that it requires a strong moral principle to prevent me
from deliberately stepping into the street, and methodically knocking
people's hats off--then, I account it high time to get to sea as soon as I can.
""".strip().split()


def generate_random_line():
    word_count = random.randint(3, 12)
    return " ".join(random.choice(WORDS) for _ in range(word_count))


def generate_image(index, output_dir):
    degrees = random.randint(0, 359)

    img = Image.new("L", (480, 480), color=255)
    draw = ImageDraw.Draw(img)

    line_count = random.randint(2, 8)
    margin_top = random.randint(20, 60)
    margin_left = random.randint(20, 60)
    line_spacing = random.randint(5, 20)

    y_position = margin_top

    for _ in range(line_count):
        line_text = generate_random_line()
        x_position = margin_left + random.randint(-10, 30)

        font_size = random.randint(16, 32)
        # The size argument to load_default() requires Pillow 10.1 or newer.
        font = ImageFont.load_default(size=font_size)

        draw.text((x_position, y_position), line_text, fill=0, font=font)
        bbox = draw.textbbox((x_position, y_position), line_text, font=font)
        text_height = bbox[3] - bbox[1]
        y_position += text_height + line_spacing + random.randint(-5, 10)

        if y_position > 420:
            break

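    # Pillow's rotate() turns counter-clockwise for positive angles,
    # so negate the angle to rotate clockwise by `degrees`.
    if degrees != 0:
        img = img.rotate(-degrees, expand=False, fillcolor=255)

    # Threshold to pure black and white (1-bit) pixels.
    img = img.point(lambda x: 0 if x < 128 else 255, "1")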

    filename = f"sample-{index:04d}.png"
    filepath = output_dir / filename
    img.save(filepath)

    return filename, degrees


def main():
    parser = argparse.ArgumentParser(description="Generate synthetic training data for text orientation detection")
    parser.add_argument("-o", "--output", required=True, type=Path, help="output path for images and answersheet.json")
    parser.add_argument("-c", "--count", type=int, required=True, help="number of samples to generate")

    args = parser.parse_args()

    output_path = args.output
    output_path.mkdir(parents=True, exist_ok=True)

    answersheet = []
    for i in range(args.count):
        filename, degrees = generate_image(i, output_path)
        answersheet.append({"filename": filename, "degrees": degrees})

    answersheet_path = output_path / "answersheet.json"
    with open(answersheet_path, "w") as f:
        json.dump(answersheet, f, indent=2)

    print(f"Answersheet saved to {answersheet_path} with {args.count} samples")


if __name__ == "__main__":
    main()

The answersheet.json lists the name of each file and how many degrees of rotation were applied:

[
  {
    "filename": "sample-0000.png",
    "degrees": 334
  },
  {
    "filename": "sample-0001.png",
    "degrees": 187
  },
  {
    "filename": "sample-0002.png",
    "degrees": 295
  }
]
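Since the answersheet is a plain JSON list of objects, reading it back takes only a few lines. As a rough sketch, the hypothetical helper `load_answers` below turns the list into a dictionary so you can look up the rotation for any filename. The function name and the dictionary shape are assumptions made for illustration; they are not part of generate.py.

```python
import json
import tempfile
from pathlib import Path


def load_answers(answersheet_path):
    """Return a {filename: degrees} dict built from an answersheet.json file."""
    entries = json.loads(Path(answersheet_path).read_text())
    return {entry["filename"]: entry["degrees"] for entry in entries}


# Demo using the three entries shown above, written to a temporary file.
example = [
    {"filename": "sample-0000.png", "degrees": 334},
    {"filename": "sample-0001.png", "degrees": 187},
    {"filename": "sample-0002.png", "degrees": 295},
]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "answersheet.json"
    path.write_text(json.dumps(example, indent=2))
    answers = load_answers(path)

print(answers["sample-0001.png"])  # 187
```

With real data, you would pass the path to the answersheet.json inside your output directory instead of the temporary file.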

Generate Datasets

To start, we'll create two datasets: one for training and one for testing our model.

python generate.py --count 800 --output data-train
python generate.py --count 200 --output data-test
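generate.py never seeds Python's random number generator, so every run produces a different set of images and rotations. If you would rather get the same dataset each time, seeding is one option. The sketch below is an assumption about how that could look, not a feature of the script: `pick_rotations` is a hypothetical stand-in for the `random.randint(0, 359)` call in `generate_image`.

```python
import random


def pick_rotations(count, seed=None):
    """Draw `count` rotations the way generate_image does, optionally seeded."""
    rng = random.Random(seed)  # a private generator, so the global one is untouched
    return [rng.randint(0, 359) for _ in range(count)]


# The same seed always gives the same sequence.
print(pick_rotations(5, seed=42) == pick_rotations(5, seed=42))  # True
```

In generate.py itself, the simplest equivalent would be a call to `random.seed(...)` at the top of `main()`. If you try this, use different seeds for the training and test sets; otherwise the test images would simply repeat the first 200 training images.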

Your data will be different, but here's one from my data-train/answersheet.json:

  {
    "filename": "sample-0006.png",
    "degrees": 85
  },

Here's the corresponding sample-0006.png

It looks close enough to 90° that the answer of 85° seems right.

Take a look through some of your sample images and see if they make sense.
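Eyeballing a few images is the main check, but a quick script can catch the boring problems too. The sketch below assumes the layout produced above (PNG files sitting next to an answersheet.json); `check_dataset` is a hypothetical helper, not part of generate.py. It reports answersheet entries whose image is missing, rotations outside 0 to 359, and how many samples fall into each 90° quadrant.

```python
import json
from collections import Counter
from pathlib import Path


def check_dataset(dataset_dir):
    """Return (missing_files, bad_degrees, quadrant_counts) for a generated dataset."""
    dataset_dir = Path(dataset_dir)
    entries = json.loads((dataset_dir / "answersheet.json").read_text())

    # Entries listed in the answersheet that have no image on disk.
    missing = [e["filename"] for e in entries if not (dataset_dir / e["filename"]).exists()]
    # Entries whose rotation is outside the range generate.py can produce.
    bad = [e["filename"] for e in entries if not 0 <= e["degrees"] <= 359]
    # Bucket valid rotations into 0-89, 90-179, 180-269 and 270-359.
    quadrants = Counter(e["degrees"] // 90 for e in entries if 0 <= e["degrees"] <= 359)
    return missing, bad, quadrants


# Usage, assuming the data-train directory generated earlier:
# missing, bad, quadrants = check_dataset("data-train")
# print(missing, bad, sorted(quadrants.items()))
```

With 800 random samples, you would expect roughly 200 per quadrant. A badly lopsided count, or any missing files, suggests something went wrong during generation.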

Loading The Data

Now we're ready to load our datasets for PyTorch.