Understanding Training Output
2026-02-10
This is part of a series on ML for generalists; you can find the start here.
This is the output I get from our train.py on my machine, an aging Intel Mac:
% python train.py
/Users/brian/src/whichway/.venv/lib/python3.12/site-packages/torch/nn/modules/lazy.py:181: UserWarning: Lazy modules are a new feature under heavy development so changes to the API or functionality can happen at any moment.
warnings.warn('Lazy modules are a new feature under heavy development '
[Epoch 1/10] after 200.1 seconds:
loss: 0.2828
within 5 degrees: 45.50%
within 10 degrees: 75.50%
within 20 degrees: 94.00%
[Epoch 2/10] after 188.5 seconds:
loss: 0.0387
within 5 degrees: 46.00%
within 10 degrees: 72.50%
within 20 degrees: 95.50%
[Epoch 3/10] after 188.5 seconds:
loss: 0.0270
within 5 degrees: 46.50%
within 10 degrees: 79.00%
within 20 degrees: 97.00%
[Epoch 4/10] after 187.3 seconds:
loss: 0.0200
within 5 degrees: 54.50%
within 10 degrees: 84.00%
within 20 degrees: 98.00%
[Epoch 5/10] after 212.1 seconds:
loss: 0.0147
within 5 degrees: 54.00%
within 10 degrees: 86.50%
within 20 degrees: 97.50%
[Epoch 6/10] after 218.6 seconds:
loss: 0.0111
within 5 degrees: 64.50%
within 10 degrees: 91.00%
within 20 degrees: 97.00%
[Epoch 7/10] after 218.8 seconds:
loss: 0.0094
within 5 degrees: 63.00%
within 10 degrees: 90.00%
within 20 degrees: 98.00%
[Epoch 8/10] after 215.1 seconds:
loss: 0.0083
within 5 degrees: 65.50%
within 10 degrees: 92.00%
within 20 degrees: 98.50%
[Epoch 9/10] after 210.2 seconds:
loss: 0.0066
within 5 degrees: 68.50%
within 10 degrees: 93.50%
within 20 degrees: 98.50%
[Epoch 10/10] after 210.6 seconds:
loss: 0.0054
within 5 degrees: 70.00%
within 10 degrees: 93.50%
within 20 degrees: 98.50%
After our first training epoch, we're already getting within 20° of the right answer for 94% of our images. That's a good start. By epoch 10, loss has dropped to 0.0054 and we're within 5° for 70% of our images. Our model is learning and our numbers are headed in the right direction.
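Those "within N degrees" lines are just the fraction of test images whose predicted angle lands within N degrees of the true answer. Here's a minimal sketch of how such a metric can be computed; the function name `within_degrees` and the sample tensors are hypothetical, not taken from train.py, and this version ignores angle wraparound (359° vs. 1°):

```python
import torch

def within_degrees(preds, targets, threshold):
    """Fraction of predictions whose absolute error is at most `threshold` degrees."""
    errors = (preds - targets).abs()
    return (errors <= threshold).float().mean().item()

preds = torch.tensor([12.0, 87.0, 45.0, 180.0])
targets = torch.tensor([10.0, 95.0, 44.0, 150.0])  # errors: 2, 8, 1, 30 degrees

print(f"within  5 degrees: {within_degrees(preds, targets, 5) * 100:.2f}%")   # 50.00%
print(f"within 10 degrees: {within_degrees(preds, targets, 10) * 100:.2f}%")  # 75.00%
```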
Wait, What's Loss Again?
Loss is a measure of how wrong our model's predictions are. We're using Mean Squared Error (MSE), which penalises "wronger" answers more heavily because it squares the difference between the prediction and the correct answer.
Lower is better.
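You can see the squaring effect with PyTorch's built-in MSE loss. In this toy example (the tensors are made up for illustration), one prediction is off by 1 and another by 5; the 5× larger error produces a 25× larger loss:

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()  # mean of squared differences
target = torch.tensor([10.0, 10.0])

close = torch.tensor([11.0, 9.0])  # each off by 1
far = torch.tensor([15.0, 5.0])    # each off by 5

print(mse(close, target).item())  # 1.0
print(mse(far, target).item())    # 25.0
```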
Steadyish Improvement
Here's our loss over all 10 epochs:
loss: 0.2828
loss: 0.0387
loss: 0.0270
loss: 0.0200
loss: 0.0147
loss: 0.0111
loss: 0.0094
loss: 0.0083
loss: 0.0066
loss: 0.0054
It's decreasing steadily, slowing as the model improves.
More Epochs?
Is there more value to get out of our training data? Let's crank num_epochs up to 20 and see what happens:
[Epoch 18/20] after 211.4 seconds:
loss: 0.0005
within 5 degrees: 80.00%
within 10 degrees: 95.00%
within 20 degrees: 99.00%
[Epoch 19/20] after 211.3 seconds:
loss: 0.0004
within 5 degrees: 81.50%
within 10 degrees: 95.00%
within 20 degrees: 99.00%
[Epoch 20/20] after 207.5 seconds:
loss: 0.0003
within 5 degrees: 78.00%
within 10 degrees: 94.50%
within 20 degrees: 99.00%
There's improvement, but we're at the point of diminishing returns. In fact, our last epoch performs slightly worse.
Sometimes you'll see an occasional epoch where loss doesn't change much or even increases, and accuracy might drop too. If that happens once or twice, it isn't something to be concerned about.
A sustained drop in performance, on the other hand, would indicate a problem.
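One common way to tell a one-off bad epoch from a sustained slide is early stopping: keep training while the best loss keeps improving, and stop once it hasn't improved for a few epochs in a row. Our train.py doesn't do this; the sketch below, with a hypothetical `should_stop` helper and made-up loss lists, just shows the idea:

```python
def should_stop(losses, patience=3):
    """Stop when no epoch in the last `patience` beat the best loss before them."""
    if len(losses) <= patience:
        return False
    best_so_far = min(losses[:-patience])
    return min(losses[-patience:]) >= best_so_far

print(should_stop([0.28, 0.04, 0.03, 0.02, 0.015]))  # False: still improving
print(should_stop([0.28, 0.04, 0.05, 0.06, 0.07]))   # True: 3 epochs without a new best
```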
Training Volume
We generated 1,000 total sample images back in this post:
python generate.py --count 800 --output data-train
python generate.py --count 200 --output data-test
That 80% train / 20% test split is a common way to divide a dataset.
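When your data isn't generated into separate directories like ours, you typically shuffle and slice one dataset yourself. A minimal sketch (the `train_test_split` helper here is hypothetical, not part of our scripts; libraries like scikit-learn provide a ready-made version):

```python
import random

def train_test_split(items, train_fraction=0.8, seed=0):
    """Shuffle a copy of `items` and cut it at `train_fraction`."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed so the split is reproducible
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]

train, test = train_test_split(range(1000))
print(len(train), len(test))  # 800 200
```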
It's rare to have a problem where you can generate as much data as you want. You're more likely to have a limited labelled dataset (samples with a known correct answer).
What would happen if we had less data?