Running on GPUs

2026-02-16

This is part of a series on ML for generalists, you can find the start here.

On my machine, a dataset of 2,000 training samples takes about 10 minutes per epoch on my CPU. That's not terrible for one run, but I need to iterate and experiment. I might change a hyperparameter (like the learning rate), or tweak the architecture to add more layers, or bigger layers, or... you get the idea. Waiting around makes that painful.

Ten minutes per epoch means I can't realistically try more than a handful of changes in a session. Context switching is rough. Kick off a training run, go do something else, come back, check the results, try to remember what I was testing, hope I scribbled a note somewhere. Repeat. I can only change one variable at a time if I want to learn what's happening, so progress crawls along.

Enter GPUs.

My CPU might have 8 cores and 16 threads. Each core is powerful, with deep pipelines, branch prediction, large caches, and out-of-order execution. They're built to handle complex, unpredictable workloads where each thread might be doing something completely different.

A GPU swaps all of that per-core fanciness for sheer numbers. An RTX 3060 has 3,584 cores. They're individually simple, no branch prediction, tiny caches, and they all need to be running the same instruction at the same time. But there are thousands of them!

Training a neural network involves a lot of matrix multiplication, which is "multiply a bunch of numbers together and add them up." These operations are massive and regular. Multiply two 512x512 matrices together and you get 262,144 output values, each the sum of 512 multiply-and-add operations, over 134 million in total, all independent of each other, all following the same pattern. This is exactly the kind of workload GPUs were built for. My 16 CPU threads can chew through it, but 3,584 GPU cores doing the same simple operation in lockstep will take a fraction of the time.
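To put numbers on that, here's a back-of-the-envelope sketch in plain Python. It isn't a fast implementation, it just counts the work involved:

```python
# Count the work in multiplying two n x n matrices.
# Each output value is the dot product of a row and a column:
# n multiply-and-add operations per output value.
n = 512
outputs = n * n                   # values in the result matrix
macs_per_output = n               # one multiply-add per element of the dot product
total_macs = outputs * macs_per_output

print(f"{outputs:,} output values")        # 262,144
print(f"{total_macs:,} multiply-adds")     # 134,217,728
```

Every one of those 134 million operations looks identical and none depends on any other, which is why you can hand one (or a few) to each of thousands of simple cores.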

The same applies to inference (running a trained model on new data). Lots of independent multiplications, all happening at once.

If you're on an NVIDIA card, CUDA just works. It's been the standard for ML compute for years, and just about every framework supports it out of the box. When a tutorial says "GPU support" they mean NVIDIA.

MPS (Metal Performance Shaders) is Apple's answer for running PyTorch on Mac GPUs. In my experience, it can be as slow as CPU for some operations, and it nerfs your desktop performance while training. My Mac becomes sluggish because I'm hammering the same GPU that's drawing my UI.

MLX is Apple's newer ML framework, built specifically for Apple Silicon. It's promising, but support is still lagging; Apple's own research teams publish models that need extra patches to work with MLX. The big bonus, though: if your problem fits within what MLX supports, it's probably the cheapest way to get a lot of GPU-accessible RAM. Apple's unified memory architecture means the GPU can access all your system RAM, not just dedicated VRAM. A MacBook Pro with 96GB of unified memory gives you 96GB of GPU-accessible memory. The equivalent from NVIDIA would be much more expensive and a lot louder.

The single most important spec for ML workloads is VRAM (Video RAM). This is the memory on the GPU itself. Your model, your training data batches, and all the intermediate calculations during training need to fit in VRAM. Run out, and your training stops.

Maximise VRAM for your budget.

The best VRAM/€ deal you'll find is in ex-data centre hardware. I've seen NVIDIA Tesla P40s with 24GB of VRAM for less than €200 on eBay, shipping from China.

BUT these cards are designed for data centre airflow. They're passively cooled, they have no fans of their own and they will run hot. They expect to live in a chassis with constant, aggressive air flow. Putting one in a desktop case means you'll need to hack together your own cooling. It's a bigger gamble, and you may find yourself cable-tying fans to a heat sink. But if it works out, 24GB of VRAM for under €200 is exceptional value.

Another catch: the P40 is from NVIDIA's Pascal generation. Models from the Volta generation onwards (like the V100) support NVLink, a high-speed interconnect that lets you combine multiple cards and share memory across them. The cheap P40s don't support NVLink, so each card is on its own. For most home setups this won't matter, but it's worth knowing if you're planning a multi-GPU build.

If you want something that just works in a normal PC, the RTX 3060 with 12GB of VRAM is a safe bet. Brand new from a high street retailer, I can walk in today and buy one for €399. Used prices can be half that. I bought a second-hand gaming PC (complete with an impressive amount of flashing RGB lights) with a 12GB 3060 for less than €600.

A 3060 will train CNNs comfortably, let you fine-tune smaller LLMs, and give you enough room to experiment with most things you'd want to try while learning.

It won't let you run large LLMs (without quantisation, see the next section).

Watch out for the 8GB variant. NVIDIA sells a version of the 3060 with only 8GB of VRAM. Same name, much less useful for what we're doing. It has a narrower memory bus too, making it much slower even when working within its 8GB. I don't know a good way to check other than running nvidia-smi or some equivalent software tool.
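If the card is already in a machine, PyTorch can tell you how much VRAM it sees. A small sketch (it prints a message instead if there's no CUDA device):

```python
import torch

def report_vram():
    """Print the name and total VRAM of each CUDA device, or say there's none."""
    if not torch.cuda.is_available():
        print("No CUDA device available")
        return None
    sizes = []
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        gb = props.total_memory / (1024 ** 3)
        sizes.append(gb)
        print(f"{props.name}: {gb:.1f} GB")
    return sizes

report_vram()
```

On a 12GB 3060 you'd expect this to report something close to 12 GB; the 8GB variant gives itself away immediately.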

A good deal on a 2060 or similar older card with decent VRAM is fine too. The important thing is the memory.

You might have seen open-weight LLMs listed by parameter count, something like Qwen3 8B. The 8B means 8 billion parameters, so 8 billion 32-bit floating point numbers. Loading that model at full precision requires 8 billion x 4 bytes, about 32GB of VRAM. To actually run it at full precision, a safe rule of thumb is to multiply the storage size by 1.3, so you'd need nearly 42GB of VRAM to use it.

Luckily, the actual value of those weights is less important than the relationship between each value. So we can use quantisation to reduce each weight to a 16-bit, 8-bit or even 4-bit number. Now our Qwen3 8B can run in 21GB, 10.4GB or even 5.2GB of VRAM. You lose a small amount of accuracy with each reduction, but not as much as you might think. The 4-bit is noticeably less capable than the 32-bit when you're chatting to it, but it may still perform perfectly well at the task you want it to do.
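That arithmetic is easy to generalise. A small calculator, using the same 1.3 overhead rule of thumb as above (treat the output as a rough estimate, not a guarantee):

```python
def vram_needed_gb(params_billions, bits_per_weight, overhead=1.3):
    """Rough VRAM estimate: the weights themselves, times an overhead
    factor for activations, caches and framework bookkeeping."""
    bytes_per_weight = bits_per_weight / 8
    weights_gb = params_billions * bytes_per_weight  # 1B params ~ 1 GB per byte-per-weight
    return weights_gb * overhead

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {vram_needed_gb(8, bits):.1f} GB")
```

For an 8B model this gives 41.6, 20.8, 10.4 and 5.2 GB, which lines up with the figures above. Run it against your own card's VRAM before downloading a model.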

In practice, most models on HuggingFace are already published at 16-bit precision, so the real-world starting point for quantisation is often 16-bit. You might see variants like Qwen-8B-Q4_K_M; this is a 4-bit quantisation.

Of all the things I didn't want to learn, this was the one I wanted least: how to make sense of NVIDIA's model naming scheme. Let's take "RTX 3060" as an example: the first digits (30) tell you the generation, and the last two (60) tell you the performance tier within that generation.

So a 2080 is a higher-tier card from an older generation, while the 3060 is a mid-tier card from a newer generation. For us, the tier matters less than the VRAM. A 3060 with 12GB is more useful than a 3070 with 8GB if your model doesn't fit in 8GB.

It's hard to compare across generations. A 2080 Ti will outperform a 3060 in raw compute, but the 3060 has newer architecture features and potentially more VRAM depending on the variant. When in doubt, go for VRAM.

The good news is that PyTorch makes this fairly straightforward. You need to do three things:

device = torch.device(
    "cuda" if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available()
    else "cpu"
)

This picks the best available option: NVIDIA GPU first, then Apple GPU, then CPU as a fallback. In practice, I don't bother with MPS. YMMV.

model = model.to(device)

Easy, eh?

samples = samples.to(device)
labels = labels.to(device)

Our training loop stays the same. PyTorch ships with its own CUDA runtime libraries, so as long as you have an NVIDIA driver installed, it all just works. It's worth the hassle of finding an NVIDIA card.
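Putting the three steps together, here's a minimal sketch of a device-aware epoch. The model, loss function and data are stand-ins; substitute whatever you're already training:

```python
import torch
import torch.nn as nn

# Step 1: pick the best available device.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in model and data for illustration.
model = nn.Linear(10, 1).to(device)           # Step 2: move the model.
optimiser = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
batches = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(5)]

for samples, labels in batches:
    samples = samples.to(device)              # Step 3: move each batch.
    labels = labels.to(device)
    optimiser.zero_grad()
    loss = loss_fn(model(samples), labels)
    loss.backward()
    optimiser.step()
```

Nothing in the loop body changes between CPU and GPU; the device decision is made once at the top.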

MPS support can be flaky for less common operations. Some PyTorch ops aren't implemented for MPS yet, and you'll get runtime errors that go away if you switch to CPU. This, combined with the poor performance I get, means I go CPU if I can't go CUDA.

Debugging is easier on CPU. If something is going wrong with your model, run it on CPU first. GPU error messages can be particularly cryptic, and operations happen asynchronously, so the error might not point to the actual problem.

Moving data between CPU and GPU is slow. Don't shuffle tensors back and forth to your CPU in a loop. Load your data onto the GPU, do all your computation there, and only move results back to the CPU when you need to (e.g. for logging or saving).
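A common way this bites you: calling .item() (or .cpu()) on a tensor every iteration forces a GPU-to-CPU transfer each time. A sketch of the difference, using stand-in per-batch losses:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
losses = [torch.rand(1, device=device) for _ in range(100)]  # stand-in batch losses

# Slow: one GPU -> CPU transfer per iteration.
total_slow = sum(loss.item() for loss in losses)

# Better: accumulate on the device, transfer once at the end.
total_fast = torch.stack(losses).sum().item()
```

Both give the same answer; the second only crosses the CPU/GPU boundary once.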

Here's the same model we've been working with throughout this tutorial. On my CPU, a single epoch with 2,000 training samples takes a little over 10 minutes.

On a dataset five times larger (10,000 training samples), running on an NVIDIA card a teenager decided was too slow:

[Epoch 10/10] after 57.7 seconds:
              loss: 0.0008
  within 5 degrees: 98.40%
 within 10 degrees: 99.70%
 within 20 degrees: 100.00%

Ten epochs on 10,000 samples in under a minute. I can try something, see the result almost immediately, then try something else.

That's it! That's the full tutorial, or part one at least. You can check out a repo containing all the code here.