Training across machines you already own.
The insight behind federated ML is simple: if each machine trains on its own slice of data and shares only its model update, you don't need a shared GPU pool. What looks like a cluster constraint becomes a protocol.
Most distributed training tools assume you're going to provision a homogeneous cluster — same GPU model, same CUDA version, same memory budget. If your actual compute is spread across a MacBook Pro, a workstation running an old GTX 1080, and a Jetson, that assumption is fatal. The tools won't cooperate.
Federated Averaging (FedAvg) works around this. The coordinator holds the global model. Each worker gets a copy, trains locally for a configurable number of epochs, and sends back a model update. The coordinator averages those updates into a new global model and starts the next round. Workers never talk to each other. The coordinator doesn't care what GPU a worker has, as long as it can run the model and send back an update.
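The aggregation step above can be sketched in a few lines. This is a minimal illustration, not the repository's code: it assumes each update arrives as a dict of flattened weights plus the number of local samples the worker trained on, and it weights the average by sample count as standard FedAvg does. The names `fedavg` and `Weights` are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical wire format: parameter name -> flattened list of values.
Weights = Dict[str, List[float]]


def fedavg(updates: List[Tuple[Weights, int]]) -> Weights:
    """Average worker weights, weighted by each worker's local sample count."""
    if not updates:
        raise ValueError("need at least one worker update")
    total = sum(n for _, n in updates)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    first_weights, _ = updates[0]
    averaged: Weights = {}
    for name, values in first_weights.items():
        acc = [0.0] * len(values)
        for weights, n in updates:
            share = n / total  # this worker's fraction of all samples
            for i, v in enumerate(weights[name]):
                acc[i] += v * share
        averaged[name] = acc
    return averaged


# Two hypothetical workers; the second trained on three times as much data,
# so its weights count three times as much in the result.
global_weights = fedavg([
    ({"layer.w": [0.0, 4.0]}, 100),
    ({"layer.w": [4.0, 0.0]}, 300),
])
```

Weighting by sample count keeps a slow worker with a small shard from pulling the global model as hard as a fast worker with a large one.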
Pascal GPUs (GTX 1080, 1080 Ti, Quadro P series) have CUDA compute capability 6.1 (sm_61). They require older PyTorch wheels and sometimes an older Python. Most tools either don't mention this or fail silently. The install path is explicit here: the repository includes a LEGACY path that pins the right Torch version and documents the sm_61 constraint. This is a supported configuration, not an afterthought.
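One way a setup script might choose between the two install paths is to look at the GPU's compute capability. The sketch below is an assumption about how that could work, not the repository's logic: the function name `install_path` and the 7.0 cutoff are illustrative only.

```python
from typing import Tuple


def install_path(capability: Tuple[int, int]) -> str:
    """Map a CUDA compute capability (major, minor) to an install path name.

    The cutoff is an assumption for illustration: anything older than
    compute capability 7.0, which includes Pascal's 6.1, goes to LEGACY.
    """
    return "LEGACY" if capability < (7, 0) else "NEW"


# With PyTorch installed, the tuple could come from
# torch.cuda.get_device_capability(), which returns (major, minor).
gtx_1080 = install_path((6, 1))  # Pascal, sm_61
rtx_3090 = install_path((8, 6))  # Ampere, sm_86
```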
GPU utilization: idle vs. training
Typical gaming GPU: in use ~3 h/week. During a training round: 100%.
Most GPUs sit idle almost all the time. Federated training uses what's already there.
How the coordinator and workers collaborate
Server / Coordinator (macOS CPU OK)
Hosts a FastAPI API, shards CIFAR-10 across workers, aggregates updates using FedAvg, and tracks metrics per round.
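Sharding can be as simple as dealing shuffled sample indices out to workers. The following is a sketch under assumptions: `shard_indices` is a hypothetical helper, and the real coordinator may split the data differently (for example, by class or by worker speed). The 50,000 figure is the size of the CIFAR-10 training set.

```python
import random
from typing import List


def shard_indices(num_samples: int, num_workers: int, seed: int = 0) -> List[List[int]]:
    """Split sample indices into disjoint, near-equal shards, one per worker."""
    if num_workers < 1:
        raise ValueError("need at least one worker")
    indices = list(range(num_samples))
    # A fixed seed makes the split reproducible across coordinator restarts.
    random.Random(seed).shuffle(indices)
    # Worker w takes every num_workers-th index, starting at position w.
    return [indices[w::num_workers] for w in range(num_workers)]


# CIFAR-10 has 50,000 training images; split them across three workers.
shards = shard_indices(50_000, 3)
```

Each worker then only needs its own index list to build a local training subset.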
Workers (Ubuntu GPU/CPU)
Each worker registers, downloads the current model, trains on its local data, and submits an update. Then it waits for the next round. Workers can be modern GPUs, legacy Pascal GPUs, or CPU-only.
Effective compute scales with workers. Add a machine, add compute — no shared pool, no cloud account.
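The register, download, train, submit cycle can be simulated in-process to show the shape of the protocol. Everything here is a stand-in: `InMemoryCoordinator`, its method names, and `local_train` are hypothetical and replace the real HTTP API and training code. For brevity this stub averages updates without weighting.

```python
from typing import Dict, List


class InMemoryCoordinator:
    """Stand-in for the HTTP API; every method name here is hypothetical."""

    def __init__(self, weights: List[float]) -> None:
        self.weights = weights
        self.workers: List[str] = []
        self.pending: Dict[str, List[float]] = {}

    def register(self, worker_id: str) -> None:
        self.workers.append(worker_id)

    def get_model(self) -> List[float]:
        return list(self.weights)  # each worker gets its own copy

    def submit(self, worker_id: str, weights: List[float]) -> None:
        self.pending[worker_id] = weights
        # Once every registered worker has reported, close the round.
        if len(self.pending) == len(self.workers):
            n = len(self.pending)
            self.weights = [sum(col) / n for col in zip(*self.pending.values())]
            self.pending.clear()


def local_train(weights: List[float], step: float) -> List[float]:
    """Placeholder for real local training: nudge every weight by `step`."""
    return [w + step for w in weights]


def run_round(coordinator: InMemoryCoordinator, worker_id: str, step: float) -> None:
    """One worker's part of a round: download, train locally, submit."""
    weights = coordinator.get_model()
    coordinator.submit(worker_id, local_train(weights, step))


coordinator = InMemoryCoordinator([0.0, 0.0])
for worker_id in ("laptop", "workstation"):
    coordinator.register(worker_id)
run_round(coordinator, "laptop", 1.0)
run_round(coordinator, "workstation", 3.0)
```

Note that the workers only ever call the coordinator, never each other, which is what lets a new machine join without any change to the existing ones.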
Key design goals
Heterogeneous hardware
Mixed compute is the default: different GPUs, different speeds, and even CPU-only machines.
Legacy GPU support
Pascal GPUs (sm_61) require older Torch wheels and often older Python. The PoC supports a LEGACY install path and a legacy Torch mode.
Simple networking
All machines join the same Tailscale tailnet. Workers connect to the server using its stable Tailscale 100.x address.
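A worker could sanity-check the configured server address before connecting. This is a sketch, assuming the coordinator is reached over plain HTTP: the helper name `coordinator_url` and the default port 8000 are hypothetical. The range check relies on Tailscale assigning IPv4 addresses from the 100.64.0.0/10 block.

```python
import ipaddress

# Tailscale assigns node IPv4 addresses from the CGNAT block 100.64.0.0/10.
TAILSCALE_RANGE = ipaddress.ip_network("100.64.0.0/10")


def coordinator_url(ip: str, port: int = 8000) -> str:
    """Build the coordinator base URL, rejecting non-tailnet addresses.

    The default port is a placeholder, not the project's actual setting.
    """
    addr = ipaddress.ip_address(ip)
    if addr not in TAILSCALE_RANGE:
        raise ValueError(f"{ip} is not in the Tailscale range {TAILSCALE_RANGE}")
    return f"http://{addr}:{port}"


url = coordinator_url("100.101.102.103")
```

Failing early on a LAN or public address gives a clearer error than a connection timeout halfway through worker startup.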
What v1 is (and isn't)
It is
- FedAvg training rounds
- FastAPI coordinator + worker loop
- CLI client to start runs, monitor progress, and fetch results
- NEW vs LEGACY GPU install paths
It isn't
- A production scheduler
- A hosted "managed service"
- A full marketplace / multi-tenant platform
Roadmap
Start with working code on real hardware. Then harden it.
Now: Distributed training PoC
- FastAPI coordinator
- FedAvg aggregation
- Heterogeneous workers
- NEW vs LEGACY GPU support
- CLI: start/monitor/results
Next: reliability + observability
- Round watchdogs and timeouts
- Better worker diagnostics
- Structured logging + metrics
- More robust state handling
Then: Inference-first runtime
A clean inference mode (jobs, batching, routing) is the next practical wedge once the cluster mechanics are solid.
Later: optional managed control plane
Hosted control plane is optional. Self-hosting remains first-class.