Federated ML — LydianAI

Most distributed training tools assume you're going to provision a homogeneous cluster — same GPU model, same CUDA version, same memory budget. If your actual compute is spread across a MacBook Pro, a workstation running an old GTX 1080, and a Jetson, that assumption is fatal. The tools won't cooperate.

Federated Averaging (FedAvg) works around this. The coordinator holds the global model. Each worker gets a copy, trains locally for a configurable number of epochs, and sends back a model update. The coordinator averages those updates into a new global model and starts the next round. Workers never talk to each other. The coordinator doesn't care what GPU a worker has, as long as the worker can run the model and send back an update.
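The aggregation step can be sketched in a few lines. This is a minimal illustration, not the project's code: it assumes each update carries the worker's locally trained weights plus the number of samples it trained on, which is the standard FedAvg formulation. The `fedavg` function and the `Weights` layout are hypothetical names chosen for this example.

```python
from typing import Dict, List, Tuple

Weights = Dict[str, List[float]]  # parameter name -> flattened values (assumed layout)


def fedavg(updates: List[Tuple[Weights, int]]) -> Weights:
    """Sample-weighted average of worker weights (the FedAvg aggregation step).

    Each update is (weights, num_samples): the worker's locally trained
    weights and how many examples it trained on this round.
    """
    if not updates:
        raise ValueError("need at least one worker update")
    total = sum(n for _, n in updates)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    first_weights, _ = updates[0]
    averaged: Weights = {name: [0.0] * len(vals) for name, vals in first_weights.items()}
    for weights, n in updates:
        share = n / total  # workers that trained on more data count for more
        for name, vals in weights.items():
            for i, v in enumerate(vals):
                averaged[name][i] += share * v
    return averaged


# Two workers: one that trained on 300 samples and one that trained on 100.
global_weights = fedavg([
    ({"fc.weight": [1.0, 2.0]}, 300),
    ({"fc.weight": [5.0, 6.0]}, 100),
])
print(global_weights)  # {'fc.weight': [2.0, 3.0]}
```

Nothing in the function depends on what hardware produced an update, which is why a slow CPU worker and a fast GPU worker can feed the same round.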

Pascal GPUs (GTX 1080, 1080 Ti, Quadro P series) have CUDA compute capability 6.1 (sm_61). They require older PyTorch wheels and sometimes an older Python. Most tools either don't mention this or fail silently. Here the install path is explicit: the repository includes a LEGACY path that pins the right Torch version and documents the sm_61 constraint. This is a supported configuration, not an afterthought.
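As a rough illustration of how a machine could be sorted into one path or the other, the sketch below maps a compute capability tuple to a path name. The cutoff at 7.0 and the helper names `install_path` and `sm_name` are assumptions made for this example; the repository's own rule may differ.

```python
from typing import Tuple

# Assumption for illustration: anything older than Volta (7.0) takes the
# LEGACY path. The repository's real cutoff may differ.
LEGACY_BELOW: Tuple[int, int] = (7, 0)


def install_path(capability: Tuple[int, int]) -> str:
    """Map a CUDA compute capability (major, minor) to an install path name."""
    return "LEGACY" if capability < LEGACY_BELOW else "NEW"


def sm_name(capability: Tuple[int, int]) -> str:
    """Format a capability tuple the way CUDA architecture flags spell it."""
    major, minor = capability
    return f"sm_{major}{minor}"


# With PyTorch installed and a GPU visible, the tuple can be read with
# torch.cuda.get_device_capability(); here it is hard-coded for a GTX 1080.
gtx_1080 = (6, 1)
print(sm_name(gtx_1080), install_path(gtx_1080))  # sm_61 LEGACY
print(sm_name((8, 6)), install_path((8, 6)))      # sm_86 NEW
```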

How the coordinator and workers collaborate

Server / Coordinator (macOS CPU OK)

Runs a FastAPI service, shards CIFAR-10 across workers, aggregates updates using FedAvg, and tracks metrics per round.
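One simple way to shard a dataset is to shuffle the sample indices once and deal them out round-robin. The sketch below assumes equal-sized random shards; the coordinator's actual strategy is not described here and might, for example, give faster workers larger shards. `shard_indices` is a hypothetical helper.

```python
import random
from typing import List

CIFAR10_TRAIN_SIZE = 50_000  # size of the CIFAR-10 training split


def shard_indices(num_samples: int, num_workers: int, seed: int = 0) -> List[List[int]]:
    """Shuffle sample indices once, then deal them out round-robin."""
    if num_workers < 1:
        raise ValueError("need at least one worker")
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # fixed seed -> reproducible shards
    return [indices[w::num_workers] for w in range(num_workers)]


shards = shard_indices(CIFAR10_TRAIN_SIZE, num_workers=4)
print([len(s) for s in shards])  # [12500, 12500, 12500, 12500]
```

Sending index lists instead of image data keeps the registration payload small, provided every worker can load CIFAR-10 locally.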

Workers (Ubuntu GPU/CPU)

Each worker registers, downloads the current model, trains on its local data, and submits an update. Then it waits for the next round. Workers can be modern GPUs, legacy Pascal GPUs, or CPU-only.
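The worker lifecycle above can be sketched as a loop. To keep the example runnable without a network, it talks to an in-memory stand-in instead of the real HTTP API, and local training is replaced by a placeholder. Every class, method, and function name here is hypothetical, and a real worker would also poll or block between rounds while other workers finish.

```python
from typing import List, Optional

Weights = List[float]


class InMemoryCoordinator:
    """Stand-in for the HTTP coordinator; every method name is hypothetical."""

    def __init__(self, weights: Weights, total_rounds: int) -> None:
        self.weights = weights
        self.round = 0
        self.total_rounds = total_rounds
        self.workers: List[str] = []

    def register(self, name: str) -> int:
        self.workers.append(name)
        return len(self.workers) - 1  # worker id

    def current_model(self) -> Optional[Weights]:
        # None signals that all rounds are finished.
        return list(self.weights) if self.round < self.total_rounds else None

    def submit(self, worker_id: int, update: Weights) -> None:
        # With a single worker, "aggregation" is just adopting its update.
        self.weights = update
        self.round += 1


def train_locally(weights: Weights) -> Weights:
    """Placeholder for local epochs: nudges each weight instead of running SGD."""
    return [w - 0.1 for w in weights]


def run_worker(coordinator: InMemoryCoordinator, name: str) -> int:
    """Register once, then loop: fetch the model, train, submit, until done."""
    worker_id = coordinator.register(name)
    rounds_done = 0
    while (model := coordinator.current_model()) is not None:
        coordinator.submit(worker_id, train_locally(model))
        rounds_done += 1
    return rounds_done


coord = InMemoryCoordinator(weights=[1.0, 2.0], total_rounds=3)
print(run_worker(coord, "jetson"), coord.round)  # 3 3
```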

Effective compute scales with workers. Add a machine, add compute — no shared pool, no cloud account.

Key design goals

Heterogeneous hardware

Mixed compute is the default: different GPUs, different speeds, and even CPU-only machines.

Legacy GPU support

Pascal GPUs (sm_61) require older Torch wheels and often older Python. The PoC supports a LEGACY install path and a legacy Torch mode.

Simple networking

All machines join the same Tailscale tailnet. Workers connect to the server using its stable 100.x.y.z tailnet address.
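A worker can sanity-check the server address before connecting. Tailscale assigns tailnet IPv4 addresses from the 100.64.0.0/10 block, so a membership test with the standard `ipaddress` module is enough. The helper names and the default port 8000 are assumptions for this sketch, not values taken from the project.

```python
import ipaddress

# Tailscale assigns tailnet IPv4 addresses from the CGNAT block 100.64.0.0/10.
TAILNET_V4 = ipaddress.ip_network("100.64.0.0/10")


def is_tailnet_ip(addr: str) -> bool:
    """True if addr is an IPv4 address inside the Tailscale range."""
    try:
        return ipaddress.ip_address(addr) in TAILNET_V4
    except ValueError:
        return False  # not a valid IP address at all


def server_url(addr: str, port: int = 8000) -> str:
    """Build the coordinator URL a worker would call; port 8000 is an assumption."""
    if not is_tailnet_ip(addr):
        raise ValueError(f"{addr} is not a tailnet address")
    return f"http://{addr}:{port}"


print(server_url("100.101.102.103"))  # http://100.101.102.103:8000
print(is_tailnet_ip("192.168.1.20"))  # False
```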

Roadmap

Start with working code on real hardware. Then harden it.

Now: Distributed training PoC

FastAPI coordinator
FedAvg aggregation
Heterogeneous workers
NEW vs LEGACY GPU support
CLI: start/monitor/results

Next: reliability + observability

Round watchdogs and timeouts
Better worker diagnostics
Structured logging + metrics
More robust state handling

Then: Inference-first runtime

A clean inference mode (jobs, batching, routing) is the next practical wedge once the cluster mechanics are solid.

Later: optional managed control plane

Hosted control plane is optional. Self-hosting remains first-class.

Training across machines you already own.