How federated ML works
This first LydianAI app is a federated/distributed training PoC: a FastAPI coordinator runs FedAvg rounds while workers (CPU-only or GPU) train locally and submit model updates.
Server / Coordinator (macOS CPU OK)
Hosts a FastAPI API, shards CIFAR-10 across workers, aggregates updates using FedAvg, and tracks metrics per round.
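The FedAvg step can be sketched as a sample-weighted average of worker updates. This is a minimal illustration: the `fedavg` function name and the dict-of-floats parameter layout are assumptions, not the PoC's actual code, which would operate on Torch state dicts.

```python
# Minimal FedAvg sketch: average worker updates weighted by local sample count.
# Layout is illustrative; a real coordinator would average Torch tensors per key.

def fedavg(updates):
    """updates: list of (num_samples, {param_name: value}) pairs."""
    total = sum(n for n, _ in updates)
    keys = updates[0][1].keys()
    return {
        k: sum(n * params[k] for n, params in updates) / total
        for k in keys
    }

# Two workers: one trained on 100 samples, one on 300.
merged = fedavg([(100, {"w": 0.0}), (300, {"w": 1.0})])
# merged["w"] == 0.75 (pulled toward the larger shard)
```

Weighting by sample count is what makes FedAvg robust to uneven shards: a worker holding 3x the data contributes 3x the weight to the global model.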
Workers (Ubuntu GPU/CPU)
Register → poll for work → download the current model → train locally → submit an update → repeat. Workers can run on modern GPUs, legacy Pascal GPUs, or CPU only.
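The worker lifecycle above can be sketched as a simple loop. The endpoint paths (`/register`, `/task`, `/update`) and payload shapes here are assumptions for illustration, not the PoC's actual API; `client` stands in for any HTTP client pointed at the coordinator.

```python
# Hedged sketch of the register → poll → train → submit loop.
import time

def worker_loop(client, train_fn, poll_interval=5.0, max_rounds=None):
    # Register once; the coordinator assigns this worker an id.
    worker_id = client.post("/register")["worker_id"]
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        task = client.get(f"/task?worker={worker_id}")
        if task is None:                 # no round open yet: keep polling
            time.sleep(poll_interval)
            continue
        weights = task["weights"]        # download current global model
        update, n = train_fn(weights)    # train locally on this worker's shard
        client.post("/update", {"worker": worker_id, "update": update, "n": n})
        rounds += 1
    return rounds
```

Passing the client and `train_fn` in as parameters keeps the loop identical across CPU-only and GPU workers; only `train_fn` changes.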
Key design goals
Heterogeneous hardware
Mixed compute is the default: different GPUs, different speeds, and even CPU-only machines.
Legacy GPU support
Pascal GPUs (sm_61) require older Torch wheels and often older Python. The PoC supports a LEGACY install path and a legacy Torch mode.
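A setup script might branch on the GPU's compute capability to pick the install path. The `install_path` helper and the exact cutoff below are illustrative assumptions; on a real worker the capability tuple would come from `torch.cuda.get_device_capability(0)`.

```python
# Hedged sketch: map a GPU's compute capability to the NEW vs LEGACY install path.

def install_path(compute_capability):
    """compute_capability: (major, minor) tuple, e.g. (6, 1) for Pascal sm_61."""
    major, minor = compute_capability
    if (major, minor) <= (6, 1):   # Pascal and older: legacy Torch wheels
        return "LEGACY"
    return "NEW"                   # current Torch wheels

# Illustrative cards (capability values are the standard NVIDIA ones):
assert install_path((6, 1)) == "LEGACY"   # GTX 1080 class (Pascal)
assert install_path((8, 6)) == "NEW"      # RTX 3090 class (Ampere)
```

CPU-only machines skip this branch entirely and take whatever Torch build their Python supports.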
Simple networking
All machines join the same Tailscale tailnet. Workers connect to the server using stable 100.x IPs.
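In practice that looks like the snippet below. `tailscale ip -4` is the real Tailscale CLI command; the worker script name and `--server` flag are illustrative, not the PoC's actual interface.

```shell
# On the server, print its stable tailnet IPv4:
tailscale ip -4

# On each worker, point the client at that address
# (script name and flag are illustrative):
python worker.py --server http://100.x.y.z:8000
```

Because tailnet IPs are stable across reboots and networks, workers need no rediscovery logic.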
What v1 is (and isn't)
It is
- FedAvg training rounds
- FastAPI coordinator + worker loop
- CLI client to start runs, monitor rounds, and fetch results
- NEW vs LEGACY GPU install paths
It isn't
- A production scheduler
- A hosted "managed service"
- A full marketplace / multi-tenant platform
Roadmap
Start with working code on real hardware. Then harden it.
Now: Distributed training PoC
- FastAPI coordinator
- FedAvg aggregation
- Heterogeneous workers
- NEW vs LEGACY GPU support
- CLI: start/monitor/results
Next: reliability + observability
- Round watchdogs and timeouts
- Better worker diagnostics
- Structured logging + metrics
- More robust state handling
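The watchdog item above can be sketched as a collect-with-deadline step: aggregate once every worker reports, or when the round times out, dropping stragglers. Names and the polling interface are assumptions for illustration.

```python
# Hedged sketch of a round watchdog: wait for updates until all workers
# report or a deadline passes, then aggregate whatever arrived.
import time

def collect_round(expected_workers, poll_updates, timeout_s=60.0, now=time.monotonic):
    deadline = now() + timeout_s
    received = {}
    while len(received) < expected_workers and now() < deadline:
        # poll_updates() yields (worker_id, update) pairs that have come in.
        for worker_id, update in poll_updates():
            received[worker_id] = update
    # Stragglers past the deadline are simply excluded from this round.
    return received
```

Injecting `now` makes the deadline logic testable without real waiting, and dropping stragglers keeps one slow Pascal or CPU worker from stalling the whole round.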
Then: Inference-first runtime
A clean inference mode (jobs, batching, routing) is the next practical wedge once the cluster mechanics are solid.
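The batching piece, for instance, can start as simply as grouping queued requests before a forward pass; `make_batches` is a hypothetical helper, not part of the PoC.

```python
# Minimal sketch of request batching for an inference mode: split the
# pending queue into groups of at most max_batch requests each.

def make_batches(queue, max_batch=8):
    """queue: list of pending requests, in arrival order."""
    return [queue[i:i + max_batch] for i in range(0, len(queue), max_batch)]

# 20 queued requests with max_batch=8:
sizes = [len(b) for b in make_batches(list(range(20)))]
# sizes == [8, 8, 4]
```

Routing and job tracking layer on top of this: each batch becomes one unit of work assignable to whichever worker is free.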
Later: optional managed control plane
A hosted control plane is optional; self-hosting remains first-class.