How it works
This first LydianAI app is a federated/distributed training PoC: a FastAPI coordinator runs FedAvg rounds while workers (CPU-only or GPU) train locally and submit model updates.
Server / Coordinator (macOS CPU OK)
Hosts a FastAPI API, shards CIFAR-10 across workers, aggregates updates using FedAvg, and tracks metrics per round.
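The aggregation step can be sketched as a weighted average of worker updates, weighted by how many local samples each worker trained on. This is a minimal illustration of FedAvg, not the PoC's actual implementation; the plain-list weight format and function name are assumptions.

```python
# Minimal FedAvg sketch (illustrative, not the PoC's actual code).
# Each worker submits (weights, num_samples); weights is a dict of
# parameter-name -> list of floats.

def fedavg(updates):
    """Aggregate a list of (weights_dict, num_samples) tuples."""
    total = sum(n for _, n in updates)
    aggregated = {}
    for name in updates[0][0]:
        # Weight each worker's parameters by its share of the total data.
        aggregated[name] = [
            sum(w[name][i] * n / total for w, n in updates)
            for i in range(len(updates[0][0][name]))
        ]
    return aggregated
```

With two workers holding 1 and 3 samples, a parameter of 1.0 and 3.0 respectively averages to 2.5, since the larger shard counts three times as much.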
Workers (Ubuntu GPU/CPU)
Register → poll → download model → train locally → submit update → repeat. Workers can be modern GPUs, legacy Pascal GPUs, or CPU-only.
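The worker lifecycle above can be sketched as a simple loop. The `client` interface, endpoint names, and `train_local` callback here are assumptions for illustration, not the PoC's real API.

```python
# Sketch of the worker loop: register -> poll -> download -> train -> submit.
# `client` is an assumed HTTP wrapper around the coordinator's API, and
# `train_local` stands in for the local training step (GPU or CPU).

def worker_loop(client, train_local, rounds):
    worker_id = client.register()                 # register with the coordinator
    for _ in range(rounds):
        task = client.poll(worker_id)             # wait for a round assignment
        weights = client.download_model(task["model_id"])
        update, num_samples = train_local(weights, task["shard"])
        client.submit_update(worker_id, update, num_samples)  # report back
```

Because the loop only needs `client` and `train_local`, the same code runs unchanged on modern GPUs, legacy Pascal cards, or CPU-only machines.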
Key design goals
Heterogeneous hardware
Mixed compute is the default: different GPUs, different speeds, and even CPU-only machines.
Legacy GPU support
Pascal GPUs (sm_61) require older Torch wheels and often older Python. The PoC supports a LEGACY install path and a legacy Torch mode.
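One way to pick between the install paths is to branch on the GPU's CUDA compute capability, which `torch.cuda.get_device_capability()` reports as a `(major, minor)` tuple (Pascal sm_61 is `(6, 1)`). The function below is a hedged sketch; the `major < 7` threshold and the path names are assumptions for illustration.

```python
# Illustrative selector for the NEW vs LEGACY install path.
# `capability` is the (major, minor) tuple from
# torch.cuda.get_device_capability(), or None on CPU-only workers.

def install_path(capability):
    if capability is None:
        return "CPU"       # no GPU: plain CPU wheels
    major, _ = capability
    # Assumed cutoff: Pascal-era cards (sm_6x) need the LEGACY path
    # with older Torch wheels (and often an older Python).
    return "LEGACY" if major < 7 else "NEW"
```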
Simple networking
All machines join the same Tailscale tailnet. Workers connect to the server using stable 100.x IPs.
What v1 is (and isn’t)
It is
- FedAvg training rounds
- FastAPI coordinator + worker loop
- CLI client to start rounds, monitor progress, and fetch results
- NEW vs LEGACY GPU install paths
It isn’t
- A production scheduler
- A hosted “managed service”
- A full marketplace / multi-tenant platform