Roadmap
Start with working code on real hardware. Then harden it.
Now: Distributed training PoC
- FastAPI coordinator
- FedAvg aggregation
- Heterogeneous workers
- Support for both NEW and LEGACY GPUs
- CLI: start/monitor/results
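To make the FedAvg item concrete, here is a minimal sketch of the aggregation step, assuming each worker reports its parameters as a dict of flat float lists along with the number of samples it trained on. The `fedavg` function name and the `(weights, num_samples)` update shape are hypothetical, chosen for illustration; they are not taken from the project's code.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def fedavg(updates: List[Tuple[Weights, int]]) -> Weights:
    """Average worker weights, weighting each worker by its sample count."""
    if not updates:
        raise ValueError("no worker updates to aggregate")
    total = sum(n for _, n in updates)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    reference = updates[0][0]
    aggregated: Weights = {}
    for name, values in reference.items():
        acc = [0.0] * len(values)
        for weights, num_samples in updates:
            share = num_samples / total  # this worker's fraction of all samples
            for i, v in enumerate(weights[name]):
                acc[i] += v * share
        aggregated[name] = acc
    return aggregated


if __name__ == "__main__":
    slow_worker = ({"w": [1.0, 2.0]}, 1)   # saw 1 sample
    fast_worker = ({"w": [3.0, 4.0]}, 3)   # saw 3 samples
    print(fedavg([slow_worker, fast_worker]))
```

Weighting by sample count is what lets heterogeneous workers contribute in proportion to the data they actually processed, rather than each counting equally.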
Next: reliability + observability
- Round watchdogs and timeouts
- Better worker diagnostics
- Structured logging + metrics
- More robust state handling
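One way a round watchdog could work, sketched under assumptions: the coordinator waits for worker updates up to a deadline, then cancels stragglers and proceeds with whatever arrived. The `run_round` function and the shape of `worker_calls` are hypothetical; only the `asyncio` calls are standard library.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple


async def run_round(
    worker_calls: Dict[str, Callable[[], Awaitable[object]]],
    timeout_s: float,
) -> Tuple[Dict[str, object], List[str]]:
    """Collect worker results until the deadline; report who missed it."""
    if not worker_calls:
        return {}, []

    # Map each task back to its worker id so stragglers can be named.
    tasks = {
        asyncio.ensure_future(call()): worker_id
        for worker_id, call in worker_calls.items()
    }
    done, pending = await asyncio.wait(list(tasks), timeout=timeout_s)

    # Cancel stragglers and wait for the cancellations to settle.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)

    results: Dict[str, object] = {}
    failed = [tasks[t] for t in pending]
    for task in done:
        if task.cancelled() or task.exception() is not None:
            failed.append(tasks[task])  # worker errored before the deadline
        else:
            results[tasks[task]] = task.result()
    return results, sorted(failed)
```

Returning the failed worker ids alongside the results gives the diagnostics and metrics items above something concrete to record per round.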
Then: Inference-first runtime
A clean inference mode (jobs, batching, routing) is the next practical wedge once the cluster mechanics are solid.
Later: optional managed control plane
A hosted control plane is optional. Self-hosting remains first-class.