LydianAI distributed inference

Roadmap

Start with working code on real hardware. Then harden it.

Now: distributed training PoC

FastAPI coordinator
FedAvg aggregation
Heterogeneous workers
NEW vs LEGACY GPU support
CLI: start/monitor/results
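
For readers unfamiliar with FedAvg, the aggregation step can be sketched roughly as below. This is a minimal illustration, not the project's actual coordinator code: the function name fedavg, the Weights type, and the use of plain lists of floats are all assumptions made for the example. Each worker is assumed to report its parameters together with the number of samples it trained on, and the coordinator averages the parameters weighted by sample count.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def fedavg(updates: List[Tuple[Weights, int]]) -> Weights:
    """Average worker weights, weighting each worker by its sample count.

    `updates` is a list of (weights, num_samples) pairs, one per worker.
    """
    if not updates:
        raise ValueError("no worker updates to aggregate")
    total = sum(n for _, n in updates)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    first, _ = updates[0]
    # Start from zeros with the same shape as the first worker's parameters.
    result: Weights = {name: [0.0] * len(vals) for name, vals in first.items()}

    for weights, n in updates:
        if weights.keys() != first.keys():
            raise ValueError("workers reported different parameter names")
        share = n / total  # this worker's fraction of all training samples
        for name, vals in weights.items():
            if len(vals) != len(result[name]):
                raise ValueError(f"shape mismatch for parameter {name!r}")
            for i, v in enumerate(vals):
                result[name][i] += share * v
    return result


# Usage: the second worker saw three times as many samples,
# so its weights count three times as much in the average.
averaged = fedavg([
    ({"w": [1.0, 2.0]}, 1),
    ({"w": [3.0, 4.0]}, 3),
])
```

Weighting by sample count lets heterogeneous workers contribute in proportion to the data they processed, so a slower machine that trains on fewer samples does not skew the result.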

Next: reliability + observability

Round watchdogs and timeouts
Better worker diagnostics
Structured logging + metrics
More robust state handling
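
One possible shape for a round watchdog is sketched below. It is an assumption about how such a component could look, not a description of the planned implementation: the class name RoundWatchdog, its methods, and the status strings are hypothetical. The idea is that the coordinator tracks which workers have reported for the current round and declares the round timed out once a deadline passes with workers still missing.

```python
import time
from typing import Callable, Iterable, Set


class RoundWatchdog:
    """Tracks worker reports for one round against a deadline (hypothetical API)."""

    def __init__(
        self,
        expected_workers: Iterable[str],
        timeout_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._expected: Set[str] = set(expected_workers)
        self._reported: Set[str] = set()
        self._clock = clock  # injectable so tests need not sleep
        self._deadline = clock() + timeout_s

    def report(self, worker_id: str) -> None:
        """Record that a worker delivered its update; unknown workers are ignored."""
        if worker_id in self._expected:
            self._reported.add(worker_id)

    def missing(self) -> Set[str]:
        """Workers that have not reported yet."""
        return self._expected - self._reported

    def status(self) -> str:
        """Return 'complete', 'timed_out', or 'waiting'."""
        if not self.missing():
            return "complete"
        if self._clock() >= self._deadline:
            return "timed_out"
        return "waiting"


# Usage with a fake clock, so the timeout can be exercised instantly.
now = [0.0]
watchdog = RoundWatchdog({"worker-a", "worker-b"}, timeout_s=10.0, clock=lambda: now[0])
watchdog.report("worker-a")
state_before = watchdog.status()   # deadline not reached, worker-b missing
now[0] = 11.0
state_after = watchdog.status()    # deadline passed, worker-b still missing
```

On a timeout, the set returned by missing() gives a natural starting point for worker diagnostics, since it names exactly which workers stalled the round.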

Then: inference-first runtime

A clean inference mode (jobs, batching, routing) is the next practical wedge once the cluster mechanics are solid.
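
To make "jobs, batching, routing" concrete, here is a small sketch of what those three pieces might look like together. Everything in it is hypothetical: the Job dataclass, the make_batches and route functions, and the least-loaded routing policy are illustrative choices, not a committed design.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Job:
    """A single inference request (hypothetical shape)."""
    job_id: str
    prompt: str


def make_batches(jobs: List[Job], max_batch_size: int) -> List[List[Job]]:
    """Group jobs, in arrival order, into batches of at most max_batch_size."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    return [jobs[i:i + max_batch_size] for i in range(0, len(jobs), max_batch_size)]


def route(batches: List[List[Job]], workers: List[str]) -> Dict[str, List[List[Job]]]:
    """Assign each batch to the worker with the fewest queued jobs so far."""
    if not workers:
        raise ValueError("no workers available")
    assignment: Dict[str, List[List[Job]]] = {w: [] for w in workers}
    load = {w: 0 for w in workers}
    for batch in batches:
        # min() returns the first minimum, so ties go to the earlier worker.
        target = min(workers, key=lambda w: load[w])
        assignment[target].append(batch)
        load[target] += len(batch)
    return assignment


# Usage: five jobs, batches of up to two, spread over two workers.
jobs = [Job(job_id=f"job-{i}", prompt=f"prompt {i}") for i in range(5)]
batches = make_batches(jobs, max_batch_size=2)
plan = route(batches, ["worker-a", "worker-b"])
```

Least-loaded routing is only one option. A real cluster with mixed GPU generations would likely also weigh each worker's capacity, which this sketch does not attempt.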

Later: optional managed control plane

A hosted control plane is optional. Self-hosting remains first-class.