LydianAI distributed inference

Roadmap

Start with working code on real hardware. Then harden it.

Now: Distributed training PoC

  • FastAPI coordinator
  • FedAvg aggregation
  • Heterogeneous workers
  • NEW vs LEGACY GPU support
  • CLI: start/monitor/results
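
The FedAvg aggregation step above can be sketched as a sample-weighted average of per-parameter updates. This is a minimal illustration, not the coordinator's actual code; the function name, the plain-list parameter representation, and the call signature are assumptions.

```python
# FedAvg sketch: merge worker updates by sample-weighted averaging.
# Real coordinators exchange framework-native state dicts; plain lists
# of floats are used here only to keep the example self-contained.
from typing import Dict, List

def fedavg(updates: List[Dict[str, List[float]]],
           num_samples: List[int]) -> Dict[str, List[float]]:
    """Average each parameter vector, weighted by worker sample counts."""
    total = sum(num_samples)
    merged: Dict[str, List[float]] = {}
    for key in updates[0]:
        acc = [0.0] * len(updates[0][key])
        for update, n in zip(updates, num_samples):
            weight = n / total
            for i, value in enumerate(update[key]):
                acc[i] += weight * value
        merged[key] = acc
    return merged
```

The sample-count weighting matters with heterogeneous workers: a worker that trained on three times the data contributes three times the weight to the merged model.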

Next: Reliability + observability

  • Round watchdogs and timeouts
  • Better worker diagnostics
  • Structured logging + metrics
  • More robust state handling
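
A round watchdog can be as simple as bounding the coordinator's wait on worker results. This is a hedged sketch under assumed names (`run_round`, `collect`); the real timeout and retry policy would live in the coordinator.

```python
# Round watchdog sketch: bound one training round with a timeout so a
# stalled or disconnected worker cannot hang the whole cluster.
import asyncio
from typing import Awaitable, Callable, Optional

async def run_round(collect: Callable[[], Awaitable[object]],
                    timeout_s: float) -> Optional[object]:
    """Await worker results; on timeout, return None so the caller can
    mark the round failed and re-dispatch it."""
    try:
        return await asyncio.wait_for(collect(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return None
```

Returning a sentinel instead of raising keeps the round loop linear: the caller logs the failed round (feeding the structured-logging work above) and decides whether to retry or drop the stragglers.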

Then: Inference-first runtime

A clean inference mode (job submission, request batching, routing) is the next practical wedge once the cluster mechanics are solid.
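
The batching piece of that runtime can be sketched as a micro-batcher that queues incoming jobs and flushes them to a worker in fixed-size groups. The class name, `run_batch` callback, and batch size are illustrative assumptions, not the project's API.

```python
# Micro-batching sketch: accumulate inference requests and run them
# through a model worker as a batch, amortizing per-call overhead.
from typing import Callable, List

class MicroBatcher:
    def __init__(self, run_batch: Callable[[List[str]], List[str]],
                 max_batch: int = 8):
        self.run_batch = run_batch    # worker-side batched inference call
        self.max_batch = max_batch
        self.pending: List[str] = []
        self.results: List[str] = []

    def submit(self, prompt: str) -> None:
        """Queue a request; flush automatically when the batch fills."""
        self.pending.append(prompt)
        if len(self.pending) >= self.max_batch:
            self.flush()

    def flush(self) -> None:
        """Run whatever is queued, even a partial batch (e.g. on a timer)."""
        if self.pending:
            self.results.extend(self.run_batch(self.pending))
            self.pending = []
```

A production version would flush on a deadline as well as on batch size, and route each batch to a worker chosen by the routing layer.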

Later: Optional managed control plane

A hosted control plane is optional. Self-hosting remains first-class.