LydianAI Open-source tooling

Training across machines you already own.

The insight behind federated ML is simple: if each machine trains on its own slice of data and shares only its update, you don't need a shared GPU pool. What looks like a cluster constraint becomes a protocol.

Most distributed training tools assume you're going to provision a homogeneous cluster: same GPU model, same CUDA version, same memory budget. If your actual compute is spread across a MacBook Pro, a workstation running an old GTX 1080, and a Jetson, that assumption fails and the tools won't cooperate.

Federated Averaging (FedAvg) works around this. The coordinator holds the global model. Each worker gets a copy, trains locally for a configurable number of epochs, and sends back an update. In FedAvg that update is the result of local training (the new weights, or their difference from the global model) rather than a single raw gradient. The coordinator averages those updates into a new global model and starts the next round. Workers never talk to each other. The coordinator doesn't care what GPU a worker has, as long as it can run the model and send back an update.
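
The aggregation step described above can be sketched as a sample-weighted average. This is a minimal illustration, not the repository's code: it assumes each worker returns its full set of trained weights plus the number of samples it trained on, and it uses plain Python lists where a real implementation would use tensors. The names `fed_avg` and `Weights` are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: parameter name -> flattened values.
Weights = Dict[str, List[float]]


def fed_avg(updates: List[Tuple[Weights, int]]) -> Weights:
    """Sample-weighted average of worker models.

    `updates` holds one (weights, num_samples) pair per worker.
    """
    if not updates:
        raise ValueError("need at least one worker update")
    total = sum(n for _, n in updates)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    first_weights, _ = updates[0]
    averaged: Weights = {}
    for name, values in first_weights.items():
        acc = [0.0] * len(values)
        for weights, n in updates:
            share = n / total  # workers that saw more data count for more
            for i, v in enumerate(weights[name]):
                acc[i] += share * v
        averaged[name] = acc
    return averaged


# Two workers: one trained on 300 samples, the other on 100.
global_model = fed_avg([
    ({"layer.weight": [1.0, 2.0]}, 300),
    ({"layer.weight": [3.0, 6.0]}, 100),
])
print(global_model)  # {'layer.weight': [1.5, 3.0]}
```

Weighting by sample count means a slow worker with a small shard still contributes, but in proportion to the data it actually saw.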

Pascal GPUs (GTX 1080, 1080 Ti, Quadro P series) have CUDA compute capability 6.1, written sm_61 as a build target. They require older PyTorch wheels and sometimes older Python. Most tools either don't mention this or fail silently. The install path is explicit here: the repository includes a LEGACY path that pins the right Torch version and documents the sm_61 constraint. This is a supported configuration, not an afterthought.
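
One way to decide which install path a machine needs is to read its compute capability. The sketch below is an assumption-laden illustration: `install_path` and `detect_capability` are hypothetical helpers, the "CPU" label is made up for machines without CUDA, and the cutoff at capability 7.0 is a guess for demonstration (the repository's own docs define the real boundary). The only PyTorch calls used are `torch.cuda.is_available()` and `torch.cuda.get_device_capability()`.

```python
from typing import Optional, Tuple


def install_path(capability: Optional[Tuple[int, int]]) -> str:
    """Map a CUDA compute capability to an install path label.

    None means no CUDA device was found. The (7, 0) cutoff is an
    assumption for illustration only.
    """
    if capability is None:
        return "CPU"  # hypothetical label, not from the project
    return "LEGACY" if capability < (7, 0) else "NEW"


def detect_capability() -> Optional[Tuple[int, int]]:
    """Return (major, minor) for device 0, or None without torch or CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


print(install_path((6, 1)))  # GTX 1080 / 1080 Ti -> LEGACY
print(install_path(detect_capability()))
```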

GPU utilization: idle vs. training

~3 h/week: typical gaming GPU
100%: during a training round

Most GPUs sit idle almost all the time. Federated training uses what's already there.

Local only: no raw data leaves the worker
sm_61: Pascal GPU, fully supported
HTTP: no direct worker-to-worker traffic

How the coordinator and workers collaborate

[Diagram: the coordinator sends the model down to three workers (RTX 4090, modern GPU; GTX 1080 Ti, Pascal / sm_61; CPU-only MacBook or Jetson), and each worker sends its gradient update back up.]

Server / Coordinator (macOS CPU OK)

Hosts a FastAPI API, shards CIFAR-10 across workers, aggregates updates using FedAvg, and tracks metrics per round.
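
Sharding a dataset across workers can be as simple as dealing out shuffled indices. The following is a minimal sketch under stated assumptions: `shard_indices` is a hypothetical helper, the worker names are placeholders, and the round-robin split is one reasonable choice rather than the project's documented strategy. The 50,000 figure is the size of the CIFAR-10 training set.

```python
import random
from typing import Dict, List


def shard_indices(
    num_samples: int, worker_ids: List[str], seed: int = 0
) -> Dict[str, List[int]]:
    """Shuffle sample indices once, then deal them out round-robin."""
    if not worker_ids:
        raise ValueError("need at least one worker")
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps shards reproducible
    return {
        wid: indices[k::len(worker_ids)]
        for k, wid in enumerate(worker_ids)
    }


# CIFAR-10 has 50,000 training images.
shards = shard_indices(50_000, ["rtx4090", "gtx1080ti", "jetson"])
print({wid: len(idx) for wid, idx in shards.items()})
```

An even split ignores hardware speed; a coordinator that knows a worker is slow could hand it a smaller shard instead.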

Workers (Ubuntu GPU/CPU)

Each worker registers, downloads the current model, trains on its local data, and submits an update. Then it waits for the next round. Workers can be modern GPUs, legacy Pascal GPUs, or CPU-only.
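
The worker cycle above (register, download, train, submit) can be sketched against an abstract coordinator. Everything named here is hypothetical: the `Coordinator` interface, its method names, and the in-memory stand-in exist only so the loop runs without a network, and `train_locally` is a toy replacement for real training. Polling or waiting between rounds is left out.

```python
from typing import Dict, List, Protocol, Tuple

Weights = Dict[str, List[float]]


class Coordinator(Protocol):
    """Hypothetical client interface; the real HTTP routes may differ."""

    def register(self, worker_id: str) -> None: ...
    def fetch_model(self) -> Tuple[int, Weights]: ...
    def submit(self, worker_id: str, round_id: int,
               weights: Weights, num_samples: int) -> None: ...


def train_locally(weights: Weights, data: List[float], lr: float = 0.1) -> Weights:
    """Toy stand-in for training: nudge every value toward the data mean."""
    target = sum(data) / len(data)
    return {k: [v + lr * (target - v) for v in vals]
            for k, vals in weights.items()}


def run_worker(coord: Coordinator, worker_id: str,
               data: List[float], rounds: int) -> None:
    coord.register(worker_id)
    for _ in range(rounds):
        round_id, weights = coord.fetch_model()   # download current model
        updated = train_locally(weights, data)    # raw data stays local
        coord.submit(worker_id, round_id, updated, len(data))


class InMemoryCoordinator:
    """Single-worker fake so the loop can run without a server."""

    def __init__(self) -> None:
        self.round_id = 0
        self.weights: Weights = {"w": [0.0]}
        self.workers: List[str] = []

    def register(self, worker_id: str) -> None:
        self.workers.append(worker_id)

    def fetch_model(self) -> Tuple[int, Weights]:
        return self.round_id, self.weights

    def submit(self, worker_id: str, round_id: int,
               weights: Weights, num_samples: int) -> None:
        if round_id != self.round_id:
            return  # stale update from an earlier round; ignore it
        self.weights = weights  # with one worker, the average is the update
        self.round_id += 1


coord = InMemoryCoordinator()
run_worker(coord, "gtx1080ti", data=[1.0, 1.0, 1.0], rounds=3)
print(coord.round_id, coord.weights)
```

Because only the trained weights and a sample count cross the interface, the coordinator never needs to know what hardware produced them.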

[Chart: effective compute with 1, 2, and 4 workers]

Effective compute scales with workers. Add a machine, add compute — no shared pool, no cloud account.


Key design goals

Heterogeneous hardware

Mixed compute is the default: different GPUs, different speeds, and even CPU-only machines.

Legacy GPU support

Pascal GPUs (sm_61) require older Torch wheels and often older Python. The PoC supports a LEGACY install path and a legacy Torch mode.

Simple networking

All machines join the same Tailscale tailnet. Workers connect to the server using stable 100.x IPs.
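
Tailscale assigns IPv4 addresses from the 100.64.0.0/10 block, which is where the 100.x addresses come from. A worker could sanity-check a configured server address before connecting; the helper below is a hypothetical sketch using only the standard library, not part of the project.

```python
import ipaddress

# Tailscale hands out IPv4 addresses from the CGNAT block 100.64.0.0/10.
TAILNET_V4 = ipaddress.ip_network("100.64.0.0/10")


def is_tailnet_address(host: str) -> bool:
    """True if `host` is an IPv4 literal inside the Tailscale range."""
    try:
        return ipaddress.ip_address(host) in TAILNET_V4
    except ValueError:
        return False  # not an IP literal (e.g. a hostname)


print(is_tailnet_address("100.101.102.103"))  # True
print(is_tailnet_address("192.168.1.20"))     # False
```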


What v1 is (and isn't)

It is

  • FedAvg training rounds
  • FastAPI coordinator + worker loop
  • CLI client to start runs, monitor progress, and fetch results
  • NEW vs LEGACY GPU install paths

It isn't

  • A production scheduler
  • A hosted "managed service"
  • A full marketplace / multi-tenant platform

Roadmap

Start with working code on real hardware. Then harden it.

Now: Distributed training PoC

  • FastAPI coordinator
  • FedAvg aggregation
  • Heterogeneous workers
  • NEW vs LEGACY GPU support
  • CLI: start/monitor/results

Next: reliability + observability

  • Round watchdogs and timeouts
  • Better worker diagnostics
  • Structured logging + metrics
  • More robust state handling

Then: Inference-first runtime

A clean inference mode (jobs, batching, routing) is the next practical wedge once the cluster mechanics are solid.

Later: optional managed control plane

Hosted control plane is optional. Self-hosting remains first-class.
