LydianAI distributed training
Concepts

Architecture

Server/coordinator + worker loop for FedAvg training over Tailscale.


This app is a federated/distributed training PoC built around two components: a central coordinator and distributed workers.

Components

Server / Coordinator (FastAPI)

The coordinator is the single source of truth for training state: it registers workers, serves the current global model at the start of each round, collects their updates, and runs FedAvg aggregation to produce the next global model.
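A minimal sketch of the bookkeeping such a coordinator keeps in memory. The class and method names here are illustrative assumptions, not the actual API; the real FastAPI routes would wrap methods like these:

```python
class Coordinator:
    """Illustrative round/worker bookkeeping behind the coordinator's routes."""

    def __init__(self, global_model):
        self.global_model = global_model
        self.workers = {}   # worker_id -> hardware info announced at registration
        self.round = 0
        self.pending = []   # (update, n_samples) submissions for the current round

    def register(self, worker_id, info):
        self.workers[worker_id] = info

    def get_model(self):
        # Served to workers when they download the current global model.
        return self.round, self.global_model

    def submit(self, update, n_samples):
        # Collect one worker's update; aggregate once every worker has reported.
        self.pending.append((update, n_samples))
        if len(self.pending) == len(self.workers):
            self._aggregate()

    def _aggregate(self):
        # FedAvg: average updates weighted by each worker's data-shard size.
        total = sum(n for _, n in self.pending)
        self.global_model = [
            sum(u[i] * n / total for u, n in self.pending)
            for i in range(len(self.global_model))
        ]
        self.pending = []
        self.round += 1
```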

Workers

Each worker runs on a separate machine (Ubuntu recommended for GPU support):

  1. Register — worker connects to the coordinator and announces itself (hardware info, GPU type)
  2. Poll — worker checks if there’s work to do (a new round)
  3. Download — worker pulls the current global model
  4. Train — worker trains locally on its assigned data shard for N epochs
  5. Submit — worker sends model updates (gradients/weights) back to the coordinator
  6. Repeat — worker loops back to polling for the next round
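The six steps above can be sketched as a single loop. The `api` object stands in for the worker's HTTP calls to the coordinator, and all names are illustrative, not the actual endpoints:

```python
import time

def train_locally(model, shard, epochs):
    """Placeholder for real training: nudge parameters toward the shard mean."""
    mean = sum(shard) / len(shard)
    for _ in range(epochs):
        model = [0.9 * w + 0.1 * mean for w in model]
    return model

def worker_loop(api, shard, epochs=1, rounds=3):
    """Illustrative worker lifecycle against an abstract coordinator API."""
    api.register(info={"gpu": "example-gpu"})            # 1. Register
    for _ in range(rounds):
        while not api.round_ready():                     # 2. Poll
            time.sleep(1)                                # back off between polls
        model = api.download_model()                     # 3. Download
        update = train_locally(model, shard, epochs)     # 4. Train
        api.submit_update(update, n_samples=len(shard))  # 5. Submit
        # 6. Repeat — the next loop iteration polls for the next round
```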

Workers can be:

  - Machines with recent GPUs (set up with requirements_new_gpu.txt)
  - Machines with legacy GPUs (set up with requirements_legacy_gpu.txt)

Networking

Machines communicate over Tailscale (100.x addresses). This avoids NAT issues and keeps the multi-node setup reproducible across different network environments.
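Because Tailscale gives the coordinator a stable 100.x address, a worker only needs that one URL to participate. A minimal sketch, where the environment-variable name and the address shown are assumptions for illustration:

```python
import os

# The coordinator is reached at its stable Tailscale (100.x) address,
# so the same configuration works regardless of the local network or NAT.
COORDINATOR_URL = os.environ.get("COORDINATOR_URL", "http://100.101.102.103:8000")
```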

See the Tailscale networking guide for setup instructions.

FedAvg algorithm

The coordinator implements Federated Averaging:

  1. Initialize — coordinator creates a global model
  2. Distribute — coordinator sends the global model to all workers
  3. Local training — each worker trains on its local data shard
  4. Aggregate — coordinator collects updates and averages them (weighted by data size)
  5. Update — the averaged result becomes the new global model
  6. Repeat — steps 2–5 repeat for N rounds

This approach allows training across machines without sharing raw data — only model updates are transmitted.
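The weighted averaging in step 4 can be sketched in plain Python. The real coordinator would operate on framework tensors (e.g. model state dicts); this flat-list version is only illustrative:

```python
def fedavg(updates):
    """Weighted-average model parameters from workers.

    updates: list of (weights, n_samples) pairs, where weights is a flat
    list of floats and n_samples is the size of that worker's data shard
    (the FedAvg weighting factor).
    """
    total = sum(n for _, n in updates)
    averaged = [0.0] * len(updates[0][0])
    for weights, n in updates:
        for i, w in enumerate(weights):
            averaged[i] += w * (n / total)
    return averaged

# Two workers: one trained on 100 samples, one on 300, so the second
# worker's parameters carry three times the weight in the average.
new_global = fedavg([([1.0, 2.0], 100), ([5.0, 6.0], 300)])
```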

Project structure

lydianai_ml/
├── server/       # FastAPI coordinator
├── worker/       # Worker agent
├── client/       # CLI client (submit_job)
├── common/       # Shared utilities and model definitions
├── requirements_server.txt
├── requirements_new_gpu.txt
└── requirements_legacy_gpu.txt

Data flow

CLI (start) → Server → Workers register
          ↓
    Round N begins
          ↓
Server → global model → Workers
          ↓  (local training on each worker)
Server ← model updates ← Workers
          ↓
    FedAvg aggregation
          ↓
    Round N+1 begins...