Architecture
This app is a proof of concept (PoC) for federated/distributed training, built around two components: a central coordinator and distributed workers.
Components
Server / Coordinator (FastAPI)
- Hosts the REST API on port 8000
- Assigns and shards data (CIFAR-10 baseline) across registered workers
- Aggregates model updates using FedAvg (Federated Averaging)
- Tracks metrics (loss, accuracy) per round
- Manages training state: round progression, worker registration, result collection
- Runs on macOS or Linux — CPU-only is fine for the coordinator
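The coordinator's bookkeeping boils down to tracking registered workers, the current round, and which results have arrived. A minimal sketch of that state machine is below; the class and method names are hypothetical (the real server wraps this logic in FastAPI routes), but it shows how round progression can follow naturally from result collection:

```python
from dataclasses import dataclass, field

@dataclass
class CoordinatorState:
    """Hypothetical sketch of the coordinator's round/worker bookkeeping."""
    round_num: int = 0
    workers: dict = field(default_factory=dict)   # worker_id -> hardware info
    results: dict = field(default_factory=dict)   # worker_id -> update this round

    def register(self, worker_id, hw_info):
        """Record a worker and the hardware it announced."""
        self.workers[worker_id] = hw_info

    def submit(self, worker_id, update):
        """Collect one worker's update; advance the round when all have reported.

        Returns True when the round just completed (aggregation — FedAvg —
        would run at that point), False otherwise.
        """
        self.results[worker_id] = update
        if len(self.results) == len(self.workers):
            self.round_num += 1
            self.results.clear()
            return True
        return False
```

This keeps all training state in memory, which is adequate for a PoC; a production coordinator would also need timeouts for stragglers and persistence across restarts.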
Workers
Each worker runs on a separate machine (Ubuntu recommended for GPU support):
- Register — worker connects to the coordinator and announces itself (hardware info, GPU type)
- Poll — worker checks if there’s work to do (a new round)
- Download — worker pulls the current global model
- Train — worker trains locally on its assigned data shard for N epochs
- Submit — worker sends model updates (gradients/weights) back to the coordinator
- Repeat — worker loops back to polling for the next round
Workers can be:
- NEW GPU (RTX 20/30/40, A100, H100) — current PyTorch + CUDA
- LEGACY GPU (GTX 1080 / 1080 Ti) — pinned legacy Torch stack
- CPU-only — slower but functional
Networking
Machines communicate over Tailscale (100.x addresses). This avoids NAT issues and keeps the multi-node setup reproducible across different network environments.
See the Tailscale networking guide for setup instructions.
FedAvg algorithm
The coordinator implements Federated Averaging:
- Initialize — coordinator creates a global model
- Distribute — coordinator sends the global model to all workers
- Local training — each worker trains on its local data shard
- Aggregate — coordinator collects updates and averages them (weighted by data size)
- Update — the averaged result becomes the new global model
- Repeat — steps 2–5 repeat for N rounds
This approach allows training across machines without sharing raw data — only model updates are transmitted.
Project structure
lydianai_ml/
├── server/ # FastAPI coordinator
├── worker/ # Worker agent
├── client/ # CLI client (submit_job)
├── common/ # Shared utilities and model definitions
├── requirements_server.txt
├── requirements_new_gpu.txt
└── requirements_legacy_gpu.txt
Data flow
CLI (start) → Server → Workers register
↓
Round N begins
↓
Server sends global model → Workers
↓
Local training
↓
Server ← model updates ← Workers
↓
FedAvg aggregation
↓
Round N+1 begins...