LydianAI distributed training
Quickstart

Run the FedAvg training PoC: start the server, start the workers, launch a training run, and monitor the results.


Quickstart (FedAvg + FastAPI)

This PoC runs federated (FedAvg) training across heterogeneous machines: a coordinator server, one or more GPU or CPU workers, and a CLI client, all connected over Tailscale.


0) Networking (Tailscale)

On all machines:

sudo tailscale up
tailscale ip -4

Note the server's 100.x address; every worker and the CLI client need it to reach the coordinator. See the Tailscale networking guide for more detail.
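
If you want to sanity-check the address you copied, the sketch below tests whether it falls in 100.64.0.0/10, the CGNAT range Tailscale assigns node addresses from. The helper name `is_tailscale_ip` is hypothetical and not part of this repo.

```python
import ipaddress

# Tailscale hands out node addresses from the CGNAT range 100.64.0.0/10,
# i.e. 100.64.0.0 through 100.127.255.255.
TAILSCALE_RANGE = ipaddress.ip_network("100.64.0.0/10")


def is_tailscale_ip(addr: str) -> bool:
    """Return True if addr parses as an IP inside the Tailscale range.

    Hypothetical helper for illustration; not part of lydianai_ml.
    """
    try:
        return ipaddress.ip_address(addr) in TAILSCALE_RANGE
    except ValueError:
        # Not a valid IP address at all (typo, hostname, empty string).
        return False
```

A LAN address such as 192.168.1.10 fails this check, which usually means you copied the wrong interface's address.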


1) Clone the repo

git clone https://github.com/polyplay/lydianai_ml.git
cd lydianai_ml

2) Set up the server (coordinator)

The server runs on macOS or Linux. CPU-only is fine.

python3 -m venv venv
source venv/bin/activate
pip install -U pip
pip install -r requirements_server.txt

Start the coordinator:

python -m server.main --host 0.0.0.0 --port 8000

With --host 0.0.0.0 the server accepts connections on all network interfaces, including the Tailscale one. It listens on port 8000 and exposes a FastAPI API for workers and the CLI client.
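
Before starting workers, it can help to confirm that the coordinator's port is reachable from another machine. The following is a minimal sketch using only the standard library; `port_is_open` is a hypothetical helper, and it only checks that something accepts TCP connections on the port, not that the API itself is healthy.

```python
import socket


def port_is_open(host: str, port: int = 8000, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; not part of lydianai_ml.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, `port_is_open("<server-ip>")` run on a worker machine should return True once the coordinator is up.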


3) Set up workers

Workers run on Ubuntu machines, with or without a GPU. Choose the install path that matches your hardware; see the NEW vs LEGACY GPU guide.

NEW GPU (RTX 20/30/40, A100, H100):

python3.12 -m venv venv
source venv/bin/activate
pip install -U pip
pip install -r requirements_new_gpu.txt

LEGACY GPU (GTX 1080 / 1080 Ti):

python3.10 -m venv venv
source venv/bin/activate
pip install -U pip
pip install -r requirements_legacy_gpu.txt
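
The choice between the two install paths can be written down as a small lookup. This is only a sketch: the function `install_path_for_gpu` and its name-matching rules are assumptions based on the GPU models listed above, and any other model should be checked against the NEW vs LEGACY GPU guide.

```python
import re
from typing import Tuple


def install_path_for_gpu(gpu_name: str) -> Tuple[str, str]:
    """Map a GPU model name to (python interpreter, requirements file).

    Hypothetical helper; it only covers the models named in this quickstart.
    """
    name = gpu_name.upper()
    # LEGACY path: GTX 1080 and GTX 1080 Ti.
    if re.search(r"GTX\s*1080", name):
        return ("python3.10", "requirements_legacy_gpu.txt")
    # NEW path: RTX 20/30/40 series, A100, H100.
    if re.search(r"RTX\s*[234]0\d\d|A100|H100", name):
        return ("python3.12", "requirements_new_gpu.txt")
    raise ValueError(f"GPU not covered by this quickstart: {gpu_name!r}")
```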

Start a worker (point it at the server’s Tailscale IP):

python -m worker.main --server http://<server-ip>:8000

The worker registers with the coordinator, then enters a loop: poll → download model → train → submit update.
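
The loop described above can be sketched as follows. This is not the code in worker.main: the four callables and the `max_rounds` parameter are hypothetical stand-ins for whatever the real worker uses to talk to the coordinator, and a real worker would likely sleep and retry instead of stopping when no task is available.

```python
from typing import Any, Callable, Optional


def run_worker(
    poll: Callable[[], Optional[Any]],
    download_model: Callable[[Any], Any],
    train: Callable[[Any, Any], Any],
    submit_update: Callable[[Any, Any], None],
    max_rounds: int,
) -> int:
    """Run the poll -> download model -> train -> submit update cycle.

    All callables are hypothetical; returns the number of completed rounds.
    """
    completed = 0
    while completed < max_rounds:
        task = poll()                     # ask the coordinator for work
        if task is None:
            break                         # nothing to do; stop in this sketch
        weights = download_model(task)    # fetch the current global model
        update = train(weights, task)     # train locally on this worker's shard
        submit_update(task, update)       # send the local update back
        completed += 1
    return completed
```

Passing the steps in as functions keeps the control flow separate from the transport, so the loop can be exercised with in-memory fakes.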


4) Launch a training run

From any machine that can reach the server:

python -m client.submit_job --server http://<server-ip>:8000 start

This starts FedAvg training. The coordinator shards CIFAR-10 across registered workers and begins round 1.
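
To make the two ideas concrete, here is a minimal sketch of sharding and of the FedAvg aggregation step, in plain Python. It assumes contiguous, near-equal shards and parameters stored as flat lists of floats; the coordinator's actual sharding strategy and tensor format may differ, and both function names are hypothetical.

```python
from typing import Dict, List, Sequence


def shard_indices(num_samples: int, num_workers: int) -> List[range]:
    """Split sample indices into contiguous, near-equal shards (an assumption)."""
    base, extra = divmod(num_samples, num_workers)
    shards, start = [], 0
    for w in range(num_workers):
        size = base + (1 if w < extra else 0)  # first `extra` shards get one more
        shards.append(range(start, start + size))
        start += size
    return shards


def fedavg(
    updates: Sequence[Dict[str, List[float]]],
    sample_counts: Sequence[int],
) -> Dict[str, List[float]]:
    """FedAvg: average each parameter, weighted by each worker's sample count."""
    total = sum(sample_counts)
    averaged = {}
    for name in updates[0]:
        length = len(updates[0][name])
        averaged[name] = [
            sum(u[name][i] * n for u, n in zip(updates, sample_counts)) / total
            for i in range(length)
        ]
    return averaged
```

With the 50,000-image CIFAR-10 training set and three workers, `shard_indices(50000, 3)` gives shards of 16,667, 16,667, and 16,666 samples. A worker with a larger shard contributes proportionally more to the averaged model.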


5) Monitor and fetch results

# Check training status
python -m client.submit_job --server http://<server-ip>:8000 status

# Watch round-by-round progress
python -m client.submit_job --server http://<server-ip>:8000 monitor

# Fetch final results
python -m client.submit_job --server http://<server-ip>:8000 results

# List connected workers
python -m client.submit_job --server http://<server-ip>:8000 workers
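
If you prefer to drive these commands from a script, a thin wrapper around the CLI is one option. The sketch below only assembles the same command lines shown above and runs them with subprocess; `build_command` and `run_client` are hypothetical names, and the wrapper must be run from the repo root so that `client.submit_job` is importable.

```python
import subprocess
import sys
from typing import List

# Subcommands shown in this quickstart.
SUBCOMMANDS = ("start", "status", "monitor", "results", "workers")


def build_command(server_url: str, subcommand: str) -> List[str]:
    """Build the argv for `python -m client.submit_job --server URL SUBCOMMAND`."""
    if subcommand not in SUBCOMMANDS:
        raise ValueError(f"unknown subcommand: {subcommand!r}")
    return [sys.executable, "-m", "client.submit_job",
            "--server", server_url, subcommand]


def run_client(server_url: str, subcommand: str) -> subprocess.CompletedProcess:
    """Run the CLI client and capture its output (hypothetical wrapper)."""
    return subprocess.run(
        build_command(server_url, subcommand),
        capture_output=True,
        text=True,
        check=False,  # let the caller inspect returncode and stderr
    )
```

For example, `run_client("http://<server-ip>:8000", "status").stdout` holds whatever the status command printed.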

Next steps