b4rtaz/distributed-llama — Gitpedia

Distributed Llama

Connect home devices into a powerful cluster to accelerate LLM inference. More devices mean faster performance, leveraging tensor parallelism and high-speed synchronization over Ethernet.

Supports Linux, macOS, and Windows. Optimized for ARM and x86_64 AVX2 CPUs.

How to Run

News

16 Sep 2025 - Qwen 3 MoE models are now supported on Vulkan.
5 Sep 2025 - Qwen 3 MoE models are now supported on CPU.
3 Aug 2025 - Qwen 3 0.6B, 1.7B, 8B and 14B models are now supported.
23 Mar 2025 - 🌋 Experimental Vulkan support
12 Feb 2025 - 🚧 Merged the fundamental codebase refactor
9 Jan 2025 - 🍎 Llama 3.3 70B on 4 x Mac Mini M4 Pro 24GB RAM

🔥 Setup Root Node by Single Command

Python 3 and C++ compiler required. The command will download the model and the tokenizer.

Model	Size	Command
Llama 3.1 8B Instruct Q40	6.32 GB	`python launch.py llama3_1_8b_instruct_q40`
Llama 3.1 405B Instruct Q40	238 GB	`python launch.py llama3_1_405b_instruct_q40`.
Llama 3.2 1B Instruct Q40	1.7 GB	`python launch.py llama3_2_1b_instruct_q40`
Llama 3.2 3B Instruct Q40	3.4 GB	`python launch.py llama3_2_3b_instruct_q40`
Llama 3.3 70B Instruct Q40	40 GB	`python launch.py llama3_3_70b_instruct_q40`
DeepSeek R1 Distill Llama 8B Q40	6.32 GB	`python launch.py deepseek_r1_distill_llama_8b_q40`
Qwen 3 0.6B Q40	0.9 GB	`python launch.py qwen3_0.6b_q40`
Qwen 3 1.7B Q40	2.2 GB	`python launch.py qwen3_1.7b_q40`
Qwen 3 8B Q40	6.7 GB	`python launch.py qwen3_8b_q40`
Qwen 3 14B Q40	10.9 GB	`python launch.py qwen3_14b_q40`
Qwen 3 30B A3B Q40	17.0 GB	`python launch.py qwen3_30b_a3b_q40`

🛠️ Convert Model Manually

🤗 How to Convert Hugging Face Model

🚧 Known Limitations

You can run Distributed Llama only on 1, 2, 4... 2^n nodes.
The maximum number of nodes is equal to the number of KV heads in the model #70.
Only the following quantizations are supported #183:
- q40 model with q80 buffer-float-type
- f32 model with f32 buffer-float-type

👷 Architecture

[🔀 SWITCH OR ROUTER]
      | | | |
      | | | |_______ 🔸 device1 (ROOT)     10.0.0.1
      | | |_________ 🔹 device2 (WORKER 1) 10.0.0.2:9999
      | |___________ 🔹 device3 (WORKER 2) 10.0.0.3:9999
      |_____________ 🔹 device4 (WORKER 3) 10.0.0.4:9999
                        ...

The project is split up into two parts:

🔸 Root node - it's responsible for loading the model and weights and forward them to workers. Also, it synchronizes the state of the neural network. The root node is also a worker, it processes own slice of the neural network.
🔹 Worker node - it processes own slice of the neural network. It doesn't require any configuration related to the model.

You always need the root node and you can add 2^n - 1 worker nodes to speed up the inference. The RAM usage of the neural network is split up across all nodes. The root node requires a bit more RAM than worker nodes.

🎹 Commands

dllama inference - run the inference with a simple benchmark,
dllama chat - run the CLI chat,
dllama worker - run the worker node,
dllama-api - run the API server.

<details> <summary>🎹 Supported Arguments</summary>

<br />Inference, Chat, API

Argument	Description	Example
`--model <path>`	Path to model.	`dllama_model_meta-llama-3-8b_q40.m`
`--tokenizer <path>`	Tokenizer to model.	`dllama_tokenizer_llama3.t`
`--buffer-float-type <type>`	Float precision of synchronization.	`q80`
`--workers <workers>`	Addresses of workers (ip:port), separated by space.	`10.0.0.1:9999 10.0.0.2:9999`
`--max-seq-len <n>`	The maximum sequence length, it helps to reduce the RAM usage.	`4096`

Inference, Chat, Worker, API

Argument	Description	Example
`--nthreads <n>`	Amount of threads. Don't set a higher value than number of CPU cores.	`4`

Worker, API

Argument	Description	Example
`--host <addr>`	Binding address.	`127.0.0.1`
`--port <port>`	Binding port.	`9999`

Inference

Argument	Description	Example
`--prompt <prompt>`	Initial prompt.	`"Hello World"`
`--steps <steps>`	Number of tokens to generate.	`256`

</details>

📊 Measurements

Please check the discussions section, where many measurements were published on different configurations.

✋ Contribution

Feel free to contribute to this project. For small changes, simply create a new merge request. For larger changes, please create an issue to discuss your plans. Please follow these guidelines when contributing:

Make only minimal changes and avoid modifying files that are not necessary.
Ensure the code is compatible across all supported systems and CPUs.
This repository is maintained in English.

💡 License

This project is released under the MIT license.

📖 Citation

@misc{dllama,
  author = {Bartłomiej Tadych},
  title = {Distributed Llama},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/b4rtaz/distributed-llama}},
  commit = {7eb77ca93ec0d502e28d36b6fb20039b449cbea4}
}