GitPedia

Llm server

Auto-tuned launcher for GGUF models on llama.cpp / ik_llama.cpp — OpenAI-compatible server with multi-GPU tensor-split, MoE expert placement, measured flag tuning (AI Tune), hardware-matched HuggingFace downloads, and crash recovery. An Ollama alternative for multi-GPU rigs.

From raketenkater·Updated June 20, 2026·View on GitHub·

**Stop hand-writing `--tensor-split`, `-ot`, and KV-cache flags.** Point llm-server at a GGUF and it measures your GPUs, RAM, and PCIe topology, picks the backend (llama.cpp or the faster ik_llama.cpp fork), computes multi-GPU and MoE expert placement, and serves an OpenAI-compatible API — one command from file to endpoint. The project is written primarily in Go, distributed under the MIT License license, first published in 2026. Key topics include: cuda, gguf, golang, inference-server, llama-cpp.

Latest release: v3.0.0
June 11, 2026View Changelog →

llm-server

License
Release
Platform
Backends

Stop hand-writing --tensor-split, -ot, and KV-cache flags. Point llm-server
at a GGUF and it measures your GPUs, RAM, and PCIe topology, picks the backend
(llama.cpp or the faster ik_llama.cpp fork), computes multi-GPU and MoE expert
placement, and serves an OpenAI-compatible API — one command from file to endpoint.

bash
llm-server model.gguf # local GGUF → served llm-server unsloth/Qwen3.6-27B-GGUF --download # HF repo → hardware-matched quant → served llm-server # no args → interactive TUI

demo

Hardware-matched recommendations → one-command launch → benchmark, all from the same tool.

Just run llm-server with no arguments to open the full arrow-key TUI — browse and
download models, adjust settings, and launch, without writing a single flag. Pass a model
path or flags for one-shot CLI use instead.

Benchmarks

Same rig (RTX 3090 Ti 24GB + 4070 12GB + 3060 12GB), same GGUFs, 32k context,
decode tok/s (256-token generation), slowest backend on the left:

Model (quant)Ollama 0.30.8llama.cpp --fitllm-server v3v3 --ai-tunev3 vs Ollama
Qwen3.5-4B Q4_K_M124.8103.3176.6178.8+43%
Qwen3.6-27B Q5_K_M22.824.340.340.3+77%
Qwen3.5-122B-A10B UD-IQ4_XS (MoE)13.5†21.023.623.6+74%
MiniMax-M3 UD-IQ3_XXS (MoE)✗ won't load✗ won't load5.475.50Ollama can't load

† Ollama can't import sharded GGUFs (ollama#5245),
so the 122B was merged to one file before importing; MiniMax-M3 it can't load at all
(minimax-m3 is ik_llama-only). Where models load, llm-server is 43–77% faster than
Ollama — including +74% on the 122B MoE
at heavy VRAM+RAM offload (60 GB, ~18 GB spilled
to RAM). Driving the same llama.cpp master binary (no ik_llama), llm-server still beat raw
--fit — so the gain is the placement, not just the backend swap. Full methodology,
exact commands, and artifacts: docs/performance.md. Numbers are
reproducible with scripts/bench-v3-comparison.sh
regressions against these tables are treated as bugs.

Install

Linux / macOS — self-contained app home under ~/llm-server:

bash
curl -fsSL https://raw.githubusercontent.com/raketenkater/llm-server/main/setup.sh | bash

Windows (PowerShell); add -Backend cuda for native NVIDIA CUDA:

powershell
iwr -useb https://raw.githubusercontent.com/raketenkater/llm-server/main/install.ps1 | iex

From a clone:

bash
git clone https://github.com/raketenkater/llm-server.git && cd llm-server && ./setup.sh

Since v3, prebuilt release bundles
(Linux CPU/Vulkan, macOS arm64 Metal, Windows x86_64 CPU) install without compiling,
verified against SHA256SUMS; Linux CUDA/ik_llama.cpp builds from source for your GPU.
Run llm-server with no arguments to open the TUI. Installer options and the app-home
layout are in docs/install.md.

Quick start

bash
llm-server ~/models/model.gguf # launch a local model llm-server unsloth/Qwen3.6-27B-GGUF --download # download a fitting quant, then launch llm-server model.gguf --ai-tune # benchmark flag sets, cache the fastest llm-server model.gguf --dry-run # print the backend command without running llm-server model.gguf --benchmark # load, measure tok/s, exit

Common flags: --backend ik_llama|llama|vulkan, --gpus 0,1, --ctx-size,
--kv-quality, --kv-placement, --vision, --spec auto. Unknown flags pass straight
through to llama-server, so nothing upstream is out of reach. Full list:
docs/usage.md.

How it compares

vs raw llama.cpp. Upstream --fit auto-picks GPU layers, tensor-split, and context.
If that covers you, raw llama.cpp may be enough. llm-server goes further: it selects the
backend (ik_llama.cpp is meaningfully faster on CUDA), picks KV-cache type and batch sizes
from measured probes, benchmarks candidate flag sets (--ai-tune), finds/validates vision
projectors and speculative drafts, and recovers from crashes.

vs Ollama. Ollama wins on one-command simplicity and ecosystem on common hardware.
llm-server targets where Ollama's conservative heuristics leave performance behind:
mismatched multi-GPU rigs, MoE models split across VRAM/RAM, ik_llama.cpp speed, and full
flag access. One GPU and want zero config? Use Ollama.

vs llama-swap. llama-swap hot-swaps between model commands you write yourself;
llm-server computes those commands. They compose — point llama-swap at llm-server dry-run
output, or use llm-server daemon for single-model swapping.

Capabilityraw llama.cppllm-server
Multi-GPU / heterogeneous split--fit (recent)automatic, PCIe/bandwidth-weighted
MoE expert placement--fit / manual -otexact per-GPU ledger, backend-aware
Backend selection (ik_llama / llama / Vulkan)manualautomatic, dialect-aware
KV-cache type / batch sizingmanualprobe-measured
AI Tune (measured flag search)noyes, cached per model+hardware
Hardware-matched quant downloadnoyes (HF search + intelligence ranking)
Vision projector / speculative decodingmanualautomatic, validated
Crash recovery / backend fallbacknoyes

Features

  • One Go binary; Linux, macOS, and native Windows. CUDA / Vulkan / Metal / CPU.
  • Exact-ledger multi-GPU + MoE expert placement (--tensor-split + -ot from measured
    VRAM and GGUF sizes), with adaptive retry on out-of-memory.
  • AI Tune — benchmarks candidate flag sets and caches the fastest valid result per
    model + hardware; a community tune pool seeds first launches (LLM_COMMUNITY_TUNES=off).
  • Hugging Face downloader with hardware-aware quant selection and a GUI recommendation
    picker ranked by intelligence-per-fit.
  • Speculative decoding (MTP, EAGLE-3, validated draft GGUFs) and vision (mmproj) support.
  • OpenAI-compatible server, arrow-key TUI, crash recovery with backend fallback.

Backends

ik_llama.cpp (CUDA, source build) · llama.cpp (Vulkan, Metal, CPU) · native Windows CUDA
via install.ps1 -Backend cuda. The backend binary is pluggable via LLAMA_SERVER.

Requirements

  • Linux: curl, git, python3; cmake/compiler + NVIDIA CUDA toolkit for CUDA
    source builds; vulkaninfo for Vulkan detection.
  • macOS: Apple Silicon; Xcode command-line tools for source builds.
  • Windows: Windows 10/11 x86_64, PowerShell 5+, Python; CUDA Toolkit + VS C++ Build
    Tools for -Backend cuda.

Documentation

Install ·
Usage ·
Architecture ·
Performance ·
Speculative decoding ·
Model recommendations ·
Changelog

License

MIT

Contributors

Showing top 2 contributors by commit count.

View all contributors on GitHub →

This article is auto-generated from raketenkater/llm-server via the GitHub API.Last fetched: 6/21/2026