Oxibonsai
OxiBonsai is a zero-FFI, zero-C/C++ inference engine for PrismML's sub-2-bit Bonsai family — both the 1-bit line (Q1_0_g128) and the ternary line (TQ2_0_g128). It runs on CPU (SIMD), Apple Silicon (Metal), and NVIDIA (CUDA) without depending on llama.cpp, BLAS, or any C/Fortran runtime.
**Pure Rust Sub-2-Bit LLM Inference Engine for PrismML Bonsai Models** The project is written primarily in Rust, distributed under the Apache License 2.0 license, first published in 2026. Key topics include: bonsai, pure-rust, qwen3, rust, rust-crate.
OxiBonsai
(オキシ盆栽)
Pure Rust Sub-2-Bit LLM Inference Engine for PrismML Bonsai Models
OxiBonsai is a zero-FFI, zero-C/C++ inference engine for PrismML's sub-2-bit Bonsai family — both the 1-bit line (Q1_0_g128) and the ternary line (TQ2_0_g128). It runs on CPU (SIMD), Apple Silicon (Metal), and NVIDIA (CUDA) without depending on llama.cpp, BLAS, or any C/Fortran runtime. Built entirely on the COOLJAPAN ecosystem — SciRS2, OxiBLAS, OxiFFT — it delivers sovereign AI inference in Pure Rust.
To our knowledge, OxiBonsai is the first pure-Rust — C/C++/Fortran-free, zero-FFI — inference engine for the Bonsai 1-bit/ternary model family, and the first to bring its FLUX.2-Klein text-to-image (Bonsai-Image) to pure Rust, built entirely on the COOLJAPAN ecosystem.
Documentation
- CLI reference — every
oxibonsaiandoxibonsai-servesubcommand, flag, and environment variable. - Image-generation guide — end-to-end Bonsai-Image (FLUX.2-Klein) text-to-image walkthrough.
Status
Version 0.2.2 — 2026-06-08 · 4,671 tests passing · ~177k lines of Rust · Pure Rust
| Crate | Status | Tests |
|---|---|---|
| oxibonsai-core | Stable | 207 |
| oxibonsai-kernels | Stable | 675 |
| oxibonsai-model | Stable | 673 |
| oxibonsai-runtime | Stable | 796 |
| oxibonsai-tokenizer | Stable | 206 |
| oxibonsai-rag | Stable | 871 |
| oxibonsai-eval | Stable | 513 |
| oxibonsai-serve | Stable | 260 |
| oxibonsai-image | Stable | 72 |
| oxibonsai (facade) | Stable | 352 |
Features
Sub-2-Bit Native Inference
Two native quantization families, each with dedicated dequant / GEMV / full-forward kernels:
| Family | Encoding | Bits/weight | Block size | Example models |
|---|---|---|---|---|
| 1-bit | Q1_0_g128 | 1.0 | 128 weights, FP16 group scale | Bonsai-8B |
| Ternary | TQ2_0_g128 | ≈1.585 | 128 weights / 34 B, FP16 scale | Ternary-Bonsai-8B / 4B / 1.7B |
- Full Qwen3 architecture: multi-layer decoder, GQA, SwiGLU, RoPE, RMSNorm
{-1, 0, +1}ternary encoding:0b00→−1, 0b01→0, 0b10→+1, 0b11→0- Correctness gate: at
--temperature 0 --seed 42, CPU and Metal produce byte-identical output
Acceleration Tiers
| Tier | Target | Width / Device | Feature Flag |
|---|---|---|---|
| Reference | All platforms | Scalar | (default) |
| AVX2 + FMA | x86-64 | 256-bit | simd-avx2 |
| AVX-512 | x86-64 | 512-bit | simd-avx512 |
| NEON | AArch64 | 128-bit | simd-neon |
| Metal | Apple Silicon | GPU, fused full-forward | metal |
| CUDA (native) | NVIDIA GPU | GPU, NVRTC kernels | native-cuda |
| CUDA (scirs2) | NVIDIA GPU | GPU via scirs2-core | cuda |
Auto-detection via KernelDispatcher::auto_detect() selects the best CPU tier at runtime. GPU backends are opt-in at build time.
Note on CPU tiers: The CPU tier is chosen entirely at runtime via
is_x86_feature_detected!— the dispatcher picks AVX-512 only when AVX-512F+BW+VL are all present, otherwise AVX2+FMA, otherwise the scalar reference path. Each SIMD function carries a per-function#[target_feature(...)]attribute, so a single x86-64 binary is safe on every x86-64 CPU and automatically falls back (AVX-512 → AVX-2 → scalar) with no SIGILL. Thesimd-avx2/simd-avx512/simd-neonFeature Flags above are accepted for compatibility but do not gate tier selection — all tiers are always compiled in and chosen at runtime.AVX-512 has been absent from Intel consumer CPUs since Alder Lake (Raptor Lake, Meteor Lake, Arrow Lake and Lunar Lake have none); it mainly benefits Xeon / HEDT and AMD Zen 4+. On consumer hardware the AVX-2 tier is selected automatically.
There is currently no INT8 dot-product tier (AVX-VNNI
vpdpbusd/ NEON-UDOTvdotq_s32): the 1-bit and ternary kernels expand weights to ±scale and accumulate in FP32 FMA. An INT8 dot-product tier — which would require quantizing activations to INT8 — is a possible future enhancement.
Fused GPU Full-Forward Path
Both the 1-bit and ternary forward passes are encoded into a single GPU command buffer rather than one submission per GEMV. Per-layer dispatch sequence:
- Pre-attn RMSNorm
- Fused QKV GEMV (Q ‖ K ‖ V concatenated in weight SoA)
- Fused QK-norm + RoPE
- Fused KV-store
- Batched attention: scores V2 → softmax → weighted-sum
- Attn output GEMV + residual add
- FFN RMSNorm
- Gate + Up GEMV (gate ‖ up concatenated)
- Batched SwiGLU
- Down GEMV + residual add
= 14 dispatches/layer × N layers per command buffer. This is what unlocks the Metal and CUDA throughput numbers below.
Observability
- Structured logging via
tracingwith env-filter and JSON output - Inference metrics: tokens/sec, prefill/decode latency, request counts
- Health endpoint (
/health) with readiness checks - Circuit breaker for overload protection
- Per-request tracing IDs via
RequestId(RFC 4122 UUIDv4, nouuidcrate dependency) - Per-request rate metrics via
RequestRateTracker— TBT p50/p95, EWMA tokens/sec, queue-wait - Workload aggregator —
RequestRateAggregatorrolls per-request snapshots intooxibonsai_request_tokens_per_second,oxibonsai_inter_token_latency_p50/p95_seconds, andoxibonsai_queue_wait_secondsPrometheus gauges
Runtime Controllers (0.1.4)
Two adaptive controllers shipped in 0.1.4 let the runtime self-tune as the workload changes:
rustuse oxibonsai_runtime::{KvCachePolicy, AdaptiveLookahead, AdaptiveLookaheadConfig}; // KV cache policy: FP16 ↔ Q8 ↔ Q4 driven by EWMA pressure with hysteresis. let kv = KvCachePolicy::default(); let level = kv.observe(0.92); // → escalates to Q8 once smoothed pressure crosses 0.80 // Speculative-decoding draft length: continuously updated from acceptance EWMA. let mut k = AdaptiveLookahead::new(AdaptiveLookaheadConfig::default()); k.observe_step(5, 4); // proposed=5, accepted=4 → k drifts toward 5
A worked end-to-end example lives in examples/runtime_controllers.rs:
bashcargo run --example runtime_controllers
OpenAI-Compatible API
/v1/chat/completionsendpoint (POST)- Streaming SSE support for real-time token output
/v1/modelsendpoint- CORS and tower middleware
Builder Pattern API
rustuse oxibonsai_runtime::{EngineBuilder, SamplingPreset}; let engine = EngineBuilder::new() .model_path("models/Ternary-Bonsai-1.7B.gguf") .preset(SamplingPreset::Balanced) .max_seq_len(4096) .build()?;
Sampling Presets
| Preset | Temperature | Top-K | Top-P | Use Case |
|---|---|---|---|---|
| Greedy | 0.0 | 1 | 1.0 | Deterministic |
| Balanced | 0.7 | 40 | 0.9 | General |
| Creative | 1.0 | 100 | 0.95 | Creative writing |
| Code | 0.2 | 10 | 0.8 | Code generation |
Bonsai Model Family
OxiBonsai supports PrismML's full Bonsai lineup across both quantization families:
| Model | Arch | Params | Format | Size | Context |
|---|---|---|---|---|---|
| Bonsai-8B | Qwen3-8B | 8.19 B | Q1_0_g128 | 1.15 GB | 65,536 |
| Ternary-Bonsai-8B | Qwen3-8B | 8.19 B | TQ2_0_g128 | ~1.75 GB | 65,536 |
| Ternary-Bonsai-4B | Qwen3-4B | ~4 B | TQ2_0_g128 | ~900 MB | 65,536 |
| Ternary-Bonsai-1.7B | Qwen3-1.7B | ~1.7 B | TQ2_0_g128 | ~390 MB | 65,536 |
Ternary weights trade roughly +600 MB (at 8B scale) for ~5 additional benchmark points over the 1-bit line. All models share the same Qwen3 architecture (GQA, SwiGLU, RoPE, RMSNorm), so the runtime, tokenizer, and server are identical across the family.
Note: PrismML publishes Ternary Bonsai as unpacked safetensors. Use
scripts/download_ternary.sh(oroxibonsai convert --quant tq2_0_g128) to fetch and repack as GGUF before loading. Anonnx-communityONNX release (MatMulNBits bits=2) is also supported viaoxibonsai convert --onnx.
Installation
CLI (recommended for end users)
bashcargo install oxibonsai-cli
This installs the oxibonsai binary. Rust 1.86+ required.
Library (for Rust projects)
toml[dependencies] oxibonsai = "0.2.2"
Build from source (for development)
bashgit clone https://github.com/cool-japan/oxibonsai cd oxibonsai cargo build --release # binary at: target/release/oxibonsai
Configuration (.env)
The CLI auto-loads a .env file from the current directory (or any parent), so you can
omit the model/path flags. Precedence: --flag > shell env var > .env > built-in default.
bash# Fetch the template from GitHub … curl -fsSL https://raw.githubusercontent.com/cool-japan/oxibonsai/master/.env.example -o .env # … or, in a source checkout: cp .env.example .env # Edit .env to point at your model files $EDITOR .env
Keys:
| Key | Used by | Purpose |
|---|---|---|
OXI_MODEL | run / chat / serve / info | GGUF model path (omit --model) |
OXI_TOKENIZER | run / chat / serve | tokenizer.json/dir (optional) |
OXI_DIT_GGUF | image | FLUX.2 Klein ternary DiT GGUF |
OXI_VAE_WEIGHTS | image | VAE decoder weights dir |
OXI_TE_4BIT | image | 2.1 GB 4-bit MLX text-encoder model.safetensors |
OXI_TE_TOKENIZER_DIR | image | text-encoder tokenizer dir |
OXI_DIT_ATTN_GPU | image / repl | Enable Metal/CUDA DiT flash-attention (default: on for Metal) |
OXI_VAE_GPU | image / repl | Enable Metal/CUDA VAE decode (default: on for Metal) |
OXI_TE_GPU | image / repl | Enable GPU text-encoder (experimental; default off) |
With .env in place, the flags become optional:
bashoxibonsai run --prompt "Explain ternary quantization in one sentence." oxibonsai image --prompt "a tiny bonsai tree in a ceramic pot" --out bonsai.png
Quick Start
If you installed via
cargo install oxibonsai-cli, start from Step 2.
Theoxibonsaibinary is already on your PATH.
Step 1 — (source builds only) Build
bashcargo build --release export PATH="$PWD/target/release:$PATH"
Step 2 — Get a model
Pick one of the two families (or grab both):
bash# ── Option A: 1-bit Bonsai-8B (1.16 GB pre-quantized GGUF — single curl) ─ mkdir -p models curl -L -o models/Bonsai-8B.gguf \ https://huggingface.co/prism-ml/Bonsai-8B-gguf/resolve/main/Bonsai-8B.gguf # ── Option B: Ternary Bonsai (download safetensors + convert to GGUF) ──── # Fetches unpacked safetensors from HF and runs `oxibonsai convert` # to produce models/Ternary-Bonsai-<size>.gguf + models/tokenizer.json. ./scripts/download_ternary.sh 1.7b # also: 4b | 8b
Ternary prerequisite:
scripts/download_ternary.shuses the
HuggingFacehfCLI — install withpip install huggingface_hub.
Step 3 — Get the tokenizer
A tokenizer is required for all inference commands.
Option B above already downloads it automatically.
For Option A (or cargo install users):
bashoxibonsai tokenizer download # saves to models/tokenizer.json
The tokenizer is pulled from Qwen/Qwen3-8B on HuggingFace (~2.7 MB).
Use --output to save elsewhere, --repo to use a different HF repo.
Step 4 — Run inference
Tip: set
OXI_MODEL(and optionallyOXI_TOKENIZER) in.env
(see Configuration) to omit--model.
bash# 1-bit Bonsai-8B oxibonsai run --model models/Bonsai-8B.gguf \ --prompt "Explain quantum computing in simple terms" \ --max-tokens 512 --temperature 0.7 --top-p 0.9 # Ternary Bonsai (same CLI, different file) oxibonsai run --model models/Ternary-Bonsai-1.7B.gguf \ --prompt "Explain quantum computing in simple terms" \ --max-tokens 512 --temperature 0.7 --top-p 0.9 # Interactive chat, model info, server — all model-agnostic: oxibonsai chat --model models/Bonsai-8B.gguf oxibonsai info --model models/Ternary-Bonsai-1.7B.gguf oxibonsai serve --model models/Ternary-Bonsai-1.7B.gguf \ --host 127.0.0.1 --port 8080 # Interactive image REPL — loads DiT/VAE/TE once, renders many prompts oxibonsai repl --seed 42 --steps 4 --width 512 --height 512 # Convert safetensors → GGUF (HuggingFace unpacked safetensors dir) oxibonsai convert \ --from <unpacked-safetensors-dir> \ --to models/my-model.gguf \ --quant tq2_0_g128 # or q1_0_g128 # Convert ONNX → GGUF (MatMulNBits bits=2, e.g. onnx-community/Ternary-Bonsai-1.7B-ONNX) oxibonsai convert --onnx \ --from path/to/model.onnx \ --to models/my-model.gguf
CLI Smoke & Benchmark Scripts
Two parallel smoke tests — one per quantization family — plus a throughput benchmark and the ternary downloader.
| Script | Target model | Prerequisite | Purpose |
|---|---|---|---|
scripts/cli.sh [metal|cuda] | models/Bonsai-8B.gguf | curl one-liner in Quick Start | Build + end-to-end CLI test on 1-bit Bonsai-8B |
scripts/cli_ternary.sh [metal|cuda|cuda-scirs] | models/Ternary-Bonsai-1.7B.gguf (default; --model to override) | run scripts/download_ternary.sh first | Build + end-to-end CLI test on Ternary Bonsai with a tok/s summary line |
scripts/bench_ternary.sh | models/Ternary-Bonsai-1.7B.gguf | scripts/download_ternary.sh | CPU vs Metal throughput benchmark (averaged over N runs) |
scripts/download_ternary.sh [8b|4b|1.7b] | — | pip install huggingface_hub | Download Ternary Bonsai safetensors from HF and convert to GGUF |
Each CLI script:
- Builds a
--releasebinary with the requested feature flags - Runs inference (
oxibonsai run) - Prints model info (
oxibonsai info) and validates the GGUF (oxibonsai validate) - Reports the measured tok/s
bash# 1-bit flow (Bonsai-8B) ./scripts/cli.sh # CPU SIMD ./scripts/cli.sh metal # Metal GPU (macOS) ./scripts/cli.sh cuda # CUDA GPU (Linux/Windows) # Ternary flow — fetch + convert once, then run as many times as you like ./scripts/download_ternary.sh 1.7b ./scripts/cli_ternary.sh # CPU SIMD ./scripts/cli_ternary.sh metal # Metal GPU — fused TQ2 full-forward path ./scripts/cli_ternary.sh cuda # native CUDA backend ./scripts/bench_ternary.sh # CPU vs Metal, 3-run average + best
Measured Throughput
End-to-end decode, averaged over 3 runs. "fused full-forward" = single GPU command buffer per token.
| Model | Backend | Hardware | tok/s |
|---|---|---|---|
| Ternary-Bonsai-1.7B | Metal (fused TQ2) | Apple Silicon (M-series) | ~50 (best ~57) |
| Ternary-Bonsai-1.7B | CUDA (fused TQ2) | NVIDIA GPU | ~21.9 |
| Ternary-Bonsai-1.7B | CPU SIMD (NEON) | Apple Silicon | ~7–8 |
| Bonsai-8B | Metal (fused Q1) | Apple Silicon (M-series) | ~14.6 |
Numbers come from scripts/bench_ternary.sh / scripts/cli_ternary.sh. CPU baseline varies with thermal and background load; GPU numbers are the steady-state figures.
Configuration
OxiBonsai supports TOML configuration files with --config:
toml[model] path = "models/Ternary-Bonsai-1.7B.gguf" max_seq_len = 4096 [sampling] temperature = 0.7 top_k = 40 top_p = 0.9 repetition_penalty = 1.1 [server] host = "127.0.0.1" port = 8080 [observability] log_level = "info" json_logs = false
Crate Structure
oxibonsai/
├── crates/
│ ├── oxibonsai-core/ GGUF loader, tensor types, config, error types
│ ├── oxibonsai-kernels/ Q1 + TQ2 kernels (dequant, GEMV, GEMM, SIMD tiers,
│ │ tiled, parallel) + GPU backends:
│ │ gpu_backend/metal_* (Metal graph + fused
│ │ full-forward, Q1 & TQ2)
│ │ gpu_backend/cuda_* (native NVRTC kernels)
│ │ gpu_backend/scirs2_backend (scirs2-core CUDA/Metal)
│ ├── oxibonsai-tokenizer/ Pure Rust BPE tokenizer, vocabulary, ChatTemplate
│ ├── oxibonsai-model/ Qwen3 Transformer (GQA, SwiGLU, RoPE, RMSNorm,
│ │ paged KV-cache, Q1 + TQ2 weight loaders)
│ ├── oxibonsai-rag/ RAG pipeline (chunking, embedders, vector store)
│ ├── oxibonsai-runtime/ Inference engine, sampling, OpenAI-compatible server,
│ │ SSE streaming, metrics, health, circuit breaker
│ ├── oxibonsai-eval/ Evaluation harness (ROUGE, perplexity, MMLU)
│ └── oxibonsai-serve/ Standalone server binary
├── src/main.rs CLI entry point (run, chat, serve, info, benchmark,
│ convert, quantize, validate, image, repl)
├── src/cli/
│ ├── repl.rs `oxibonsai repl` — resident `ImageSession` (loads
│ │ DiT/VAE/TE once, renders many prompts); Kitty
│ │ graphics protocol inline display (Ghostty detection)
│ └── term.rs Terminal detection helpers (Ghostty / Kitty protocol)
├── benches/ Criterion kernel benchmarks
├── examples/ Usage examples
├── tests/ Integration + feature flag tests
└── scripts/ Publish, CLI smoke tests, ternary benchmarks
Examples
See the examples/ directory:
basic_inference.rs— Load a model and run single-shot inferencestreaming.rs— Server-sent event streamingcustom_sampling.rs— Custom sampling parameters and presets
bash# 1-bit cargo run --example basic_inference -- --model models/Bonsai-8B.gguf # Ternary cargo run --example basic_inference -- --model models/Ternary-Bonsai-1.7B.gguf
COOLJAPAN Ecosystem
OxiBonsai (Pure Rust sub-2-bit LLM inference — Q1 + TQ2, CPU + Metal + CUDA)
├── SciRS2 v0.4.x (tensor primitives, activation functions)
├── OxiBLAS v0.2.x (GEMM/GEMV + 1-bit/ternary compute kernels)
├── OxiFFT v0.2.x (optional RoPE acceleration)
└── NumRS2 v0.3.x (N-dimensional array backend)
All default-feature dependencies are Pure Rust — zero C/C++/Fortran, zero FFI. GPU backends (metal, native-cuda, cuda) are opt-in features that bring in vendor drivers.
Development Roadmap
| Phase | Description | Status |
|---|---|---|
| Phase 0 | Foundation (workspace, GGUF loader, metadata) | ✅ |
| Phase 1 | 1-Bit Kernels (dequant, GEMV, GEMM) | ✅ |
| Phase 2 | Transformer Engine (Qwen3-8B forward pass) | ✅ |
| Phase 3 | Inference Runtime (KV cache, sampling, CLI) | ✅ |
| Phase 4 | Production Hardening (SIMD, parallel, tests, observability) | ✅ |
| Phase 5 | Ecosystem Integration (SSE streaming, WASM, API, Bonsai family) | ✅ |
| Phase 6 | Advanced Infrastructure (Multi-GPU, CUDA/Metal, PagedAttention) | ✅ |
| Phase 7 | Production Features (model merging, flash decoding, RAG, eval) | ✅ |
| Phase 8 | Final Polish (K-quant, streaming GGUF, kernel tuning, tests) | ✅ |
| Phase 9 | Ternary Bonsai (TQ2_0_g128 kernels, model variants, GGUF surface, export) | ✅ |
| Phase 10 | Ternary CPU SIMD tiers (AVX2 / AVX-512 / NEON TQ2 GEMV) | ✅ |
| Phase 11 | Metal TQ2 GEMV + per-kernel dispatch | ✅ |
| Phase 12 | Native CUDA backend (NVRTC, fused Q1 + TQ2 full-forward) | ✅ |
| Phase 13.x | Fused Metal TQ2 full-forward (single command buffer, ~13× speedup on 1.7B) | ✅ |
| Phase 13.y | Ternary LM head on GPU — closes all 7 OutputWeight::Ternary guard sites (4 Metal + 3 CUDA); +5 tok/s on Metal | ✅ |
Sponsorship
OxiBonsai is developed and maintained by COOLJAPAN OU (Team Kitasan).
The COOLJAPAN Ecosystem represents one of the largest Pure Rust scientific computing efforts in existence — spanning 40+ projects, 500+ crates, and millions of lines of Rust code across scientific computing, machine learning, quantum computing, geospatial analysis, legal technology, multimedia processing, and more. Every line is written and maintained by a small dedicated team committed to a C/Fortran-free future for scientific software.
If you find OxiBonsai or any COOLJAPAN project useful, please consider sponsoring to support continued development.
https://github.com/sponsors/cool-japan
Your sponsorship helps us:
- Maintain and expand the COOLJAPAN ecosystem (40+ projects, 500+ crates)
- Keep the entire stack 100% Pure Rust — no C/Fortran/system library dependencies
- Develop production-grade alternatives to OpenCV, FFmpeg, SciPy, NumPy, scikit-learn, PyTorch, TensorFlow, GDAL, and more
- Provide long-term support, security updates, and documentation
- Fund research into novel Rust-native algorithms and optimizations
License
Apache License, Version 2.0
Copyright 2026 COOLJAPAN OU (Team KitaSan)
Contributors
Showing top 1 contributor by commit count.
