Oxibonsai

From cool-japan · Updated June 23, 2026

**Pure Rust Sub-2-Bit LLM Inference Engine for PrismML Bonsai Models** The project is written primarily in Rust, distributed under the Apache License 2.0, and was first published in 2026. Key topics include: bonsai, pure-rust, qwen3, rust, rust-crate.

Latest release: v0.2.2 (OxiBonsai 0.2.2 Release), June 8, 2026

OxiBonsai

(オキシ盆栽)

Pure Rust Sub-2-Bit LLM Inference Engine for PrismML Bonsai Models

OxiBonsai is a zero-FFI, zero-C/C++ inference engine for PrismML's sub-2-bit Bonsai family — both the 1-bit line (Q1_0_g128) and the ternary line (TQ2_0_g128). It runs on CPU (SIMD), Apple Silicon (Metal), and NVIDIA (CUDA) without depending on llama.cpp, BLAS, or any C/Fortran runtime. Built entirely on the COOLJAPAN ecosystem — SciRS2, OxiBLAS, OxiFFT — it delivers sovereign AI inference in Pure Rust.

To our knowledge, OxiBonsai is the first pure-Rust — C/C++/Fortran-free, zero-FFI — inference engine for the Bonsai 1-bit/ternary model family, and the first to bring its FLUX.2-Klein text-to-image (Bonsai-Image) to pure Rust, built entirely on the COOLJAPAN ecosystem.

Documentation

  • CLI reference — every oxibonsai and oxibonsai-serve subcommand, flag, and environment variable.
  • Image-generation guide — end-to-end Bonsai-Image (FLUX.2-Klein) text-to-image walkthrough.

Status

Version 0.2.2 — 2026-06-08 · 4,671 tests passing · ~177k lines of Rust · Pure Rust

| Crate | Status | Tests |
|---|---|---|
| oxibonsai-core | Stable | 207 |
| oxibonsai-kernels | Stable | 675 |
| oxibonsai-model | Stable | 673 |
| oxibonsai-runtime | Stable | 796 |
| oxibonsai-tokenizer | Stable | 206 |
| oxibonsai-rag | Stable | 871 |
| oxibonsai-eval | Stable | 513 |
| oxibonsai-serve | Stable | 260 |
| oxibonsai-image | Stable | 72 |
| oxibonsai (facade) | Stable | 352 |

Features

Sub-2-Bit Native Inference

Two native quantization families, each with dedicated dequant / GEMV / full-forward kernels:

| Family | Encoding | Bits/weight | Block size | Example models |
|---|---|---|---|---|
| 1-bit | Q1_0_g128 | 1.0 | 128 weights, FP16 group scale | Bonsai-8B |
| Ternary | TQ2_0_g128 | ≈1.585 | 128 weights / 34 B, FP16 scale | Ternary-Bonsai-8B / 4B / 1.7B |

  • Full Qwen3 architecture: multi-layer decoder, GQA, SwiGLU, RoPE, RMSNorm
  • {-1, 0, +1} ternary encoding: 0b00→−1, 0b01→0, 0b10→+1, 0b11→0
  • Correctness gate: at --temperature 0 --seed 42, CPU and Metal produce byte-identical output
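
The ternary code table above can be sketched as a small dequantization routine. This is only an illustration, not OxiBonsai's kernel: the names `decode_code` and `dequant_block` are hypothetical, the scale is passed as an already-converted `f32`, and the packing order (four codes per byte, least-significant pair first) is an assumption. The 34-byte block size is consistent with 128 two-bit codes (32 bytes) plus one FP16 scale (2 bytes).

```rust
/// Weights per block, and bytes of packed 2-bit codes (128 * 2 bits = 32 bytes).
const BLOCK_WEIGHTS: usize = 128;
const CODE_BYTES: usize = BLOCK_WEIGHTS / 4;

/// Map one 2-bit code to a ternary value, following the table in the text:
/// 0b00 -> -1, 0b01 -> 0, 0b10 -> +1, 0b11 -> 0.
fn decode_code(code: u8) -> i8 {
    match code & 0b11 {
        0b00 => -1,
        0b10 => 1,
        _ => 0, // 0b01 and 0b11 both decode to zero
    }
}

/// Dequantize one block of 128 weights.
/// ASSUMPTION: codes are packed least-significant pair first, and the
/// per-block FP16 scale has already been widened to f32 by the caller.
fn dequant_block(codes: &[u8; CODE_BYTES], scale: f32) -> [f32; BLOCK_WEIGHTS] {
    let mut out = [0.0f32; BLOCK_WEIGHTS];
    for (i, w) in out.iter_mut().enumerate() {
        let byte = codes[i / 4];
        let shift = (i % 4) * 2;
        *w = f32::from(decode_code(byte >> shift)) * scale;
    }
    out
}
```

A GEMV kernel would normally fuse this decode into the dot product rather than materialize the block, but the mapping from codes to values is the same.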

Acceleration Tiers

| Tier | Target | Width / Device | Feature Flag |
|---|---|---|---|
| Reference | All platforms | Scalar | (default) |
| AVX2 + FMA | x86-64 | 256-bit | simd-avx2 |
| AVX-512 | x86-64 | 512-bit | simd-avx512 |
| NEON | AArch64 | 128-bit | simd-neon |
| Metal | Apple Silicon | GPU, fused full-forward | metal |
| CUDA (native) | NVIDIA GPU | GPU, NVRTC kernels | native-cuda |
| CUDA (scirs2) | NVIDIA GPU | GPU via scirs2-core | cuda |

Auto-detection via KernelDispatcher::auto_detect() selects the best CPU tier at runtime. GPU backends are opt-in at build time.

Note on CPU tiers: The CPU tier is chosen entirely at runtime via is_x86_feature_detected!. The dispatcher picks AVX-512 only when AVX-512F, BW, and VL are all present, otherwise AVX2+FMA, otherwise the scalar reference path. Each SIMD function carries a per-function #[target_feature(...)] attribute, so a single x86-64 binary is safe on every x86-64 CPU and falls back automatically (AVX-512 → AVX2 → scalar) with no SIGILL. The simd-avx2 / simd-avx512 / simd-neon feature flags listed above are accepted for compatibility but do not gate tier selection: all tiers are always compiled in and chosen at runtime.
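
The runtime selection described in this note can be sketched with the standard library alone. This is a minimal illustration, not the actual `KernelDispatcher`: `CpuTier`, `detect_tier`, `dot_scalar`, `dot_avx2`, and `dot` are hypothetical names, and the "SIMD" function simply reuses the scalar loop under a `#[target_feature]` attribute instead of hand-written intrinsics.

```rust
/// Hypothetical tier enum mirroring the fallback order in the text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CpuTier {
    Scalar,
    Avx2Fma,
    Avx512,
}

/// Pick the widest tier the running CPU supports.
pub fn detect_tier() -> CpuTier {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx512f")
            && is_x86_feature_detected!("avx512bw")
            && is_x86_feature_detected!("avx512vl")
        {
            return CpuTier::Avx512;
        }
        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            return CpuTier::Avx2Fma;
        }
    }
    CpuTier::Scalar
}

/// Portable reference path.
fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Same loop, but AVX2/FMA code generation is enabled for this function only,
/// so the rest of the binary stays safe on older CPUs.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn dot_avx2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Dispatch once per call; a real dispatcher would likely cache the tier.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    match detect_tier() {
        // Illustration only: the AVX-512 tier reuses the AVX2 function here.
        #[cfg(target_arch = "x86_64")]
        CpuTier::Avx2Fma | CpuTier::Avx512 => unsafe { dot_avx2(a, b) },
        _ => dot_scalar(a, b),
    }
}
```

The `unsafe` call is sound only because the feature check runs first; calling `dot_avx2` on a CPU without AVX2 is what would raise SIGILL.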

AVX-512 has been absent from Intel consumer CPUs since Alder Lake (Raptor Lake, Meteor Lake, Arrow Lake and Lunar Lake have none); it mainly benefits Xeon / HEDT and AMD Zen 4+. On consumer hardware the AVX2 tier is selected automatically.

There is currently no INT8 dot-product tier (AVX-VNNI vpdpbusd / NEON SDOT vdotq_s32): the 1-bit and ternary kernels expand weights to ±scale and accumulate in FP32 FMA. An INT8 dot-product tier, which would require quantizing activations to INT8, is a possible future enhancement.

Fused GPU Full-Forward Path

Both the 1-bit and ternary forward passes are encoded into a single GPU command buffer rather than one submission per GEMV. Per-layer dispatch sequence:

  1. Pre-attn RMSNorm
  2. Fused QKV GEMV (Q ‖ K ‖ V concatenated in weight SoA)
  3. Fused QK-norm + RoPE
  4. Fused KV-store
  5. Batched attention: scores V2 → softmax → weighted-sum
  6. Attn output GEMV + residual add
  7. FFN RMSNorm
  8. Gate + Up GEMV (gate ‖ up concatenated)
  9. Batched SwiGLU
  10. Down GEMV + residual add

Total: 14 dispatches/layer × N layers per command buffer. This is what unlocks the Metal and CUDA throughput numbers below.

Observability

  • Structured logging via tracing with env-filter and JSON output
  • Inference metrics: tokens/sec, prefill/decode latency, request counts
  • Health endpoint (/health) with readiness checks
  • Circuit breaker for overload protection
  • Per-request tracing IDs via RequestId (RFC 4122 UUIDv4, no uuid crate dependency)
  • Per-request rate metrics via RequestRateTracker — TBT p50/p95, EWMA tokens/sec, queue-wait
  • Workload aggregator: RequestRateAggregator rolls per-request snapshots into oxibonsai_request_tokens_per_second, oxibonsai_inter_token_latency_p50/p95_seconds, and oxibonsai_queue_wait_seconds Prometheus gauges
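
Generating an RFC 4122 version-4 UUID without the uuid crate comes down to fixing two bit fields in 16 random bytes. The sketch below shows that step only; `format_uuid_v4` is a hypothetical helper, not the `RequestId` implementation, and the source of randomness is left to the caller.

```rust
/// Format 16 caller-supplied random bytes as an RFC 4122 version-4 UUID.
/// ASSUMPTION: the caller obtains `bytes` from a suitable random source.
fn format_uuid_v4(mut bytes: [u8; 16]) -> String {
    // High nibble of byte 6 carries the version (4).
    bytes[6] = (bytes[6] & 0x0f) | 0x40;
    // Top two bits of byte 8 carry the RFC 4122 variant (binary 10).
    bytes[8] = (bytes[8] & 0x3f) | 0x80;

    let hex: String = bytes.iter().map(|b| format!("{b:02x}")).collect();
    // Canonical 8-4-4-4-12 grouping.
    format!(
        "{}-{}-{}-{}-{}",
        &hex[0..8],
        &hex[8..12],
        &hex[12..16],
        &hex[16..20],
        &hex[20..32]
    )
}
```

Passing all-zero bytes makes the fixed bits visible: only the version and variant positions are non-zero in the result.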

Runtime Controllers (0.1.4)

Two adaptive controllers shipped in 0.1.4 let the runtime self-tune as the workload changes:

rust
use oxibonsai_runtime::{KvCachePolicy, AdaptiveLookahead, AdaptiveLookaheadConfig};

// KV cache policy: FP16 ↔ Q8 ↔ Q4 driven by EWMA pressure with hysteresis.
let kv = KvCachePolicy::default();
let level = kv.observe(0.92); // → escalates to Q8 once smoothed pressure crosses 0.80

// Speculative-decoding draft length: continuously updated from acceptance EWMA.
let mut k = AdaptiveLookahead::new(AdaptiveLookaheadConfig::default());
k.observe_step(5, 4); // proposed=5, accepted=4 → k drifts toward 5

A worked end-to-end example lives in examples/runtime_controllers.rs:

bash
cargo run --example runtime_controllers
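
For readers unfamiliar with the idea, EWMA smoothing with hysteresis can be sketched independently of the crate. The type below is hypothetical and is not `KvCachePolicy`: only the 0.80 escalation threshold comes from the text above, while the other thresholds and the smoothing factor are assumptions chosen for illustration.

```rust
/// KV-cache precision levels named in the text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum KvLevel {
    Fp16,
    Q8,
    Q4,
}

/// Hypothetical pressure-driven policy (illustration only).
struct PressurePolicy {
    alpha: f64,    // EWMA weight given to the newest sample
    smoothed: f64, // current smoothed pressure in [0, 1]
    level: KvLevel,
}

impl PressurePolicy {
    fn new(alpha: f64) -> Self {
        Self { alpha, smoothed: 0.0, level: KvLevel::Fp16 }
    }

    /// Feed one pressure sample and return the (possibly updated) level.
    fn observe(&mut self, pressure: f64) -> KvLevel {
        self.smoothed = self.alpha * pressure + (1.0 - self.alpha) * self.smoothed;
        let s = self.smoothed;
        self.level = match self.level {
            KvLevel::Fp16 if s > 0.80 => KvLevel::Q8, // threshold from the text
            KvLevel::Q8 if s > 0.95 => KvLevel::Q4,   // ASSUMED threshold
            KvLevel::Q8 if s < 0.70 => KvLevel::Fp16, // ASSUMED relax threshold
            KvLevel::Q4 if s < 0.85 => KvLevel::Q8,   // ASSUMED relax threshold
            unchanged => unchanged,
        };
        self.level
    }
}
```

The gap between each escalate and relax threshold is the hysteresis band: a pressure reading that hovers around 0.80 does not make the cache flip formats on every step.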

OpenAI-Compatible API

  • /v1/chat/completions endpoint (POST)
  • Streaming SSE support for real-time token output
  • /v1/models endpoint
  • CORS and tower middleware
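
On the client side, a streamed response arrives as server-sent events. The sketch below extracts the `data:` payloads from such a stream using only the standard library. It assumes the server follows the OpenAI convention of one JSON chunk per `data:` line and a final `data: [DONE]` sentinel; whether OxiBonsai emits that sentinel is an assumption here, and `sse_data_payloads` is a hypothetical helper.

```rust
/// Collect the payload of every `data:` line until the `[DONE]` sentinel.
/// ASSUMPTION: each event carries a single-line payload, as in the OpenAI
/// streaming format; multi-line SSE events are not handled.
fn sse_data_payloads(stream: &str) -> Vec<&str> {
    let mut payloads = Vec::new();
    for line in stream.lines() {
        // Skip blank separators, comments, and other SSE fields.
        let Some(rest) = line.strip_prefix("data:") else {
            continue;
        };
        let payload = rest.trim_start();
        if payload == "[DONE]" {
            break;
        }
        payloads.push(payload);
    }
    payloads
}
```

Each returned string would then be handed to a JSON parser to pull out the token delta.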

Builder Pattern API

rust
use oxibonsai_runtime::{EngineBuilder, SamplingPreset};

let engine = EngineBuilder::new()
    .model_path("models/Ternary-Bonsai-1.7B.gguf")
    .preset(SamplingPreset::Balanced)
    .max_seq_len(4096)
    .build()?;

Sampling Presets

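| Preset | Temperature | Top-K | Top-P | Use Case |
|---|---|---|---|---|
| Greedy | 0.0 | 1 | 1.0 | Deterministic |
| Balanced | 0.7 | 40 | 0.9 | General |
| Creative | 1.0 | 100 | 0.95 | Creative writing |
| Code | 0.2 | 10 | 0.8 | Code generation |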
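
To make the three preset parameters concrete, here is a generic sketch of how temperature, top-k, and top-p are commonly combined to narrow the candidate set before a token is drawn. It is not OxiBonsai's sampler: `candidates` is a hypothetical function, and treating a temperature of 0 as plain argmax is an assumption about how a greedy preset would behave.

```rust
/// Return the (token index, probability) pairs that survive filtering,
/// sorted from most to least likely and renormalized to sum to 1.
fn candidates(logits: &[f32], temperature: f32, top_k: usize, top_p: f32) -> Vec<(usize, f32)> {
    if logits.is_empty() {
        return Vec::new();
    }
    // Rank token indices by logit, highest first.
    let mut order: Vec<usize> = (0..logits.len()).collect();
    order.sort_by(|&a, &b| logits[b].total_cmp(&logits[a]));

    // ASSUMPTION: temperature 0 means greedy decoding (argmax only).
    if temperature <= 0.0 {
        return vec![(order[0], 1.0)];
    }

    // Top-k: keep only the k highest-ranked tokens.
    order.truncate(top_k.max(1));

    // Temperature-scaled softmax over the survivors (max-subtracted for stability).
    let max = logits[order[0]];
    let mut probs: Vec<(usize, f32)> = order
        .iter()
        .map(|&i| (i, ((logits[i] - max) / temperature).exp()))
        .collect();
    let sum: f32 = probs.iter().map(|p| p.1).sum();
    probs.iter_mut().for_each(|p| p.1 /= sum);

    // Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    let mut cumulative = 0.0;
    let mut keep = probs.len();
    for (n, p) in probs.iter().enumerate() {
        cumulative += p.1;
        if cumulative >= top_p {
            keep = n + 1;
            break;
        }
    }
    probs.truncate(keep);

    // Renormalize what is left.
    let sum: f32 = probs.iter().map(|p| p.1).sum();
    probs.iter_mut().for_each(|p| p.1 /= sum);
    probs
}
```

A sampler would then draw one index from the returned distribution using its seeded random number generator.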

Bonsai Model Family

OxiBonsai supports PrismML's full Bonsai lineup across both quantization families:

| Model | Arch | Params | Format | Size | Context |
|---|---|---|---|---|---|
| Bonsai-8B | Qwen3-8B | 8.19 B | Q1_0_g128 | 1.15 GB | 65,536 |
| Ternary-Bonsai-8B | Qwen3-8B | 8.19 B | TQ2_0_g128 | ~1.75 GB | 65,536 |
| Ternary-Bonsai-4B | Qwen3-4B | ~4 B | TQ2_0_g128 | ~900 MB | 65,536 |
| Ternary-Bonsai-1.7B | Qwen3-1.7B | ~1.7 B | TQ2_0_g128 | ~390 MB | 65,536 |

Ternary weights trade roughly +600 MB (at 8B scale) for ~5 additional benchmark points over the 1-bit line. All models share the same Qwen3 architecture (GQA, SwiGLU, RoPE, RMSNorm), so the runtime, tokenizer, and server are identical across the family.

Note: PrismML publishes Ternary Bonsai as unpacked safetensors. Use scripts/download_ternary.sh (or oxibonsai convert --quant tq2_0_g128) to fetch and repack as GGUF before loading. An onnx-community ONNX release (MatMulNBits bits=2) is also supported via oxibonsai convert --onnx.

Installation

bash
cargo install oxibonsai-cli

This installs the oxibonsai binary. Rust 1.86+ required.

Library (for Rust projects)

toml
[dependencies]
oxibonsai = "0.2.2"

Build from source (for development)

bash
git clone https://github.com/cool-japan/oxibonsai
cd oxibonsai
cargo build --release
# binary at: target/release/oxibonsai

Configuration (.env)

The CLI auto-loads a .env file from the current directory (or any parent), so you can
omit the model/path flags. Precedence: --flag > shell env var > .env > built-in default.
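
The precedence rule can be expressed as a chain of fallbacks. The function below is a hypothetical sketch of that rule, not the CLI's actual code; each argument stands for the value found at one layer, or `None` if that layer did not set it.

```rust
/// Resolve one setting using the documented order:
/// --flag > shell env var > .env > built-in default.
fn resolve_setting(
    flag: Option<&str>,
    shell_env: Option<&str>,
    dotenv: Option<&str>,
    default: &str,
) -> String {
    flag.or(shell_env).or(dotenv).unwrap_or(default).to_string()
}
```

For example, a model path given with --model wins even when OXI_MODEL is also set in the shell or in .env.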

bash
# Fetch the template from GitHub …
curl -fsSL https://raw.githubusercontent.com/cool-japan/oxibonsai/master/.env.example -o .env
# … or, in a source checkout:
cp .env.example .env
# Edit .env to point at your model files
$EDITOR .env

Keys:

| Key | Used by | Purpose |
|---|---|---|
| OXI_MODEL | run / chat / serve / info | GGUF model path (omit --model) |
| OXI_TOKENIZER | run / chat / serve | tokenizer.json/dir (optional) |
| OXI_DIT_GGUF | image | FLUX.2 Klein ternary DiT GGUF |
| OXI_VAE_WEIGHTS | image | VAE decoder weights dir |
| OXI_TE_4BIT | image | 2.1 GB 4-bit MLX text-encoder model.safetensors |
| OXI_TE_TOKENIZER_DIR | image | text-encoder tokenizer dir |
| OXI_DIT_ATTN_GPU | image / repl | Enable Metal/CUDA DiT flash-attention (default: on for Metal) |
| OXI_VAE_GPU | image / repl | Enable Metal/CUDA VAE decode (default: on for Metal) |
| OXI_TE_GPU | image / repl | Enable GPU text-encoder (experimental; default off) |

With .env in place, the flags become optional:

bash
oxibonsai run --prompt "Explain ternary quantization in one sentence."
oxibonsai image --prompt "a tiny bonsai tree in a ceramic pot" --out bonsai.png

Quick Start

If you installed via cargo install oxibonsai-cli, start from Step 2.
The oxibonsai binary is already on your PATH.

Step 1 — (source builds only) Build

bash
cargo build --release
export PATH="$PWD/target/release:$PATH"

Step 2 — Get a model

Pick one of the two families (or grab both):

bash
# ── Option A: 1-bit Bonsai-8B (1.16 GB pre-quantized GGUF, single curl) ─
mkdir -p models
curl -L -o models/Bonsai-8B.gguf \
  https://huggingface.co/prism-ml/Bonsai-8B-gguf/resolve/main/Bonsai-8B.gguf

# ── Option B: Ternary Bonsai (download safetensors + convert to GGUF) ────
# Fetches unpacked safetensors from HF and runs `oxibonsai convert`
# to produce models/Ternary-Bonsai-<size>.gguf + models/tokenizer.json.
./scripts/download_ternary.sh 1.7b   # also: 4b | 8b

Ternary prerequisite: scripts/download_ternary.sh uses the
HuggingFace hf CLI — install with pip install huggingface_hub.

Step 3 — Get the tokenizer

A tokenizer is required for all inference commands.
Option B above already downloads it automatically.
For Option A (or cargo install users):

bash
oxibonsai tokenizer download # saves to models/tokenizer.json

The tokenizer is pulled from Qwen/Qwen3-8B on HuggingFace (~2.7 MB).
Use --output to save elsewhere, --repo to use a different HF repo.

Step 4 — Run inference

Tip: set OXI_MODEL (and optionally OXI_TOKENIZER) in .env
(see Configuration) to omit --model.

bash
# 1-bit Bonsai-8B
oxibonsai run --model models/Bonsai-8B.gguf \
  --prompt "Explain quantum computing in simple terms" \
  --max-tokens 512 --temperature 0.7 --top-p 0.9

# Ternary Bonsai (same CLI, different file)
oxibonsai run --model models/Ternary-Bonsai-1.7B.gguf \
  --prompt "Explain quantum computing in simple terms" \
  --max-tokens 512 --temperature 0.7 --top-p 0.9

# Interactive chat, model info, server (all model-agnostic):
oxibonsai chat --model models/Bonsai-8B.gguf
oxibonsai info --model models/Ternary-Bonsai-1.7B.gguf
oxibonsai serve --model models/Ternary-Bonsai-1.7B.gguf \
  --host 127.0.0.1 --port 8080

# Interactive image REPL: loads DiT/VAE/TE once, renders many prompts
oxibonsai repl --seed 42 --steps 4 --width 512 --height 512

# Convert safetensors → GGUF (HuggingFace unpacked safetensors dir)
oxibonsai convert \
  --from <unpacked-safetensors-dir> \
  --to models/my-model.gguf \
  --quant tq2_0_g128   # or q1_0_g128

# Convert ONNX → GGUF (MatMulNBits bits=2, e.g. onnx-community/Ternary-Bonsai-1.7B-ONNX)
oxibonsai convert --onnx \
  --from path/to/model.onnx \
  --to models/my-model.gguf

CLI Smoke & Benchmark Scripts

Two parallel smoke tests — one per quantization family — plus a throughput benchmark and the ternary downloader.

| Script | Target model | Prerequisite | Purpose |
|---|---|---|---|
| scripts/cli.sh [metal\|cuda] | models/Bonsai-8B.gguf | curl one-liner in Quick Start | Build + end-to-end CLI test on 1-bit Bonsai-8B |
| scripts/cli_ternary.sh [metal\|cuda\|cuda-scirs] | models/Ternary-Bonsai-1.7B.gguf (default; --model to override) | run scripts/download_ternary.sh first | Build + end-to-end CLI test on Ternary Bonsai with a tok/s summary line |
| scripts/bench_ternary.sh | models/Ternary-Bonsai-1.7B.gguf | scripts/download_ternary.sh | CPU vs Metal throughput benchmark (averaged over N runs) |
| scripts/download_ternary.sh [8b\|4b\|1.7b] | | pip install huggingface_hub | Download Ternary Bonsai safetensors from HF and convert to GGUF |

Each CLI script:

  1. Builds a --release binary with the requested feature flags
  2. Runs inference (oxibonsai run)
  3. Prints model info (oxibonsai info) and validates the GGUF (oxibonsai validate)
  4. Reports the measured tok/s
bash
# 1-bit flow (Bonsai-8B)
./scripts/cli.sh           # CPU SIMD
./scripts/cli.sh metal     # Metal GPU (macOS)
./scripts/cli.sh cuda      # CUDA GPU (Linux/Windows)

# Ternary flow: fetch + convert once, then run as many times as you like
./scripts/download_ternary.sh 1.7b
./scripts/cli_ternary.sh           # CPU SIMD
./scripts/cli_ternary.sh metal     # Metal GPU, fused TQ2 full-forward path
./scripts/cli_ternary.sh cuda      # native CUDA backend
./scripts/bench_ternary.sh         # CPU vs Metal, 3-run average + best

Measured Throughput

End-to-end decode, averaged over 3 runs. "fused full-forward" = single GPU command buffer per token.

| Model | Backend | Hardware | tok/s |
|---|---|---|---|
| Ternary-Bonsai-1.7B | Metal (fused TQ2) | Apple Silicon (M-series) | ~50 (best ~57) |
| Ternary-Bonsai-1.7B | CUDA (fused TQ2) | NVIDIA GPU | ~21.9 |
| Ternary-Bonsai-1.7B | CPU SIMD (NEON) | Apple Silicon | ~7–8 |
| Bonsai-8B | Metal (fused Q1) | Apple Silicon (M-series) | ~14.6 |

Numbers come from scripts/bench_ternary.sh / scripts/cli_ternary.sh. CPU baseline varies with thermal and background load; GPU numbers are the steady-state figures.

Configuration

OxiBonsai supports TOML configuration files with --config:

toml
[model]
path = "models/Ternary-Bonsai-1.7B.gguf"
max_seq_len = 4096

[sampling]
temperature = 0.7
top_k = 40
top_p = 0.9
repetition_penalty = 1.1

[server]
host = "127.0.0.1"
port = 8080

[observability]
log_level = "info"
json_logs = false

Crate Structure

oxibonsai/
├── crates/
│   ├── oxibonsai-core/        GGUF loader, tensor types, config, error types
│   ├── oxibonsai-kernels/     Q1 + TQ2 kernels (dequant, GEMV, GEMM, SIMD tiers,
│   │                          tiled, parallel) + GPU backends:
│   │                            gpu_backend/metal_*       (Metal graph + fused
│   │                                                       full-forward, Q1 & TQ2)
│   │                            gpu_backend/cuda_*        (native NVRTC kernels)
│   │                            gpu_backend/scirs2_backend (scirs2-core CUDA/Metal)
│   ├── oxibonsai-tokenizer/   Pure Rust BPE tokenizer, vocabulary, ChatTemplate
│   ├── oxibonsai-model/       Qwen3 Transformer (GQA, SwiGLU, RoPE, RMSNorm,
│   │                          paged KV-cache, Q1 + TQ2 weight loaders)
│   ├── oxibonsai-rag/         RAG pipeline (chunking, embedders, vector store)
│   ├── oxibonsai-runtime/     Inference engine, sampling, OpenAI-compatible server,
│   │                          SSE streaming, metrics, health, circuit breaker
│   ├── oxibonsai-eval/        Evaluation harness (ROUGE, perplexity, MMLU)
│   └── oxibonsai-serve/       Standalone server binary
├── src/main.rs                CLI entry point (run, chat, serve, info, benchmark,
│                              convert, quantize, validate, image, repl)
├── src/cli/
│   ├── repl.rs                `oxibonsai repl` — resident `ImageSession` (loads
│   │                          DiT/VAE/TE once, renders many prompts); Kitty
│   │                          graphics protocol inline display (Ghostty detection)
│   └── term.rs                Terminal detection helpers (Ghostty / Kitty protocol)
├── benches/                   Criterion kernel benchmarks
├── examples/                  Usage examples
├── tests/                     Integration + feature flag tests
└── scripts/                   Publish, CLI smoke tests, ternary benchmarks

Examples

See the examples/ directory:

  • basic_inference.rs — Load a model and run single-shot inference
  • streaming.rs — Server-sent event streaming
  • custom_sampling.rs — Custom sampling parameters and presets
bash
# 1-bit
cargo run --example basic_inference -- --model models/Bonsai-8B.gguf

# Ternary
cargo run --example basic_inference -- --model models/Ternary-Bonsai-1.7B.gguf

COOLJAPAN Ecosystem

OxiBonsai (Pure Rust sub-2-bit LLM inference — Q1 + TQ2, CPU + Metal + CUDA)
  ├── SciRS2 v0.4.x     (tensor primitives, activation functions)
  ├── OxiBLAS v0.2.x    (GEMM/GEMV + 1-bit/ternary compute kernels)
  ├── OxiFFT v0.2.x     (optional RoPE acceleration)
  └── NumRS2 v0.3.x     (N-dimensional array backend)

All default-feature dependencies are Pure Rust — zero C/C++/Fortran, zero FFI. GPU backends (metal, native-cuda, cuda) are opt-in features that bring in vendor drivers.

Development Roadmap

| Phase | Description | Status |
|---|---|---|
| Phase 0 | Foundation (workspace, GGUF loader, metadata) | |
| Phase 1 | 1-Bit Kernels (dequant, GEMV, GEMM) | |
| Phase 2 | Transformer Engine (Qwen3-8B forward pass) | |
| Phase 3 | Inference Runtime (KV cache, sampling, CLI) | |
| Phase 4 | Production Hardening (SIMD, parallel, tests, observability) | |
| Phase 5 | Ecosystem Integration (SSE streaming, WASM, API, Bonsai family) | |
| Phase 6 | Advanced Infrastructure (Multi-GPU, CUDA/Metal, PagedAttention) | |
| Phase 7 | Production Features (model merging, flash decoding, RAG, eval) | |
| Phase 8 | Final Polish (K-quant, streaming GGUF, kernel tuning, tests) | |
| Phase 9 | Ternary Bonsai (TQ2_0_g128 kernels, model variants, GGUF surface, export) | |
| Phase 10 | Ternary CPU SIMD tiers (AVX2 / AVX-512 / NEON TQ2 GEMV) | |
| Phase 11 | Metal TQ2 GEMV + per-kernel dispatch | |
| Phase 12 | Native CUDA backend (NVRTC, fused Q1 + TQ2 full-forward) | |
| Phase 13.x | Fused Metal TQ2 full-forward (single command buffer, ~13× speedup on 1.7B) | |
| Phase 13.y | Ternary LM head on GPU: closes all 7 OutputWeight::Ternary guard sites (4 Metal + 3 CUDA); +5 tok/s on Metal | |

Sponsorship

OxiBonsai is developed and maintained by COOLJAPAN OU (Team KitaSan).

The COOLJAPAN Ecosystem represents one of the largest Pure Rust scientific computing efforts in existence — spanning 40+ projects, 500+ crates, and millions of lines of Rust code across scientific computing, machine learning, quantum computing, geospatial analysis, legal technology, multimedia processing, and more. Every line is written and maintained by a small dedicated team committed to a C/Fortran-free future for scientific software.

If you find OxiBonsai or any COOLJAPAN project useful, please consider sponsoring to support continued development.

Sponsor

https://github.com/sponsors/cool-japan

Your sponsorship helps us:

  • Maintain and expand the COOLJAPAN ecosystem (40+ projects, 500+ crates)
  • Keep the entire stack 100% Pure Rust — no C/Fortran/system library dependencies
  • Develop production-grade alternatives to OpenCV, FFmpeg, SciPy, NumPy, scikit-learn, PyTorch, TensorFlow, GDAL, and more
  • Provide long-term support, security updates, and documentation
  • Fund research into novel Rust-native algorithms and optimizations

License

Apache License, Version 2.0

Copyright 2026 COOLJAPAN OU (Team KitaSan)

This article is auto-generated from cool-japan/oxibonsai via the GitHub API. Last fetched: 6/27/2026