GitPedia

Parakeet.cpp

Ultra fast and portable Parakeet implementation for on-device inference in C++ using Axiom with MPS+Unified Memory

From Frikallo · Updated June 24, 2026

Fast speech recognition with NVIDIA's [Parakeet](https://huggingface.co/collections/nvidia/parakeet) models in pure C++. The project is written primarily in C++, distributed under the MIT License, and first published in 2026. Key topics include: asr, automatic-speech-recognition, axiom, nvidia, parakeet.

Built on axiom — a lightweight tensor library with automatic Metal GPU acceleration. No ONNX runtime, no Python runtime, no heavyweight dependencies. Just C++ and one tensor library that outruns PyTorch MPS.

~27ms encoder inference on Apple Silicon GPU for 10s audio (110M model) — 96x faster than CPU. FP16 support for ~2x memory reduction.

Supported Models

| Model | Class | Size | Type | Description |
|---|---|---|---|---|
| tdt-ctc-110m | ParakeetTDTCTC | 110M | Offline | English, dual CTC/TDT decoder heads |
| tdt-600m | ParakeetTDT | 600M | Offline | Multilingual, TDT decoder |
| eou-120m | ParakeetEOU | 120M | Streaming | English, RNNT with end-of-utterance detection |
| nemotron-600m | ParakeetNemotron | 600M | Streaming | Multilingual, configurable latency (80ms–1120ms) |
| sortformer | Sortformer | 117M | Streaming | Speaker diarization (up to 4 speakers) |
| diarized | DiarizedTranscriber | 110M+117M | Offline | ASR + diarization → speaker-attributed words |

All ASR models share the same audio pipeline: 16kHz mono WAV → 80-bin Mel spectrogram → FastConformer encoder.
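As a rough illustration of the pipeline above, the sketch below estimates how many frames a clip produces at each stage. The 16 kHz sample rate is stated here and the 8x subsampling factor appears in the Architecture section; the 10 ms hop (160 samples) is an assumption typical of Mel front ends rather than something this README states, and the function names are hypothetical.

```cpp
#include <cstddef>

// kSampleRate and kSubsampling come from this README.
// kHopSamples (10 ms at 16 kHz) is an ASSUMPTION made for illustration.
constexpr std::size_t kSampleRate = 16000;
constexpr std::size_t kHopSamples = 160;
constexpr std::size_t kSubsampling = 8;

// Approximate number of Mel frames for a clip (one frame per hop, no padding).
constexpr std::size_t mel_frames(std::size_t num_samples) {
    return num_samples / kHopSamples;
}

// Approximate number of encoder frames after 8x subsampling (rounded up).
constexpr std::size_t encoder_frames(std::size_t num_samples) {
    return (mel_frames(num_samples) + kSubsampling - 1) / kSubsampling;
}
```

Under these assumptions, a 10 s clip (160,000 samples) yields about 1,000 Mel frames and 125 encoder frames. Exact counts depend on padding and windowing details that are not covered here.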

Quick Start

```cpp
#include <iostream>
#include <parakeet/parakeet.hpp>

int main() {
    parakeet::Transcriber t("model.safetensors", "vocab.txt");
    t.to_gpu();   // optional: Metal acceleration
    t.to_half();  // optional: FP16 inference (~2x memory reduction)

    auto result = t.transcribe("audio.wav");
    std::cout << result.text << std::endl;
}
```

Features

  • Multiple decoders — CTC greedy, TDT greedy, CTC beam search, TDT beam search (switch at call site)
  • Word timestamps — Per-word start/end times and confidence scores on all decoders
  • Beam search + LM — CTC and TDT beam search with optional ARPA n-gram language model fusion
  • Phrase boosting — Context biasing via token-level trie for domain-specific vocabulary
  • Batch transcription — Multiple files in one batched encoder forward pass
  • VAD preprocessing — Silero VAD strips silence before ASR; timestamps auto-remapped
  • GPU acceleration — Metal via axiom's MPSGraph compiler (96x speedup on Apple Silicon)
  • FP16 inference — Half-precision weights and compute (~2x memory reduction)
  • Streaming — EOU and Nemotron models with chunked audio input
  • Speaker diarization — Sortformer (up to 4 speakers), combinable with ASR for speaker-attributed words
  • C API — Flat extern "C" FFI for Python, Swift, Go, Rust, and other languages
  • Multi-format audio — WAV, FLAC, MP3, OGG with automatic resampling

See examples/ for code demonstrating each feature.
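To make the phrase boosting idea concrete, here is a minimal sketch of a token-level trie used for context biasing. It is not parakeet.cpp's implementation: the `PhraseTrie` class, the `boosted_score` helper, and the additive bonus are all hypothetical, chosen only to show how a decoder could reward tokens that extend a boosted phrase.

```cpp
#include <unordered_map>
#include <vector>

// Hypothetical token-level trie for context biasing (illustration only).
class PhraseTrie {
public:
    PhraseTrie() : nodes_(1) {}  // node 0 is the root

    // Insert one phrase given as a sequence of token IDs.
    void insert(const std::vector<int>& tokens) {
        int node = 0;
        for (int tok : tokens) {
            int next = step(node, tok);
            if (next < 0) {
                nodes_.emplace_back();
                next = static_cast<int>(nodes_.size()) - 1;
                nodes_[node].children[tok] = next;
            }
            node = next;
        }
        nodes_[node].is_end = true;
    }

    // Follow `token` from `node`; returns the child index, or -1 if the
    // token does not continue any boosted phrase.
    int step(int node, int token) const {
        auto it = nodes_[node].children.find(token);
        return it == nodes_[node].children.end() ? -1 : it->second;
    }

    // True if a complete phrase ends at `node`.
    bool is_end(int node) const { return nodes_[node].is_end; }

private:
    struct Node {
        std::unordered_map<int, int> children;
        bool is_end = false;
    };
    std::vector<Node> nodes_;
};

// Add `bonus` to a token's log-probability when it extends a boosted phrase.
inline float boosted_score(const PhraseTrie& trie, int node, int token,
                           float log_prob, float bonus) {
    return trie.step(node, token) >= 0 ? log_prob + bonus : log_prob;
}
```

A decoder would keep one trie position per hypothesis, advance it with `step` after each emitted token, and reset to the root (node 0) when the token falls off the trie.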

Install

Prebuilt binaries are attached to each GitHub release for macOS arm64, macOS x86_64, and Linux x86_64. Download the tarball for your platform and extract:

```bash
tar -xzf parakeet-v0.1.0-macos-arm64.tar.gz
cd parakeet-v0.1.0-macos-arm64
# On macOS, clear the Gatekeeper quarantine attribute first:
xattr -dr com.apple.quarantine .
./bin/parakeet --help
```

The archive ships a self-contained bin/parakeet (and bin/example-server) plus lib/libaxiom with @rpath/$ORIGIN set so the binaries resolve their dependencies relative to the install dir — drop the directory anywhere. The C-API headers under include/parakeet/ are included for embedders.

Build from source

```bash
git clone --recursive https://github.com/frikallo/parakeet.cpp
cd parakeet.cpp
make build
make test
```

Requirements: C++20 (Clang 14+ or GCC 12+), CMake 3.20+, macOS 13+ for Metal GPU.

macOS: building requires the full Xcode install (not just Command Line Tools), because axiom compiles its Metal shaders with xcrun metal and xcrun metallib — those ship only with Xcode. If you just want to run parakeet, use the prebuilt tarball above; the .metallib is embedded into the shipped libaxiom.dylib and runs without any Xcode/CLT install on the user side.

Convert Weights

```bash
# Download from HuggingFace
huggingface-cli download nvidia/parakeet-tdt_ctc-110m --include "*.nemo" --local-dir .

# Convert to safetensors
pip install safetensors torch
python scripts/convert_nemo.py parakeet-tdt_ctc-110m.nemo -o model.safetensors
```

The converter supports all model types: 110m-tdt-ctc (default), 600m-tdt, eou-120m, nemotron-600m, sortformer.

```bash
python scripts/convert_nemo.py checkpoint.nemo -o model.safetensors --model 600m-tdt
```

Silero VAD weights:

```bash
python scripts/convert_silero_vad.py -o silero_vad_v5.safetensors
```

Examples

| Example | Description |
|---|---|
| basic | Simplest transcription (~20 lines) |
| timestamps | Word/token timestamps with confidence |
| beam-search | CTC/TDT beam search with optional ARPA LM |
| phrase-boost | Context biasing for domain vocabulary |
| batch | Batch transcription of multiple files |
| vad | Standalone VAD and ASR+VAD preprocessing |
| gpu | Metal GPU + FP16 with timing comparison |
| stream | EOU streaming transcription |
| nemotron | Nemotron streaming with latency modes |
| diarize | Sortformer speaker diarization |
| diarized-transcription | ASR + diarization combined |
| c-api | Pure C99 FFI usage |
| cli | Full CLI with all options |

Using as a Library

CMake find_package

After installing (make install or cmake --install build):

```cmake
find_package(Parakeet REQUIRED)
target_link_libraries(myapp PRIVATE Parakeet::parakeet)
```

CMake add_subdirectory

```cmake
add_subdirectory(third_party/parakeet.cpp)
target_link_libraries(myapp PRIVATE Parakeet::parakeet)
```

pkg-config

```bash
g++ -std=c++20 myapp.cpp $(pkg-config --cflags --libs parakeet) -o myapp
```

Architecture

Offline Models

Built on a shared FastConformer encoder (Conv2d 8x subsampling → N Conformer blocks with relative positional attention):

| Model | Class | Decoder | Use case |
|---|---|---|---|
| CTC | ParakeetCTC | Greedy argmax or beam search (+LM) | Fast, English-only |
| RNNT | ParakeetRNNT | Autoregressive LSTM | Streaming capable |
| TDT | ParakeetTDT | LSTM + duration prediction, greedy or beam search (+LM) | Better accuracy than RNNT |
| TDT-CTC | ParakeetTDTCTC | Both TDT and CTC heads | Switch decoder at inference |
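The "greedy argmax" CTC decoding listed above follows a well-known recipe: take the best token per frame, collapse consecutive repeats, and drop blanks. The sketch below shows that recipe on plain vectors; it is a generic illustration, not parakeet.cpp's decoder, and the function name and the `blank_id` parameter are assumptions.

```cpp
#include <cstddef>
#include <vector>

// Greedy CTC decoding: argmax per frame, collapse repeats, drop blanks.
// `logits` is laid out as [frames][vocab]; `blank_id` is the index of the
// CTC blank token (an assumption supplied by the caller).
std::vector<int> ctc_greedy_decode(
        const std::vector<std::vector<float>>& logits, int blank_id) {
    std::vector<int> tokens;
    int prev = -1;  // no token emitted yet
    for (const auto& frame : logits) {
        if (frame.empty()) continue;
        std::size_t best = 0;
        for (std::size_t v = 1; v < frame.size(); ++v) {
            if (frame[v] > frame[best]) best = v;
        }
        const int tok = static_cast<int>(best);
        // Emit only when the argmax changes and is not the blank.
        if (tok != prev && tok != blank_id) tokens.push_back(tok);
        prev = tok;
    }
    return tokens;
}
```

Because a blank between two identical tokens resets `prev`, a sequence such as `0 0 blank 0` decodes to two separate `0` tokens rather than one.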

Streaming Models

Built on a cache-aware streaming FastConformer encoder with causal convolutions and bounded-context attention:

| Model | Class | Decoder | Use case |
|---|---|---|---|
| EOU | ParakeetEOU | Streaming RNNT | End-of-utterance detection |
| Nemotron | ParakeetNemotron | Streaming TDT | Configurable latency streaming |

Diarization

| Model | Class | Architecture | Use case |
|---|---|---|---|
| Sortformer | Sortformer | NEST encoder → Transformer → sigmoid | Speaker diarization (up to 4 speakers) |
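Combining diarization with ASR means attaching a speaker to each timestamped word. One simple way to do this, shown below as a sketch, is to pick the speaker whose segments overlap the word the most. The `Word` and `SpeakerSegment` structs and the `assign_speaker` function are hypothetical and do not reflect the DiarizedTranscriber API; only the 4-speaker limit comes from this README.

```cpp
#include <algorithm>
#include <vector>

struct Word { double start, end; };                         // seconds
struct SpeakerSegment { double start, end; int speaker; };  // seconds, speaker index

// Return the speaker whose segments overlap the word the most,
// or -1 if no segment overlaps it at all.
int assign_speaker(const Word& w, const std::vector<SpeakerSegment>& segs) {
    constexpr int kMaxSpeakers = 4;  // Sortformer limit stated in this README
    double overlap[kMaxSpeakers] = {0.0, 0.0, 0.0, 0.0};
    for (const auto& s : segs) {
        if (s.speaker < 0 || s.speaker >= kMaxSpeakers) continue;
        const double o = std::min(w.end, s.end) - std::max(w.start, s.start);
        if (o > 0.0) overlap[s.speaker] += o;
    }
    int best = -1;
    double best_overlap = 0.0;
    for (int k = 0; k < kMaxSpeakers; ++k) {
        if (overlap[k] > best_overlap) {
            best_overlap = overlap[k];
            best = k;
        }
    }
    return best;
}
```

A word that straddles a speaker change goes to whichever speaker covers more of its duration; words that fall in silence are left unattributed.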

Benchmarks

Measured on Apple M3 16GB with simulated audio input (Tensor::randn). Times are per-encoder-forward-pass (Sortformer: full forward pass).

Encoder throughput — 10s audio:

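| Model | Params | CPU (ms) | GPU (ms) | GPU Speedup |
|---|---|---|---|---|
| 110m (TDT-CTC) | 110M | 2,581 | 27 | 96x |
| tdt-600m | 600M | 10,779 | 520 | 21x |
| rnnt-600m | 600M | 10,648 | 1,468 | 7x |
| sortformer | 117M | 3,195 | 479 | 7x |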

110m GPU scaling across audio lengths:

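| Audio | CPU (ms) | GPU (ms) | RTF | Throughput |
|---|---|---|---|---|
| 1s | 262 | 24 | 0.024 | 41x |
| 5s | 1,222 | 26 | 0.005 | 190x |
| 10s | 2,581 | 27 | 0.003 | 370x |
| 30s | 10,061 | 32 | 0.001 | 935x |
| 60s | 26,559 | 72 | 0.001 | 833x |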
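The RTF and Throughput columns follow the usual definitions: real-time factor is processing time divided by audio duration, and throughput is its reciprocal. The helper names below are hypothetical, and the sketch assumes GPU time in milliseconds and audio length in seconds, as in the table.

```cpp
// Real-time factor: processing time divided by audio duration (lower is faster).
constexpr double real_time_factor(double process_ms, double audio_s) {
    return process_ms / (audio_s * 1000.0);
}

// Throughput: seconds of audio processed per second of compute (1 / RTF).
constexpr double throughput(double process_ms, double audio_s) {
    return (audio_s * 1000.0) / process_ms;
}
```

For the 10s row, 27 ms over 10,000 ms of audio gives an RTF of about 0.0027 and a throughput of about 370x. Some rows differ slightly when recomputed this way, because the table shows rounded millisecond values.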

GPU acceleration is powered by axiom's Metal graph compiler, which fuses the full encoder into optimized MPSGraph operations.

```bash
make bench ARGS="--110m=models/model.safetensors --tdt-600m=models/tdt.safetensors"
```

Roadmap

Tier 1 — High Impact

  • Confidence scores — Per-token and per-word confidence from token log-probs
  • Phrase boosting — Token-level trie context biasing during decode
  • Beam search — CTC prefix beam search and TDT time-synchronous beam search
  • N-gram LM fusion — ARPA language models scored at word boundaries
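As an illustration of deriving confidence from token log-probs, the sketch below converts a log-probability back to a probability and aggregates a word's tokens with a geometric mean. The aggregation rule is an assumption made for this example (a minimum or arithmetic mean would be equally plausible), and the function names are hypothetical.

```cpp
#include <cmath>
#include <vector>

// Per-token confidence: the probability recovered from its log-probability.
inline double token_confidence(double log_prob) {
    return std::exp(log_prob);
}

// Per-word confidence as the geometric mean of its token probabilities,
// computed as exp(mean(log_probs)). The aggregation rule is an ASSUMPTION.
inline double word_confidence(const std::vector<double>& token_log_probs) {
    if (token_log_probs.empty()) return 0.0;
    double sum = 0.0;
    for (double lp : token_log_probs) sum += lp;
    return std::exp(sum / static_cast<double>(token_log_probs.size()));
}
```

Working in the log domain avoids underflow when a word spans many tokens with small probabilities.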

Audio & I/O

  • Multi-format audio — WAV, FLAC, MP3, OGG via dr_libs + stb_vorbis
  • Automatic resampling — Windowed sinc interpolation (Kaiser, 16-tap)
  • Load from memory: read_audio(bytes, len), float/int16 buffers
  • Audio duration query — Header-only duration without full decode
  • Progress callbacks — Stage reporting for long files
  • Streaming from raw PCM — Direct microphone buffer feeding
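Feeding int16 buffers or raw microphone PCM to a model usually starts with converting samples to floating point. The sketch below shows the common convention of dividing by 32768 so that samples land in [-1, 1). The function name is hypothetical, and the scaling is an assumption about typical practice rather than a statement of what parakeet.cpp does internally.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Convert signed 16-bit PCM samples to floats in [-1, 1).
// Dividing by 32768 is a common convention, assumed here for illustration.
std::vector<float> pcm16_to_float(const std::int16_t* samples, std::size_t count) {
    std::vector<float> out(count);
    for (std::size_t i = 0; i < count; ++i) {
        out[i] = static_cast<float>(samples[i]) / 32768.0f;
    }
    return out;
}
```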

Tier 2 — Production Readiness

  • Diarized transcription — ASR + Sortformer → speaker-attributed words
  • VAD — Silero VAD v5, standalone + ASR preprocessing
  • Batch inference — Padded multi-file encoder forward pass
  • Long-form chunking — Overlapping windows for audio >30s
  • Neural LM rescoring — N-best reranking with Transformer LM

Tier 3 — Ecosystem

  • C API — Flat C interface for FFI from any language
  • FP16 inference — Half-precision weights and compute
  • Model quantization — INT8/INT4 for mobile deployment
  • Hotword detection — Trigger phrase detection
  • Speaker embeddings — Speaker verification from Sortformer/TitaNet

Notes

  • Audio: 16kHz mono (WAV, FLAC, MP3, OGG — auto-detected and resampled)
  • Offline models have ~4-5 minute audio length limits; use streaming models for longer audio
  • GPU acceleration requires Apple Silicon with Metal support

License

MIT

This article is auto-generated from Frikallo/parakeet.cpp via the GitHub API. Last fetched: 6/26/2026