GitPedia

TheWhisper

Optimized Whisper models for streaming and on-device use

From TheStageAI · Updated June 15, 2026 · View on GitHub

This repository aims to share and develop the most efficient speech-to-text and text-to-speech inference solution, with a strong focus on self-hosting, cloud hosting, and on-device inference across multiple devices. The project is written primarily in Python, is distributed under the MIT License, and was first published in 2025. Key topics include: apple-silicon, coreml, mlx, nvidia-gpu, on-device-ai.

TheWhisper: High-Performance Speech-to-Text

License: MIT
Hugging Face
NVIDIA
Apple Silicon

<img width="1420" height="939" alt="Frame 339234 (2)" src="https://github.com/user-attachments/assets/e4549998-9d83-4980-bf53-cd21d40e9bce" />

🚀 Overview

This repository aims to share and develop the most efficient speech-to-text and text-to-speech inference solution, with a strong focus on self-hosting, cloud hosting, and on-device inference across multiple devices.

For the first release, this repository provides open-source transcription models with streaming inference support, along with:

  • Hugging Face open weights for Whisper models with a flexible chunk size (the original models use a fixed 30 s window)
  • High-performance TheStage AI inference engines for NVIDIA GPUs, reaching 220 tok/s on an L40S for the whisper-large-v3 model
  • CoreML engines for macOS / Apple Silicon, which the project describes as having the lowest power consumption available on macOS
  • A local REST API with frontend examples using JS and Electron
  • An Electron demo app built by TheStage AI (certified by Apple): TheNotes for macOS
  • A tutorial on building a local note-taking app for macOS using Electron and TheWhisper

https://github.com/user-attachments/assets/f4d3fe7b-e2c5-42ff-a5d0-fef6afd11684

It is optimized for low latency, low power usage, and scalable streaming transcription, which makes it a good fit for real-time captioning, live meetings, voice interfaces, and edge deployments.


✨ Features

  • Open-weight, fine-tuned versions of Whisper models
  • Fine-tuned models support inference with 10 s, 15 s, 20 s, and 30 s chunk lengths
  • CoreML engines for macOS and Apple Silicon, with ~2 W of power consumption and ~2 GB of RAM usage
  • Optimized engines for NVIDIA GPUs through TheStage AI ElasticModels (free for small orgs)
  • Streaming implementation (NVIDIA + macOS)
  • Benchmarks: latency, memory, power, and ASR accuracy (OpenASR)
  • Simple Python API, plus examples and a deployment tutorial for a macOS desktop app with Electron and ReactJS
<img width="1547" height="877" alt="apple m2 whisper (4)" src="https://github.com/user-attachments/assets/9404cdc0-b120-4ba1-9c65-4d42089ba623" />
<img width="1547" height="877" alt="nvidia l40s (2)" src="https://cdn.thestage.ai/production/cms_file_upload/1770235593-c873f699-07af-497b-ac77-2f5b08e3f767/NVIDIA, H100 (2).png" />

For comprehensive performance and quality benchmarks see benchmark/.


📦 Quick start

Clone the repository

bash
git clone https://github.com/TheStageAI/TheWhisper.git
cd TheWhisper

Install for Apple

bash
# quotes keep zsh (the macOS default shell) from treating [apple] as a glob
pip install '.[apple]'

Install for Nvidia

bash
pip install '.[nvidia]'

Install for Nvidia with TheStage AI optimized engines

bash
pip install 'thestage-elastic-models[nvidia]==0.1.7' \
  --index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple \
  --extra-index-url https://pypi.nvidia.com \
  --extra-index-url https://pypi.org/simple
pip install '.[nvidia]'
pip install thestage

Install for Jetson-Thor with TheStage AI optimized engines

Make sure tensorrt==10.13.3.9 is installed on your Jetson, then run:

bash
pip install 'thestage-elastic-models[thor]==0.1.7' \
  --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-jetson-thor/simple \
  -i https://pypi.jetson-ai-lab.io/sbsa/cu130/+simple/ \
  --extra-index-url https://pypi.org
pip install .
pip install thestage

Then generate an access token in your profile on the TheStage AI Platform and run the following command:

bash
thestage config set -t <YOUR_API_TOKEN>

🏗️ Support Matrix and System Requirements

| Feature | whisper-large-v3 (Nvidia) | whisper-large-v3 (Apple) | whisper-large-v3-turbo (Nvidia) | whisper-large-v3-turbo (Apple) |
|---|---|---|---|---|
| Streaming | ✅ | ✅ | ✅ | ✅ |
| Accelerated | ✅ | ✅ | ✅ | ✅ |
| Word Timestamps | ✅ | ✅ | ✅ | ✅ |
| Multilingual | ✅ | ✅ | ✅ | ✅ |
| 10s Chunk Mode | ✅ | ✅ | ✅ | ✅ |
| 15s Chunk Mode | ✅ | ✅ | ✅ | ✅ |
| 20s Chunk Mode | ✅ | ✅ | ✅ | ✅ |
| 30s Chunk Mode | ✅ | ✅ | ✅ | ✅ |

Nvidia GPU Requirements

  • Supported GPUs: RTX 4090, RTX 5090, L40S, H100, A100, Jetson-Thor
  • Operating System: Ubuntu 20.04+
  • Minimum RAM: 2.5 GB (5 GB recommended for large-v3 model)
  • CUDA Version: 11.8 or higher
  • Driver Version: 520.0 or higher
  • Python version: 3.10-3.12

Apple Silicon Requirements

  • Supported Chipsets: M1, M1 Pro, M1 Max, M1 Ultra, M2, M2 Pro, M2 Max, M2 Ultra, M3, M3 Pro, M3 Max, M4, M4 Pro, M4 Max
  • Operating System: macOS 15.0 (Sequoia) or later, iOS 18.0 or later
  • Minimum RAM: 2 GB (4 GB recommended for large-v3 model)
  • Python version: 3.10-3.12
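
The two requirement lists above can be turned into a quick preflight check. The sketch below is not part of TheWhisper: the function names are hypothetical, and it only encodes the minimums stated above (Python 3.10-3.12, CUDA 11.8, driver 520.0, macOS 15.0). How you obtain the CUDA and driver version strings (for example from `nvidia-smi`) is left to the caller.

```python
import platform
import sys

# Minimums copied from the requirement lists above.
PYTHON_RANGE = ((3, 10), (3, 12))
MIN_CUDA = (11, 8)
MIN_DRIVER = (520, 0)
MIN_MACOS = (15, 0)


def parse_version(text):
    """Turn a dotted version string such as '12.4' into a tuple of ints."""
    parts = [int(part) for part in text.split(".") if part.isdigit()]
    while len(parts) < 2:  # pad so that '520' compares like '520.0'
        parts.append(0)
    return tuple(parts)


def python_ok(version=None):
    """True if the interpreter's (major, minor) lies within 3.10-3.12."""
    major_minor = tuple((version or sys.version_info)[:2])
    return PYTHON_RANGE[0] <= major_minor <= PYTHON_RANGE[1]


def nvidia_ok(cuda_version, driver_version):
    """Compare reported CUDA and driver versions against the minimums."""
    return (parse_version(cuda_version) >= MIN_CUDA
            and parse_version(driver_version) >= MIN_DRIVER)


def apple_ok(machine=None, macos_version=None):
    """True on an arm64 Mac running at least the minimum macOS version."""
    machine = machine or platform.machine()
    macos_version = macos_version or platform.mac_ver()[0]
    return machine == "arm64" and parse_version(macos_version) >= MIN_MACOS


if __name__ == "__main__":
    print("Python supported:", python_ok())
    print("Apple Silicon target:", apple_ok())
```

The checks are pure functions over version strings, so they can run in CI or an installer script without importing any GPU libraries.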

▶️ Usage and Deployment

Apple Usage

python
import torch
from thestage_speechkit.apple import ASRPipeline

model = ASRPipeline(
    model='TheStageAI/thewhisper-large-v3-turbo',
    # optimized model with ANNA
    model_size='S',
    chunk_length_s=10
)

# inference
result = model(
    "path_to_your_audio.wav",
    return_timestamps="word"
)

print(result["text"])
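
With `return_timestamps="word"`, the result presumably carries per-word timing. The sketch below assumes the output follows the Hugging Face Transformers ASR pipeline convention, where `result["chunks"]` is a list of `{"text": ..., "timestamp": (start, end)}` entries. Whether `thestage_speechkit` returns exactly this structure is an assumption, and `group_words` is a hypothetical helper. It groups words into caption lines of bounded duration.

```python
def group_words(chunks, max_line_s=3.0):
    """Group word-level chunks into caption lines.

    `chunks` is assumed to look like Hugging Face ASR pipeline output:
    [{"text": " hello", "timestamp": (0.0, 0.4)}, ...]
    Returns a list of (start, end, text) tuples.
    """
    lines = []
    words, line_start, line_end = [], None, None
    for chunk in chunks:
        start, end = chunk["timestamp"]
        if start is None or end is None:
            continue  # skip words without usable timing
        # Close the current line if this word would stretch it past the limit.
        if words and end - line_start > max_line_s:
            lines.append((line_start, line_end, " ".join(words)))
            words, line_start = [], None
        if line_start is None:
            line_start = start
        words.append(chunk["text"].strip())
        line_end = end
    if words:
        lines.append((line_start, line_end, " ".join(words)))
    return lines


# Hand-written sample in the assumed format (not real model output).
sample = [
    {"text": " Hello", "timestamp": (0.0, 0.4)},
    {"text": " world", "timestamp": (0.5, 0.9)},
    {"text": " again", "timestamp": (3.2, 3.6)},
]
for start, end, text in group_words(sample):
    print(f"[{start:5.2f} - {end:5.2f}] {text}")
```

With real output you would pass `result["chunks"]` instead of `sample`, after confirming that the key exists in your installed version.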

Apple Usage with Streaming

python
from thestage_speechkit.streaming import StreamingPipeline, MicStream, FileStream, StdoutStream

streaming_pipe = StreamingPipeline(
    model='TheStageAI/thewhisper-large-v3-turbo',
    # Optimized model by ANNA
    model_size='S',
    # Window length
    chunk_length_s=10,
    platform='apple',
    language='en'
)

# step size (stride) in seconds: 0.5 s = 500 ms
mic_stream = MicStream(step_size_s=0.5)
output_stream = StdoutStream()

while True:
    chunk = mic_stream.next_chunk()
    if chunk is not None:
        approved_text, assumption = streaming_pipe(chunk)
        output_stream.write(approved_text, assumption)
    else:
        break
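
The streaming loop returns two values per step. Judging by the names alone, `approved_text` looks like newly finalized text, and `assumption` like a tentative hypothesis for audio still inside the window that may change on the next step. This reading is an assumption, not documented behavior. Under it, a consumer would append the former and overwrite the latter on every step. `TranscriptBuffer` is a hypothetical helper, and the sample steps are hand-written stand-ins for pipeline output.

```python
class TranscriptBuffer:
    """Keeps finalized text and the latest tentative hypothesis separately."""

    def __init__(self):
        self.final_parts = []  # text that is assumed not to change any more
        self.tentative = ""    # replaced on every step

    def update(self, approved_text, assumption):
        if approved_text:
            self.final_parts.append(approved_text.strip())
        self.tentative = (assumption or "").strip()

    def render(self):
        """Finalized text followed by the tentative tail in brackets."""
        final = " ".join(self.final_parts)
        if self.tentative:
            return f"{final} [{self.tentative}]".strip()
        return final


# Hand-written (approved_text, assumption) pairs, not real pipeline output.
steps = [("", "hello"), ("", "hello wor"), ("hello world", "how"), ("how are you", "")]
buffer = TranscriptBuffer()
for approved, assumption in steps:
    buffer.update(approved, assumption)
    print(buffer.render())
```

Keeping the two parts separate lets a UI render finalized text normally and the tentative tail in a lighter style, which is a common pattern for live captions.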

Nvidia Usage (HuggingFace Transformers)

python
import torch
from thestage_speechkit.nvidia import ASRPipeline

model = ASRPipeline(
    model='TheStageAI/thewhisper-large-v3-turbo',
    # allowed: 10s, 15s, 20s, 30s
    chunk_length_s=10,
    batch_size=32,
    device='cuda'
)

# inference
result = model(
    "path_to_your_audio.wav",
    chunk_length_s=10,
    generate_kwargs={'do_sample': False, 'use_cache': True}
)

print(result["text"])

Nvidia Usage (TheStage AI engines)

python
import torch
from thestage_speechkit.nvidia import ASRPipeline

model = ASRPipeline(
    model='TheStageAI/thewhisper-large-v3-turbo',
    # allowed: 10s, 15s, 20s, 30s
    chunk_length_s=10,
    # optimized TheStage AI engines
    model_size='S',
    batch_size=32,
    device='cuda'
)

# inference
result = model(
    "path_to_your_audio.wav",
    chunk_length_s=10,
    generate_kwargs={'do_sample': False, 'use_cache': True}
)

print(result["text"])
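
Both NVIDIA examples pass `chunk_length_s` and `batch_size`, with 10, 15, 20, and 30 seconds listed as the allowed chunk lengths. As a rough illustration of what these two settings mean for a long recording, the sketch below splits a duration into consecutive windows and counts the batches they would fill. `plan_windows` and `batches_needed` are hypothetical helpers. They use plain non-overlapping windows, and the library's actual chunking (overlap, VAD-based cuts) may differ.

```python
import math

ALLOWED_CHUNK_LENGTHS_S = (10, 15, 20, 30)  # values listed in the examples above


def plan_windows(duration_s, chunk_length_s=10):
    """Return non-overlapping (start, end) windows covering [0, duration_s]."""
    if chunk_length_s not in ALLOWED_CHUNK_LENGTHS_S:
        raise ValueError(f"chunk_length_s must be one of {ALLOWED_CHUNK_LENGTHS_S}")
    duration_s = float(duration_s)
    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_length_s, duration_s)
        windows.append((start, end))
        start = end
    return windows


def batches_needed(num_windows, batch_size=32):
    """Number of forward passes if windows are processed batch_size at a time."""
    return math.ceil(num_windows / batch_size)


windows = plan_windows(3600, chunk_length_s=10)  # a one-hour recording
print(len(windows), "windows,", batches_needed(len(windows)), "batches")
```

Shorter chunks produce more windows per recording but less audio per window, so the best setting depends on whether latency or throughput matters more for the deployment.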

💻 Build On-Device Desktop Application for Apple

You can build a macOS desktop app with real-time transcription. Find a simple ReactJS application here: Link to React Frontend
You can also download our app built using this backend here: TheNotes for macOS


📊 Benchmarks

TheWhisper is a fine-tuned Whisper model that can process audio chunks of any size up to 30 seconds. Unlike the original Whisper models, it doesn't require padding audio with silence to reach 30 seconds. For quality benchmarks, we used the multilingual Open ASR Leaderboard benchmarks.
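
The padding point can be made concrete with a little arithmetic. Assuming a short clip is padded up to the window it is processed in, and taking the 10/15/20/30 s chunk modes from the support matrix, the sketch below compares the silence added for a fixed 30 s window with that for the smallest window that fits. The helper names are hypothetical, and this is an illustration of the idea, not a measurement.

```python
CHUNK_MODES_S = (10, 15, 20, 30)  # chunk modes listed in the support matrix


def smallest_fitting_mode(clip_s):
    """Smallest chunk mode that holds the clip (clips over 30 s are out of scope)."""
    for mode in CHUNK_MODES_S:
        if clip_s <= mode:
            return mode
    raise ValueError("clip is longer than the largest chunk mode")


def padding_fraction(clip_s, window_s):
    """Share of the window that would be silence if the clip were padded to it."""
    return (window_s - clip_s) / window_s


clip_s = 8.0
fixed = padding_fraction(clip_s, 30)
fitted = padding_fraction(clip_s, smallest_fitting_mode(clip_s))
print(f"30 s window: {fixed:.0%} padding; fitted window: {fitted:.0%} padding")
```

For an 8 s clip, 22 of the 30 seconds in a fixed window would be silence, against 2 of 10 seconds in the smallest fitting mode.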

For comprehensive quality and performance benchmarks, including comparisons with other Whisper inference solutions, please refer to the benchmark/ directory.

<img width="1547" height="531" alt="vanilla whisper (1)" src="https://github.com/user-attachments/assets/f0c86e58-d834-4ac7-a06b-df3a7ae3e9e9" /> <img width="1547" height="458" alt="TheStage AI Whisper (1)" src="https://github.com/user-attachments/assets/17fb45a3-b33d-4c83-b843-69b0f0aa3f65" /> <img alt="Open ASR Leaderboard Benchmark" src="https://cdn.thestage.ai/production/cms_file_upload/1770139173-58663708-4644-44a7-8225-763c33a2c95b/SOTA on Multilingual Open ASR benchmark.png" /> <img alt="Multilingual Benchmark" src="https://cdn.thestage.ai/production/cms_file_upload/1770139254-33d3c626-158f-42ec-a7f2-3a4bbe44b382/Open ASR Leaderboard Benchmark.png" />

🏢 Enterprise License Summary

To obtain a commercial license for using TheStage AI optimized engines on a larger number of GPUs, please contact us here: Service request

| Platform | Engine Type | Status | License |
|---|---|---|---|
| NVIDIA GPUs (CUDA) | PyTorch HF Transformers | ✅ Stable | Free |
| macOS / Apple Silicon | CoreML Engine + MLX | ✅ Stable | Free |
| NVIDIA GPUs (CUDA) | TheStage AI (Optimized) | ✅ Stable | Free ≤ 4 GPUs/year for small orgs |

🧭 Development Status

✅ OpenASR WER benchmark for multiple chunk sizes

✅ Performance benchmark for NVIDIA

✅ Support for L40S, H100, RTX 4090, RTX 5090

✅ Timestamp support on Nvidia

✅ Nvidia Jetson support

☐ Streaming containers for Nvidia

☐ Ready-to-go containers for inference on Nvidia GPUs with an OpenAI-compatible API

☐ Speaker diarization and speaker identification


🙌 Acknowledgements

  • Silero VAD: Used for voice activity detection in thestage_speechkit/vad.py. See @snakers4.
  • OpenAI Whisper: Original Whisper model and pretrained checkpoints. See @openai.
  • Hugging Face Transformers: Model, tokenizer, and inference utilities. See @transformers.
  • MLX community: MLX Whisper implementation for Apple Silicon. See @mlx-explore.


This article is auto-generated from TheStageAI/TheWhisper via the GitHub API. Last fetched: 6/20/2026