Argmax oss swift
On-device Speech AI for Apple Silicon
[Argmax](https://argmaxinc.com/blog) Open-Source SDK Swift is a collection of turn-key on-device inference frameworks: - **WhisperKit** for speech-to-text with OpenAI Whisper - **SpeakerKit** for speaker diarization with Pyannote - **TTSKit** for text-to-speech with Qwen-TTS The project is written primarily in Swift, distributed under the MIT License license, first published in 2024. It has gained significant community traction with 6,223 stars and 573 forks on GitHub. Key topics include: inference, ios, macos, pyannote, qwen3-tts.
Argmax Open-Source SDK
</div>Argmax Open-Source SDK Swift is a collection of turn-key on-device inference frameworks:
- WhisperKit for speech-to-text with OpenAI Whisper
- SpeakerKit for speaker diarization with Pyannote
- TTSKit for text-to-speech with Qwen-TTS
[!IMPORTANT]
Argmax Pro SDK supports additional models and advanced features such as:
- Real-time transcription with speakers
- Frontier accuracy for your use case with custom vocabulary
- Argmax Local Server for non-native apps
- Android support with Argmax Pro SDK Kotlin
Further resources:
Table of Contents
Installation
Swift Package Manager
WhisperKit, TTSKit, and SpeakerKit are separate library products in the same Swift package. Add the package once and pick the products you need. You can also use the ArgmaxOSS umbrella product to import everything at once.
Prerequisites
- macOS 14.0 or later.
- Xcode 16.0 or later.
Xcode Steps
- Open your Swift project in Xcode.
- Navigate to
File>Add Package Dependencies.... - Enter the package repository URL:
https://github.com/argmaxinc/argmax-oss-swift. - Choose the version range or specific version.
- When prompted to choose library products, select ArgmaxOSS (all kits), or individual kits: WhisperKit, TTSKit, SpeakerKit.
Package.swift
Add the package dependency:
swiftdependencies: [ .package(url: "https://github.com/argmaxinc/argmax-oss-swift.git", from: "0.9.0"), ],
Then add the products you need as target dependencies:
swift.target( name: "YourApp", dependencies: [ // Import everything at once: .product(name: "ArgmaxOSS", package: "argmax-oss-swift"), // Or pick individual kits: // .product(name: "WhisperKit", package: "argmax-oss-swift"), // speech-to-text // .product(name: "TTSKit", package: "argmax-oss-swift"), // text-to-speech // .product(name: "SpeakerKit", package: "argmax-oss-swift"), // speaker diarization ] ),
Homebrew
You can install the command line app using Homebrew by running the following command:
bashbrew install whisperkit-cli
WhisperKit
To get started with WhisperKit, you need to initialize it in your project.
Quick Example
This example demonstrates how to transcribe a local audio file:
swiftimport WhisperKit // Initialize WhisperKit with default settings Task { let pipe = try? await WhisperKit() let transcription = try? await pipe!.transcribe(audioPath: "path/to/your/audio.{wav,mp3,m4a,flac}")?.text print(transcription) }
Model Selection
[!NOTE]
Argmax recommendslarge-v3-v20240930_626MBfor maximum multilingual accuracy andtinyfor the fastest debugging workflow.
| Whisper Version | WhisperKit Variant | Description |
|---|---|---|
| Large v3 Turbo (compressed) | large-v3-v20240930_626MB | Recommended across iOS and macOS for maximum accuracy |
| Large v3 Turbo | large-v3-v20240930_turbo | Recommended on macOS for maximum speed and accuracy |
| Base (multilingual) | base | |
| Base (English-only) | base.en | |
| Small (Multilingual) | small | |
| Small (English-only) | small.en | |
| Tiny (Multilingual) | tiny | |
| Tiny (English-only) | tiny.en | Smallest size, lowest accuracy. Only recommended for development & debugging. |
WhisperKit automatically downloads the recommended model for the device if not specified. You can also select a specific model by passing in the model name:
swiftlet pipe = try? await WhisperKit(WhisperKitConfig(model: "large-v3-v20240930_626MB"))
This method also supports glob search, so you can use wildcards to select a model:
swiftlet pipe = try? await WhisperKit(WhisperKitConfig(model: "large-v3-v20240930_626MB"))
Note that the model search must return a single model from the source repo, otherwise an error will be thrown.
For a list of available models, see our HuggingFace repo.
Generating Models
WhisperKit also comes with the supporting repo whisperkittools which lets you create and deploy your own fine tuned versions of Whisper in CoreML format to HuggingFace. Once generated, they can be loaded by simply changing the repo name to the one used to upload the model:
swiftlet config = WhisperKitConfig(model: "large-v3-v20240930_626MB", modelRepo: "username/your-model-repo") let pipe = try? await WhisperKit(config)
Swift CLI
The Swift CLI allows for quick testing and debugging outside of an Xcode project. To install it, run the following:
bashgit clone https://github.com/argmaxinc/argmax-oss-swift.git cd argmax-oss-swift
Then, setup the environment and download your desired model.
bashmake setup make download-model MODEL=large-v3-v20240930_626MB
Note:
- This will download only the model specified by
MODEL(see what's available in our HuggingFace repo, where we use the prefixopenai_whisper-{MODEL}) - Before running
download-model, make sure git-lfs is installed
If you would like download all available models to your local folder, use this command instead:
bashmake download-models
You can then run them via the CLI with:
bashswift run argmax-cli transcribe --model-path "Models/whisperkit-coreml/openai_whisper-large-v3-v20240930_626MB" --audio-path "path/to/your/audio.{wav,mp3,m4a,flac}"
Which should print a transcription of the audio file. If you would like to stream the audio directly from a microphone, use:
bashswift run argmax-cli transcribe --model-path "Models/whisperkit-coreml/openai_whisper-large-v3-v20240930_626MB" --stream
Local Server
The Argmax CLI includes a local server that implements the OpenAI Audio API, allowing you to use existing OpenAI SDK clients or generate new ones. The server supports transcription and translation with output streaming capabilities (real-time transcription results as they're generated).
[!NOTE]
Argmax Pro Local Server provides a real-time streaming transcription with a WebSocket local server that is API-compatible with cloud-based providers such as Deepgram.
Building the Server
bash# Build with server support make build-local-server # Or manually with the build flag BUILD_ALL=1 swift build --product argmax-cli
Starting the Server
bash# Start server with default settings BUILD_ALL=1 swift run argmax-cli serve # Custom host and port BUILD_ALL=1 swift run argmax-cli serve --host 0.0.0.0 --port 8080 # With specific model and verbose logging BUILD_ALL=1 swift run argmax-cli serve --model tiny --verbose # See all configurable parameters BUILD_ALL=1 swift run argmax-cli serve --help
API Endpoints
- POST
/v1/audio/transcriptions- Transcribe audio to text - POST
/v1/audio/translations- Translate audio to English
Supported Parameters
| Parameter | Description | Default |
|---|---|---|
file | Audio file (wav, mp3, m4a, flac) | Required |
model | Model identifier | Server default |
language | Source language code | Auto-detect |
prompt | Text to guide transcription | None |
response_format | Output format (json, verbose_json) | verbose_json |
temperature | Sampling temperature (0.0-1.0) | 0.0 |
timestamp_granularities[] | Timing detail (word, segment) | segment |
stream | Enable streaming | false |
Client Examples
Python Client (OpenAI SDK)
bashcd Examples/ServeCLIClient/Python uv sync python whisperkit_client.py transcribe --file audio.wav --language en python whisperkit_client.py translate --file audio.wav
Quick Python example:
pythonfrom openai import OpenAI client = OpenAI(base_url="http://localhost:50060/v1") result = client.audio.transcriptions.create( file=open("audio.wav", "rb"), model="tiny" # Model parameter is required ) print(result.text)
Swift Client (Generated from OpenAPI Spec, see ServeCLIClient/Swift/updateClient.sh)
bashcd Examples/ServeCLIClient/Swift swift run whisperkit-client transcribe audio.wav --language en swift run whisperkit-client translate audio.wav
CurlClient (Shell Scripts)
bashcd Examples/ServeCLIClient/Curl chmod +x *.sh ./transcribe.sh audio.wav --language en ./translate.sh audio.wav --language es ./test.sh # Run comprehensive test suite
Generating the API Specification
The server's OpenAPI specification and code are generated from the official OpenAI API:
bash# Generate latest spec and server code make generate-server
Client Generation
You can generate clients for any language using the OpenAPI specification, for example:
bash# Generate Python client swift run swift-openapi-generator generate scripts/specs/localserver_openapi.yaml \ --output-directory python-client \ --mode client \ --mode types # Generate TypeScript client npx @openapitools/openapi-generator-cli generate \ -i scripts/specs/localserver_openapi.yaml \ -g typescript-fetch \ -o typescript-client
API Limitations
Compared to the official OpenAI API, the local server has these limitations:
- Response formats: Only
jsonandverbose_jsonsupported (no plain text, SRT, VTT formats) - Model selection: Client must launch server with desired model via
--modelflag
Fully Supported Features
The local server fully supports these OpenAI API features:
- Include parameters:
logprobsparameter for detailed token-level log probabilities - Streaming responses: Server-Sent Events (SSE) for real-time transcription
- Timestamp granularities: Both
wordandsegmentlevel timing - Language detection: Automatic language detection or manual specification
- Temperature control: Sampling temperature for transcription randomness
- Prompt text: Text guidance for transcription style and context
TTSKit
TTSKit is an on-device text-to-speech framework built on Core ML. It runs Qwen3-TTS models entirely on Apple silicon with real-time streaming playback, no server required.
- macOS 15.0 or later.
- iOS 18.0 or later.
Quick Example
This example demonstrates how to generate speech from text:
swiftimport TTSKit Task { let tts = try await TTSKit() let result = try await tts.generate(text: "Hello from TTSKit!") print("Generated \(result.audioDuration)s of audio at \(result.sampleRate)Hz") }
TTSKit() automatically downloads the default 0.6B model on first run. The tokenizer and CoreML models are loaded lazily on the first generate() call.
Model Selection
TTSKit ships two model sizes. You can select the model by passing a variant to TTSKitConfig:
swift// Fast, runs on all platforms (~1 GB download) let tts = try await TTSKit(TTSKitConfig(model: .qwen3TTS_0_6b)) // Higher quality, macOS only (~2.2 GB download, supports style instructions) let tts = try await TTSKit(TTSKitConfig(model: .qwen3TTS_1_7b))
Models are hosted on HuggingFace and cached locally after the first download.
Custom Voices
You can choose from 9 built-in voices and 10 languages:
swiftlet result = try await tts.generate( text: "こんにちは世界", speaker: .onoAnna, language: .japanese )
Voices: .ryan, .aiden, .onoAnna, .sohee, .eric, .dylan, .serena, .vivian, .uncleFu
Languages: .english, .chinese, .japanese, .korean, .german, .french, .russian, .portuguese, .spanish, .italian
Real-Time Streaming Playback
play streams audio to the device speakers frame-by-frame as it is generated:
swifttry await tts.play(text: "This starts playing before generation finishes.")
You can control how much audio is buffered before playback begins. The default .auto strategy measures the first generation step and pre-buffers just enough to avoid underruns:
swifttry await tts.play( text: "Long passage...", playbackStrategy: .auto )
Other strategies include .stream (immediate, no buffer), .buffered(seconds:) (fixed pre-buffer), and .generateFirst (generate all audio first, then play).
Generation Options
You can customize sampling, chunking, and concurrency via GenerationOptions:
swift// Defaults recommended by Qwen var options = GenerationOptions() options.temperature = 0.9 options.topK = 50 options.repetitionPenalty = 1.05 options.maxNewTokens = 245 // Long text is automatically split at sentence boundaries options.chunkingStrategy = .sentence options.concurrentWorkerCount = nil // nil = all chunks run concurrently with a good default for the device let result = try await tts.generate(text: longArticle, options: options)
Style Instructions (1.7B only)
The 1.7B model accepts a natural-language style instruction that controls prosody:
swiftvar options = GenerationOptions() options.instruction = "Speak slowly and warmly, like a storyteller." let result = try await tts.generate( text: "Once upon a time...", speaker: .ryan, options: options )
Saving Audio
Generated audio can be saved to WAV or M4A:
swiftlet result = try await tts.generate(text: "Save me!") let outputDir = FileManager.default.urls(for: .documentDirectory, in: .userDomainMask)[0] // Save as .wav or .m4a (AAC) try await AudioOutput.saveAudio(result.audio, toFolder: outputDir, filename: "output", format: .m4a)
Progress Callbacks
You can receive per-step audio during generation. Return false from the callback to cancel early:
swiftlet result = try await tts.generate(text: "Hello!") { progress in print("Audio chunk: \(progress.audio.count) samples") if let stepTime = progress.stepTime { print("First step took \(stepTime)s") } return true // return false to cancel }
Swift CLI
The TTS command is available through the argmax-cli tool. You can generate speech and optionally play it back in real time:
bashswift run argmax-cli tts --text "Hello from the command line" --play swift run argmax-cli tts --text "Save to file" --output-path output.wav swift run argmax-cli tts --text "日本語テスト" --speaker ono-anna --language japanese swift run argmax-cli tts --text-file article.txt --model 1.7b --instruction "Read cheerfully" swift run argmax-cli tts --help
Demo App
The TTSKitExample example app showcases real-time streaming, model management, waveform visualization, and generation history on macOS and iOS. See the TTSKitExample README for build instructions.
SpeakerKit
SpeakerKit is an on-device speaker diarization framework built on Core ML. It runs Pyannote v4 (community-1) on Apple silicon to label speakers in audio. Read the blog post for architecture details and benchmarks.
- macOS 13.0 or later.
- iOS 16.0 or later.
Quick Example
This example demonstrates how to diarize an audio file:
swiftimport SpeakerKit Task { let speakerKit = try await SpeakerKit() let audioArray = try AudioProcessor.loadAudioAsFloatArray(fromPath: "audio.wav") let result = try await speakerKit.diarize(audioArray: audioArray) print("Detected \(result.speakerCount) speakers") for segment in result.segments { print(segment) } }
SpeakerKit() uses PyannoteConfig() defaults, automatically downloading models from HuggingFace on first run. The segmenter and embedder CoreML models are loaded lazily (unless load is set on config) on the first diarize() call.
Diarization Options
You can control speaker detection via PyannoteDiarizationOptions:
swiftlet audioArray = try AudioProcessor.loadAudioAsFloatArray(fromPath: "audio.wav") let options = PyannoteDiarizationOptions( numberOfSpeakers: 2, // nil = automatic detection clusterDistanceThreshold: 0.6, // clustering threshold useExclusiveReconciliation: false // exclusive speaker assignment per frame ) let result = try await speakerKit.diarize(audioArray: audioArray, options: options)
For local models, skip the download step:
swiftlet config = PyannoteConfig(modelFolder: "/path/to/models") let speakerKit = try await SpeakerKit(config)
Combining with Transcription
SpeakerKit can merge diarization results with WhisperKit transcriptions to produce speaker-attributed segments:
swiftimport WhisperKit import SpeakerKit let whisperKit = try await WhisperKit() let speakerKit = try await SpeakerKit() let audioArray = try AudioProcessor.loadAudioAsFloatArray(fromPath: "audio.wav") let transcription = try await whisperKit.transcribe(audioArray: audioArray) let diarization = try await speakerKit.diarize(audioArray: audioArray) let speakerSegments = diarization.addSpeakerInfo(to: transcription) for group in speakerSegments { for segment in group { print("\(segment.speaker): \(segment.text)") } }
Two strategies are available for matching speakers to transcription:
.subsegment(default) -- splits segments at word gaps, then assigns speakers.segment-- assigns a speaker to each transcription segment as a whole
RTTM Output
Generate RTTM output:
swiftlet speakerKit = try await SpeakerKit() let audioArray = try AudioProcessor.loadAudioAsFloatArray(fromPath: "meeting.wav") let diarization = try await speakerKit.diarize(audioArray: audioArray) let rttmLines = SpeakerKit.generateRTTM(from: diarization, fileName: "meeting") for line in rttmLines { print(line) }
Swift CLI
The diarization commands are available through the argmax-cli tool:
bash# Standalone diarization swift run argmax-cli diarize --audio-path audio.wav --verbose # Save RTTM output swift run argmax-cli diarize --audio-path audio.wav --rttm-path output.rttm # Specify number of speakers swift run argmax-cli diarize --audio-path audio.wav --num-speakers 3 # Transcription with diarization swift run argmax-cli transcribe --audio-path audio.wav --diarization # See all options swift run argmax-cli diarize --help
Contributing & Roadmap
Our goal is to make this SDK better and better over time and we'd love your help! Just search the code for "TODO" for a variety of features that are yet to be built. Please refer to our contribution guidelines for submitting issues, pull requests, and coding standards, where we also have a public roadmap of features we are looking forward to building in the future.
External dependencies: Sources/ArgmaxCore/External/ contains a copy of swift-transformers (Hub and Tokenizers modules, v1.1.6) with Jinja-dependent code removed. When updating to a newer version, copy the fresh sources over that directory and re-apply the patches marked with // Argmax-modification: (grep -r "Argmax-modification:" Sources/ArgmaxCore/External/).
License
Argmax OSS is released under the MIT License. See LICENSE for more details.
This project incorporates third-party software under their own license terms. See NOTICES for attributions.
Citation
If you use this SDK for something cool or just find it useful, please drop us a note at info@argmaxinc.com!
If you use WhisperKit, SpeakerKit or TTSKit for academic work, please cite the project using the following BibTeX:
bibtex@misc{whisperkit-argmax, title = {Argmax OSS: On-device Speech AI with WhisperKit, SpeakerKit and TTSKit}, author = {Argmax, Inc.}, year = {2024}, URL = {https://github.com/argmaxinc/argmax-oss-swift} }
Contributors
Showing top 12 contributors by commit count.
