FunASR
Industrial-grade speech recognition toolkit: 170x realtime, 50+ languages, speaker diarization, emotion detection, streaming, and OpenAI-compatible API.
([简体中文](./README_zh.md)|English|[日本語](./README_ja.md)|[한국어](./README_ko.md)) The project is written primarily in Python, distributed under the MIT License, and was first published in 2022. It has 16,628 stars and 1,715 forks on GitHub. Key topics include asr, audio, chinese, emotion-recognition, and mcp-server.
Quick Start
No local setup? Open the Colab quickstart to transcribe a public sample or upload your own audio in a browser.
```bash
pip install torch torchaudio
pip install funasr
```

```python
from funasr import AutoModel

model = AutoModel(model="iic/SenseVoiceSmall", vad_model="fsmn-vad", spk_model="cam++", device="cuda")
result = model.generate(input="meeting.wav")
```
Output — structured text with speaker labels, timestamps, and punctuation:
[00:00.4 → 00:03.8] Speaker 0: Let's discuss the Q3 plan.
[00:04.2 → 00:07.1] Speaker 1: Sounds good. I have three points.
[00:07.5 → 00:12.3] Speaker 0: Go ahead. We have 30 minutes.
That's it. One model, one call — VAD segmentation, speech recognition, punctuation, speaker diarization all happen automatically.
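The transcript layout shown above can be rebuilt from per-segment records. The sketch below is a minimal illustration, assuming each segment is a dict with `start` and `end` in milliseconds, a `spk` index, and `text`; that record shape and the helper names are assumptions for illustration, so check the actual structure returned by `model.generate` before relying on them.

```python
from typing import Dict, Iterable, List


def format_ts(ms: int) -> str:
    """Render milliseconds as MM:SS.t (tenths of a second)."""
    tenths = round(ms / 100)
    minutes, rem = divmod(tenths, 600)
    return f"{minutes:02d}:{rem // 10:02d}.{rem % 10}"


def format_segments(segments: Iterable[Dict]) -> List[str]:
    """Turn hypothetical segment dicts into '[start → end] Speaker N: text' lines."""
    return [
        f"[{format_ts(seg['start'])} → {format_ts(seg['end'])}] "
        f"Speaker {seg['spk']}: {seg['text']}"
        for seg in segments
    ]


# Hypothetical records mirroring the first two lines of the sample output.
example = [
    {"start": 400, "end": 3800, "spk": 0, "text": "Let's discuss the Q3 plan."},
    {"start": 4200, "end": 7100, "spk": 1, "text": "Sounds good. I have three points."},
]
for line in format_segments(example):
    print(line)
```

Working in integer tenths of a second avoids float formatting edge cases such as a value just under a minute being rendered as `60.0` seconds.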
LLM-powered ASR: Fun-ASR-Nano
For highest accuracy across 31 languages (including Chinese dialects), use Fun-ASR-Nano, an LLM-based ASR model that combines a SenseVoice encoder with a Qwen3-0.6B decoder:

```python
from funasr import AutoModel

model = AutoModel(model="FunAudioLLM/Fun-ASR-Nano-2512", vad_model="fsmn-vad", device="cuda")
result = model.generate(input="meeting.wav")
```
With vLLM acceleration (16x faster, batch processing):
```python
from funasr.auto.auto_model_vllm import AutoModelVLLM

model = AutoModelVLLM(model="FunAudioLLM/Fun-ASR-Nano-2512", tensor_parallel_size=1)
results = model.generate(["audio1.wav", "audio2.wav"], language="auto")
```
Deploy as API server:
```bash
funasr-server --device cuda
```

→ OpenAI-compatible endpoint at localhost:8000

Use with AI agents: MCP Server for Claude/Cursor · OpenAI API for LangChain/Dify/AutoGen
Why FunASR?
| | FunASR | Whisper | Cloud APIs |
|---|---|---|---|
| Speed | 170x realtime | 13x realtime | ~1x realtime |
| Speaker ID | ✅ Built-in | ❌ Needs pyannote | ✅ Extra cost |
| Emotion | ✅ Happy/Sad/Angry | ❌ | ❌ |
| Languages | 50+ | 57 | Varies |
| Streaming | ✅ WebSocket | ❌ | ✅ |
| vLLM Acceleration | ✅ 2-3x faster | ❌ | N/A |
| Self-hosted | ✅ MIT license | ✅ MIT license | ❌ Cloud only |
| Cost | Free | Free | $0.006/min+ |
| CPU viable | ✅ 17x realtime | ❌ Too slow | N/A |
Trying FunASR for the first time? Use the Colab quickstart before setting up a local environment. Choosing a first model? Start with the model selection guide. Planning a switch from Whisper or a cloud ASR provider? Use the migration guide and benchmark example to test representative audio, map features, and roll out safely.
<a name="benchmark"></a>
Benchmark
184 long-form audio files (192 min). Full report →
| Model | GPU Speed | CPU Speed | vs Whisper-large-v3 |
|---|---|---|---|
| SenseVoice-Small | 170x realtime | 17x realtime | 🚀 13x faster |
| Paraformer-Large | 120x realtime | 15x realtime | 🚀 9x faster |
| Whisper-large-v3-turbo | 46x realtime | ❌ | 3.4x faster |
| Fun-ASR-Nano | 17x realtime | 3.6x realtime | 1.3x faster |
| Whisper-large-v3 | 13x realtime | ❌ | baseline |
Key takeaway: SenseVoice-Small and Paraformer-Large run faster on CPU (17x and 15x realtime) than Whisper-large-v3 runs on GPU (13x realtime).
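To make the "Nx realtime" figures concrete, the sketch below converts them into wall-clock time for the 192-minute benchmark set. It assumes "Nx realtime" means audio duration divided by processing time; the function name is illustrative, and the speeds are copied from the GPU column of the table above.

```python
BENCHMARK_AUDIO_MIN = 192  # total audio length of the benchmark set

# GPU speeds (x realtime) from the benchmark table.
GPU_SPEED = {
    "SenseVoice-Small": 170,
    "Paraformer-Large": 120,
    "Whisper-large-v3-turbo": 46,
    "Fun-ASR-Nano": 17,
    "Whisper-large-v3": 13,
}


def processing_seconds(audio_minutes: float, speed: float) -> float:
    """Seconds needed to transcribe `audio_minutes` of audio at `speed` x realtime."""
    return audio_minutes * 60 / speed


baseline = GPU_SPEED["Whisper-large-v3"]
for name, speed in GPU_SPEED.items():
    seconds = processing_seconds(BENCHMARK_AUDIO_MIN, speed)
    print(f"{name}: {seconds:.0f} s, {speed / baseline:.1f}x vs Whisper-large-v3")
```

Under that assumption, SenseVoice-Small would need roughly 68 seconds for the full set, against roughly 15 minutes for Whisper-large-v3.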
What's new
- 2026/05/24: vLLM Inference Engine — 2-3x faster LLM decoding for Fun-ASR-Nano. Streaming WebSocket service with VAD + Speaker Diarization. Guide →
- 2026/05/24: Dynamic VAD — adaptive silence threshold (default on). Short sentences stay intact, long segments get auto-split. Details →
- 2026/05/24: v1.3.3 adds the `funasr-server` CLI, an OpenAI-compatible API, and an MCP Server for AI agents. Upgrade with `pip install --upgrade funasr`.
- 2026/05/20: Added Qwen3-ASR (0.6B/1.7B), 52 languages, auto detection. usage
- 2026/05/20: Added GLM-ASR-Nano (1.5B) — 17 languages, dialect support. usage
- 2026/05/19: Fun-ASR-Nano and SenseVoice now support speaker diarization.
- 2025/12/15: Fun-ASR-Nano-2512 — 31 languages, tens of millions of hours training.
- 2024/10/10: Whisper-large-v3-turbo support added.
- 2024/07/04: SenseVoice — ASR + emotion + audio events.
- 2024/01/30: FunASR 1.0 released.
Installation
```bash
pip install funasr
```

<details><summary>From source / Requirements</summary>

```bash
git clone https://github.com/modelscope/FunASR.git && cd FunASR
pip install -e ./
```

Requirements: Python ≥ 3.8. Install PyTorch + torchaudio first (pytorch.org), then `pip install funasr`.

</details>
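Before installing, it can help to confirm the environment meets the requirements listed above. The following is a small sketch using only the standard library; the helper names are illustrative and not part of FunASR.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 8)                        # from the stated requirement
REQUIRED = ("torch", "torchaudio", "funasr")


def python_ok(version=None) -> bool:
    """True if the given (major, minor) version, or the running one, is >= 3.8."""
    version = sys.version_info[:2] if version is None else version
    return tuple(version) >= MIN_PYTHON


def missing_packages(names=REQUIRED) -> list:
    """Return the top-level packages from `names` that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Python version OK:", python_ok())
print("Missing packages:", missing_packages())
```

`importlib.util.find_spec` only locates a package without importing it, so the check stays fast even when PyTorch is installed.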
<a name="model-zoo"></a>
Model Zoo
| Model | Task | Languages | Params | Links |
|---|---|---|---|---|
| Fun-ASR-Nano | ASR + timestamps | 31 languages | 800M | ⭐ 🤗 |
| SenseVoiceSmall | ASR + emotion + events | zh/en/ja/ko/yue | 234M | ⭐ 🤗 |
| Paraformer-zh | ASR + timestamps | zh/en | 220M | ⭐ 🤗 |
| Paraformer-zh-streaming | Streaming ASR | zh/en | 220M | ⭐ 🤗 |
| Qwen3-ASR | ASR, 52 languages | multilingual | 1.7B | usage |
| GLM-ASR-Nano | ASR, 17 languages | multilingual | 1.5B | usage |
| Whisper-large-v3 | ASR + translation | multilingual | 1550M | usage |
| Whisper-large-v3-turbo | ASR + translation | multilingual | 809M | usage |
| ct-punc | Punctuation | zh/en | 290M | ⭐ 🤗 |
| fsmn-vad | VAD | zh/en | 0.4M | ⭐ 🤗 |
| cam++ | Speaker diarization | — | 7.2M | ⭐ 🤗 |
| emotion2vec+large | Emotion recognition | — | 300M | ⭐ 🤗 |
Usage
Full examples with parameter docs: Tutorial →
```python
from funasr import AutoModel

# Chinese production (VAD + ASR + punctuation + speaker)
model = AutoModel(model="paraformer-zh", vad_model="fsmn-vad", punc_model="ct-punc",
                  spk_model="cam++", device="cuda")
result = model.generate(input="meeting.wav", hotword="关键词 20")

# 31 languages with timestamps
model = AutoModel(model="FunAudioLLM/Fun-ASR-Nano-2512", hub="hf", trust_remote_code=True,
                  vad_model="fsmn-vad", vad_kwargs={"max_single_segment_time": 30000},
                  device="cuda")
result = model.generate(input="audio.wav", batch_size=1)

# Streaming real-time
model = AutoModel(model="paraformer-zh-streaming", device="cuda")
result = model.generate(input="chunk.wav", cache={}, chunk_size=[0, 10, 5])

# Emotion recognition
model = AutoModel(model="emotion2vec_plus_large", device="cuda")
result = model.generate(input="audio.wav", granularity="utterance")
```
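The streaming call above handles one chunk; a real stream is fed chunk by chunk while the same `cache` dict is reused. The sketch below shows only the slicing step and runs without FunASR. It assumes 16 kHz audio and that each unit in `chunk_size[1]` corresponds to 960 samples (60 ms), so `[0, 10, 5]` gives 600 ms chunks; verify both assumptions against the tutorial. The helper name is illustrative.

```python
from typing import Iterator, Sequence, Tuple

SAMPLES_PER_UNIT = 960  # assumption: 60 ms per chunk_size unit at 16 kHz


def iter_chunks(samples: Sequence, chunk_size=(0, 10, 5)) -> Iterator[Tuple[Sequence, bool]]:
    """Yield (chunk, is_final) pairs covering `samples` in fixed strides."""
    stride = chunk_size[1] * SAMPLES_PER_UNIT
    total = max(1, -(-len(samples) // stride))  # ceiling division, at least one chunk
    for i in range(total):
        yield samples[i * stride:(i + 1) * stride], i == total - 1


# Feeding the model would then look roughly like this (not executed here;
# the is_final keyword is an assumption not shown in this README):
# cache = {}
# for chunk, is_final in iter_chunks(speech):
#     result = model.generate(input=chunk, cache=cache,
#                             is_final=is_final, chunk_size=[0, 10, 5])

demo = [0.0] * 20000  # 1.25 s of silence at 16 kHz
print([(len(chunk), final) for chunk, final in iter_chunks(demo)])
```

Flagging the last chunk lets the recognizer flush any text still held in its cache, which is why the generator reports `is_final` alongside each slice.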
Deploy
```bash
# OpenAI-compatible API (recommended)
pip install torch torchaudio
pip install funasr vllm fastapi uvicorn python-multipart
funasr-server --device cuda
# → POST /v1/audio/transcriptions at localhost:8000
```
Verify it with a public sample:
```bash
curl -L https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/BAC009S0764W0121.wav -o sample.wav
curl http://localhost:8000/v1/audio/transcriptions \
  -F file=@sample.wav \
  -F model=sensevoice \
  -F response_format=verbose_json
```

```bash
# Docker streaming service
docker pull registry.cn-hangzhou.aliyuncs.com/funasr_repo/funasr:funasr-runtime-sdk-online-cpu-0.1.12
```
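The same request can be issued from Python. The sketch below builds the multipart POST with only the standard library, reusing the form fields from the curl command (`file`, `model`, `response_format`). The function name is illustrative, and the response schema is not specified here, so the sending step is left as a comment.

```python
import urllib.request
import uuid


def build_transcription_request(audio_bytes: bytes, filename: str,
                                base_url: str = "http://localhost:8000",
                                model: str = "sensevoice",
                                response_format: str = "verbose_json") -> urllib.request.Request:
    """Build a multipart/form-data POST for /v1/audio/transcriptions."""
    boundary = uuid.uuid4().hex
    parts = []
    # Plain text form fields.
    for name, value in (("model", model), ("response_format", response_format)):
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode()
        )
    # The audio file itself.
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{filename}"\r\nContent-Type: application/octet-stream\r\n\r\n'.encode()
        + audio_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return urllib.request.Request(
        f"{base_url}/v1/audio/transcriptions",
        data=b"".join(parts),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


# To send it against a running server (not executed here):
# import json
# with open("sample.wav", "rb") as f:
#     req = build_transcription_request(f.read(), "sample.wav")
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

Because the endpoint is described as OpenAI-compatible, an OpenAI client library pointed at `http://localhost:8000/v1` may also work, though that is not verified here.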
OpenAI API example → · Gradio demo → · Client recipes → · JavaScript/TypeScript recipes → · Kubernetes template → · Workflow recipes → · Postman collection → · OpenAPI spec → · Security guide → · Deployment matrix → · Deployment docs → · Agent integration →
Community
| 📖 Documentation | 🐛 Issues |
| 💬 Discussions | 🤗 HuggingFace |
| 🤝 Contributing | 📈 20k growth plan |
Star History
<a href="https://star-history.com/#modelscope/FunASR&Date"> <picture> <source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/svg?repos=modelscope/FunASR&type=Date&theme=dark" /> <source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/svg?repos=modelscope/FunASR&type=Date" /> <img alt="Star History Chart" src="https://api.star-history.com/svg?repos=modelscope/FunASR&type=Date" width="600" /> </picture> </a>
License
Citations
```bibtex
@inproceedings{gao2023funasr,
  author={Zhifu Gao and others},
  title={FunASR: A Fundamental End-to-End Speech Recognition Toolkit},
  booktitle={INTERSPEECH},
  year={2023}
}
```
Contributors
Showing top 12 contributors by commit count.