SOVA Dataset

From sovaai · Updated May 25, 2026

SOVA Dataset covers Russian, English and Chinese speech: about 32 328 hours and about 3,21 TB in `.wav` format. The project is distributed under the Creative Commons BY 4.0 license and was first published in 2019. Key topics include: audio, audio-data, audio-dataset, audio-datasets, chinese-dataset.

Latest release: v0.4.0 (November 8, 2022)

SOVA Dataset is a free public STT/ASR dataset.

Key facts:

Russian, English and Chinese languages
~ 32 328 hours
~ 3,21 TB in .wav format

Dataset composition

Name	Lang	Hours	Size	Source	Equipment	Annotation	Speech type	Augmentation	Quality
EngAudiobooksOriginal	EN	7 130	743 GB	audiobook	professional	forced alignment	reading	none	95%
EngAudiobooksNoisy	EN	3 873	310 GB	audiobook	professional	forced alignment	reading	phone calls	95%
RuAudiobooksDevices	RU	298	30,24 GB	audiobook	unprofessional	manual	reading	none	99%
RuDevices	RU	101	10,42 GB	audio records	unprofessional	manual	live speech	none	98%
RuYoutube	RU	17 451	1 873 GB	audio records	unprofessional	ASR	live speech	none	95%
ZhYoutube	CN	3 475,1	321 GB	audio records	unprofessional	ASR	live speech	none	97.83%
TOTAL	-	32 328,1	3 287,66 GB (3,21 TB)	-	-	-	-	-	-
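
The TOTAL row can be cross-checked against the six subsets. The following is a minimal Python sketch, not part of the dataset tooling: the hours and sizes are copied from the table, while the names `ROWS`, `GB_PER_TB` and `totals` are illustrative. It assumes the table converts sizes with 1 TB = 1024 GB, which is the factor that reproduces the stated 3,21 TB.

```python
# Rows copied from the composition table: (name, hours, size in GB).
ROWS = [
    ("EngAudiobooksOriginal", 7130.0, 743.0),
    ("EngAudiobooksNoisy", 3873.0, 310.0),
    ("RuAudiobooksDevices", 298.0, 30.24),
    ("RuDevices", 101.0, 10.42),
    ("RuYoutube", 17451.0, 1873.0),
    ("ZhYoutube", 3475.1, 321.0),
]

# Assumption: the table uses the binary factor, 1 TB = 1024 GB.
GB_PER_TB = 1024


def totals(rows):
    """Return (total hours, total size in GB, total size in TB), rounded."""
    hours = round(sum(h for _, h, _ in rows), 1)
    size_gb = round(sum(s for _, _, s in rows), 2)
    size_tb = round(size_gb / GB_PER_TB, 2)
    return hours, size_gb, size_tb


if __name__ == "__main__":
    print(totals(ROWS))  # (32328.1, 3287.66, 3.21)
```

With a decimal factor (1 TB = 1000 GB) the same sizes would come to about 3,29 TB, so the binary factor appears to be the one used.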

Audio characteristics

Bit rate mode: constant
Bit rate: 256 kbps
Channel(s): 1 channel
Sample rate: 16.0 kHz
Bit depth: 16 bit
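
These values are consistent with uncompressed PCM audio: 16 000 samples per second × 16 bits × 1 channel = 256 000 bit/s, or 256 kbps. As a hedged illustration, a downloaded file could be checked against the listed characteristics with Python's standard `wave` module. The function names and constants below are hypothetical helpers, not part of the dataset, and the sketch assumes the files are plain PCM `.wav` files that `wave` can read.

```python
import wave

# Values taken from the "Audio characteristics" list.
EXPECTED_CHANNELS = 1
EXPECTED_SAMPLE_RATE_HZ = 16_000
EXPECTED_BIT_DEPTH = 16


def pcm_bit_rate_kbps(sample_rate_hz, bit_depth, channels):
    """Bit rate of uncompressed PCM audio in kbps (1 kbps = 1000 bit/s)."""
    return sample_rate_hz * bit_depth * channels / 1000


def matches_expected_format(path):
    """Return True if the WAV file at `path` is mono, 16 kHz, 16-bit PCM."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getnchannels() == EXPECTED_CHANNELS
            and wav.getframerate() == EXPECTED_SAMPLE_RATE_HZ
            # getsampwidth() returns bytes per sample, so convert to bits.
            and wav.getsampwidth() * 8 == EXPECTED_BIT_DEPTH
        )


if __name__ == "__main__":
    print(pcm_bit_rate_kbps(EXPECTED_SAMPLE_RATE_HZ, EXPECTED_BIT_DEPTH, EXPECTED_CHANNELS))
```

Calling `matches_expected_format("sample.wav")` on a local file (the path is a placeholder) gives a quick sanity check before feeding audio into a training pipeline that expects 16 kHz mono input.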

Updates

08/11/2022: Release v0.4.0
10/12/2021: Release v0.3.0
22/12/2020: Release v0.2.0
24/12/2019: Published dataset with 116 hours.

Contacts

For any questions, please feel free to contact us at support@sova.ai.

License

SOVA Dataset is licensed under the Creative Commons BY 4.0 license by Virtual Assistant, LLC.

Contributors

Top 2 contributors by commit count:

ZubarevEgor: 27 commits
Psyrakt: 1 commit

This article is auto-generated from sovaai/sova-dataset via the GitHub API. Last fetched: 6/26/2026