Sova dataset

From sovaai · Updated May 25, 2026

SOVA Dataset is a free public STT/ASR dataset covering the Russian, English and Chinese languages, with about 32 328 hours of audio (about 3.21 TB in `.wav` format). The project was first published in 2019. Its license is reported as Other; the License section below names it as Creative Commons BY 4.0. Key topics include: audio, audio-data, audio-dataset, audio-datasets, chinese-dataset.

Latest release: v0.4.0 (November 8, 2022)

SOVA Dataset

SOVA Dataset is a free public STT/ASR dataset.

Key facts:

  • Russian, English and Chinese languages
  • ~ 32 328 hours
  • ~ 3.21 TB in .wav format

Dataset composition

| Name | Lang | Hours | Size | Source | Equipment | Annotation | Speech type | Augmentation | Quality |
|---|---|---|---|---|---|---|---|---|---|
| EngAudiobooksOriginal | EN | 7 130 | 743 GB | audiobook | professional | forced alignment | reading | none | 95% |
| EngAudiobooksNoisy | EN | 3 873 | 310 GB | audiobook | professional | forced alignment | reading | phone calls | 95% |
| RuAudiobooksDevices | RU | 298 | 30.24 GB | audiobook | unprofessional | manual | reading | none | 99% |
| RuDevices | RU | 101 | 10.42 GB | audio records | unprofessional | manual | live speech | none | 98% |
| RuYoutube | RU | 17 451 | 1 873 GB | audio records | unprofessional | ASR | live speech | none | 95% |
| ZhYoutube | CN | 3 475.1 | 321 GB | audio records | unprofessional | ASR | live speech | none | 97.83% |
| TOTAL | - | 32 328.1 | 3 287.66 GB (3.21 TB) | - | - | - | - | - | - |
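
The TOTAL row can be re-derived from the per-subset rows. The following is a minimal sketch, not part of the original project: the `SUBSETS` dictionary and the `totals` helper are hypothetical names, and the figures are copied by hand from the table above. The conversion divides by 1024, on the assumption that the table uses binary units, because that is the divisor that reproduces the stated 3.21 TB.

```python
# Per-subset figures copied from the composition table: (hours, size in GB).
# SUBSETS and totals() are illustrative names, not part of the dataset tooling.
SUBSETS = {
    "EngAudiobooksOriginal": (7130.0, 743.0),
    "EngAudiobooksNoisy": (3873.0, 310.0),
    "RuAudiobooksDevices": (298.0, 30.24),
    "RuDevices": (101.0, 10.42),
    "RuYoutube": (17451.0, 1873.0),
    "ZhYoutube": (3475.1, 321.0),
}


def totals(subsets):
    """Return (total hours, total size in GB, total size in TB)."""
    hours = sum(h for h, _ in subsets.values())
    size_gb = sum(s for _, s in subsets.values())
    # Assumption: 1 TB = 1024 GB, which matches the table's 3.21 TB figure.
    size_tb = size_gb / 1024
    return round(hours, 1), round(size_gb, 2), round(size_tb, 2)


if __name__ == "__main__":
    hours, size_gb, size_tb = totals(SUBSETS)
    print(f"{hours} hours, {size_gb} GB, {size_tb} TB")
```

Dividing by 1000 instead would give about 3.29 TB, so the binary convention appears to be the one the table follows.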

Audio characteristics

  • Bit rate mode: constant
  • Bit rate: 256 kbps
  • Channel(s): 1 channel
  • Sample rate: 16.0 kHz
  • Bit depth: 16 bit
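
The listed bit rate follows from the other characteristics: 16 000 samples per second × 16 bits × 1 channel = 256 000 bits per second. The sketch below shows that arithmetic and one possible way to check that a downloaded file matches the listed format, using Python's standard `wave` module. The function names and constants are hypothetical, and the check assumes the files are uncompressed PCM `.wav` files that `wave` can open.

```python
import wave

# Expected values taken from the "Audio characteristics" list.
EXPECTED_CHANNELS = 1
EXPECTED_SAMPLE_RATE_HZ = 16_000
EXPECTED_BIT_DEPTH = 16


def pcm_bit_rate_kbps(sample_rate_hz, bit_depth, channels):
    """Bit rate of uncompressed PCM audio in kilobits per second."""
    return sample_rate_hz * bit_depth * channels / 1000


def matches_expected_format(path):
    """Return True if the WAV header at `path` matches the listed format."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getnchannels() == EXPECTED_CHANNELS
            and wav.getframerate() == EXPECTED_SAMPLE_RATE_HZ
            # getsampwidth() returns bytes per sample, so convert to bits.
            and wav.getsampwidth() * 8 == EXPECTED_BIT_DEPTH
        )


if __name__ == "__main__":
    print(pcm_bit_rate_kbps(EXPECTED_SAMPLE_RATE_HZ,
                            EXPECTED_BIT_DEPTH,
                            EXPECTED_CHANNELS))
```

A check like `matches_expected_format("some_file.wav")` (the path is a placeholder) only inspects the header; it does not validate the audio content or the transcript.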

Updates

Contacts

For any questions, please feel free to contact us at support@sova.ai.

License

SOVA Dataset is licensed under the Creative Commons BY 4.0 license by Virtual Assistant, LLC.

This article is auto-generated from sovaai/sova-dataset via the GitHub API. Last fetched: 6/26/2026