GitPedia

Free svc

[ICASSP 2025] FreeSVC: Towards Zero-shot Multilingual Singing Voice Conversion

From freds0·Updated June 26, 2026·View on GitHub·

> **This is the official code implementation of FreeSVC [ICASSP 2025]** The project is written primarily in Python, distributed under the MIT License license, first published in 2024. Key topics include: singing-voice-conversion, speech-processing, svc, voice-conversion.

<div align="center">

FreeSVC: Towards Zero-shot Multilingual Singing Voice Conversion

</div>

arXiv
HuggingFace badge

This is the official code implementation of FreeSVC [ICASSP 2025]

Introduction

FreeSVC is a multilingual singing voice conversion model that converts singing voices across different languages. It leverages an enhanced VITS model integrated with Speaker-invariant Clustering (SPIN) and the ECAPA2 speaker encoder to effectively separate speaker characteristics from linguistic content. Designed for zero-shot learning, FreeSVC supports cross-lingual voice conversion without the need for extensive language-specific training.

Key Features

  • Multilingual Support: Incorporates trainable language embeddings, enabling effective handling of multiple languages without extensive language-specific training.
  • Advanced Speaker Encoding: Utilizes the State-of-the-Art (SOTA) speaker encoder ECAPA2 to disentangle speaker characteristics from linguistic content, ensuring high-quality voice conversion.
  • Zero-Shot Learning: Allows cross-lingual singing voice conversion even with unseen speakers, enhancing versatility and applicability.
  • Enhanced VITS Model with SPIN: Improves content representation for more accurate and natural voice conversion.
  • Optimized Cross-Language Conversion: Demonstrates the importance of a multilingual content extractor for achieving optimal performance in cross-language voice conversion tasks.

Model Architecture

FreeSVC builds upon the VITS architecture, integrating several key components:

  1. Content Extractor: Utilizes SPIN, an enhanced version of ContentVec based on HuBERT, to extract linguistic content while separating speaker timbre.
  2. Speaker Encoder: Employs ECAPA2 to capture unique speaker characteristics, ensuring accurate disentanglement from linguistic content.
  3. Pitch Extractor: Uses RMVPE to robustly extract vocal pitches from polyphonic music, preserving the original melody.
  4. Language Embeddings: Incorporates trainable language embeddings to condition the model for multilingual training and conversion.
<div align="center"> <img src="resources/freesvc.png" alt="FreeSVC Architecture" height="400"> <p><em>Figure 1: Comprehensive diagram of the FreeSVC model illustrating the training and inference procedures.</em></p> </div>

Dataset

FreeSVC is trained on a diverse set of speech and singing datasets covering multiple languages:

DatasetHoursSpeakersLanguageType
AISHELL-1170h214 F, 186 MChineseSpeech
AISHELL-385h176 F, 42 MChineseSpeech
CML-TTS3.1k231 F, 194 M7 LanguagesSpeech
HiFiTTS292h6 F, 4 MEnglishSpeech
JVS30h51 F, 49 MJapaneseSpeech
LibriTTS-R585h2,456EnglishSpeech
NUS (NHSS)7h5 F, 5 MEnglishBoth
OpenSinger50h41 F, 25 MChineseSinging
Opencpop5h1 FChineseSinging
PopBuTFy10h, 40h12, 22Chinese, EnglishSinging
POPCS5h1 FChineseSinging
VCTK44h109EnglishSpeech
VocalSet10h11 F, 9 MVariousSinging

Getting Started

  1. Clone the Repository:

    bash
    git clone https://github.com/freds0/free-svc.git cd free-svc
  2. Create a Docker Image:

    • Build the Docker image using the provided Dockerfile:
      bash
      docker build -t freesvc .
  3. Run the Docker Container:

    • Start the Docker container and mount the current directory:
      bash
      docker run -it --rm -v "$(pwd)":/workspace freesvc
  4. Prepare the Dataset:

    • Execute the dataset preparation script:
      bash
      bash prepare_{name}_dataset.sh
      Replace {name} with the appropriate dataset identifier.
  5. Download Required Models:

    • WavLM Large Model:

    • HifiGAN Model:

    • RMVPE Model:

      • Download from HuggingFace
      • Place the downloaded model in models/f0_predictor/ckpt
    • Discriminator:

      • Download the discriminator from FreeVC named D-freevc-24.pth inside the 24kHz folder, here.
  6. Install Dependencies:

    bash
    pip install -r requirements.txt
  7. Train the Model:

    • Run the training script with the appropriate configuration:
      bash
      python train.py --config-dir configs --config-name sovits-online_hubert data.dataset_dir={dataset_dir}
      Replace {dataset_dir} with the path to your dataset directory.

Audio Conversion

This section explains how to use the FreeSVC model for audio conversion.

bash
python scripts/inference.py --hpfile path/to/config.yaml \ --ptfile path/to/checkpoint.pth \ --input-base-dir path/to/input/directory \ --metadata-path path/to/metadata.csv \ --spk-emb-base-dir path/to/speaker/embeddings \ --out-dir path/to/output_directory \ [--use-vad] \ [--use-timestamp] \ [--concat-audio] \ [--pitch-factor PITCH_FACTOR]

Parameters:

  • --hpfile: Path to the configuration YAML file
  • --ptfile: Path to the model checkpoint file
  • --input-base-dir: Base directory containing source audio files
  • --metadata-path: Path to the CSV metadata file
  • --spk-emb-base-dir: Directory containing speaker embeddings
  • --out-dir: Output directory for converted audio (default: "gen-samples/")

Optional Parameters:

  • --pitch-predictor: Pitch predictor model type (default: "rmvpe")
  • --use-vad: Enable Voice Activity Detection for better segmentation
  • --use-timestamp: Add timestamps to output filenames
  • --concat-audio: Concatenate all converted segments into a single file
  • --pitch-factor: Adjust pitch modification factor (default: 0.9544)
  • --ignore-metadata-header: Skip first row of metadata CSV (default: True)

Metadata CSV Format:

source_path|source_lang|source_speaker|target_speaker|target_lang
./audio/source1.wav|en|speaker1|speaker2|ja
./audio/source2.wav|zh|speaker3|speaker4|en

Required columns:

  • source_path: Path to source audio file (relative to input-base-dir)
  • source_lang: Source language code (e.g., 'en', 'ja', 'zh')
  • source_speaker: Source speaker identifier
  • target_speaker: Target speaker identifier
  • target_lang: Target language code
  • transcript: (Optional) Text transcript of the audio

Output Directory Structure
The converted audio files will be organized in the following structure:

output_dir/
└── metadata_name/
    └── source_lang/
        └── target_lang/
            └── source_speaker/
                └── target_speaker/
                    └── converted_audio.wav

Additional Notes

  • Voice Activity Detection (VAD): When VAD is enabled using the --use-vad flag, the system performs intelligent speech segmentation on the input audio. It automatically detects and isolates speech segments for processing while maintaining non-speech portions of the audio. Each detected speech segment is processed independently, and the system then reconstructs the full audio by concatenating all segments in their original order. This approach ensures high-quality conversion while preserving the natural rhythm and timing of the original audio.

  • Pitch Adjustment: The system offers precise pitch control through the --pitch-factor parameter. This factor acts as a multiplier for the output pitch. Users can fine-tune this parameter to achieve the desired pitch characteristics in the converted audio.

  • Audio Concatenation: The --concat-audio option provides a convenient way to combine multiple conversions into a single audio file. When enabled, the system will automatically merge all converted segments into one continuous audio file, saved as "all.wav" in the output directory. This feature is particularly useful when processing multiple short segments that belong together or when creating a compilation of converted audio.

Pretrained Models

License Notice:
The pretrained model weights are released under CC BY-NC-SA due to dataset licensing restrictions. This means they cannot be used for commercial purposes. However, the code itself remains under the MIT License.

Pretrained weights for FreeSVC are available on the Hugging Face Model Hub. We provide both the backbone model and our fine-tuned multilingual version of SPIN.

Available Files

To utilize the pretrained models, download the following files to your local machine:

Quality Notice

Please be aware that this model is a preliminary release. It has been primarily fine-tuned on speech datasets, which may result in quality limitations. We are actively working on higher-quality models that will address these issues and deliver enhanced performance for both singing and speech conversion in the near future.

TODO

  • Train a high-quality version exclusively for singing.
  • Train a high-quality version exclusively for speech.

License

The code in this repository is licensed under the MIT License, allowing for commercial and unrestricted use.

However, the pre-trained model weights are subject to dataset-specific restrictions and are licensed under CC BY-NC-SA (Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License). This means that the weights cannot be used for commercial purposes.

If you plan to use the pretrained models, please ensure that your use case complies with the dataset restrictions.

Citation

@INPROCEEDINGS{10890068,
  author={Ferreira, Alef Iury and Gris, Lucas Rafael and Da Rosa, Augusto and Oliveira, Frederico and Casanova, Edresson and Sousa, Rafael and Junior, Arnaldo and Soares, Anderson and Filho, Arlindo Galvão},
  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, 
  title={FreeSVC: Towards Zero-shot Multilingual Singing Voice Conversion}, 
  year={2025},
  volume={},
  number={},
  pages={1-5},
  keywords={Training;Source coding;Zero shot learning;Refining;Signal processing;Data models;Acoustics;Multilingual;Data mining;Speech synthesis;Singing Voice Conversion;Synthesis of Singing Voices;Cross-lingual and multilingual aspects in speech synthesis},
  doi={10.1109/ICASSP49660.2025.10890068}}

Acknowledgements

Contributors

Showing top 5 contributors by commit count.

View all contributors on GitHub →

This article is auto-generated from freds0/free-svc via the GitHub API.Last fetched: 6/26/2026