Klaam
Arabic speech recognition, classification and text-to-speech.
Arabic speech recognition, classification and text-to-speech using models such as wav2vec2 and FastSpeech2. This repository supports training and prediction with pretrained models. The project is written primarily in Jupyter Notebook, is distributed under the MIT License, and was first published in 2021. Key topics include: arabic, arabic-speech-recognition, asr, tts.
<p align="center"> <img src="https://raw.githubusercontent.com/ARBML/klaam/main/misc/klaam_logo.png" width="250px"/> </p>

1. Usage
1.1 Speech Classification
```python
from klaam import SpeechClassification

model = SpeechClassification()
model.classify(wav_file)  # wav_file: path to a .wav file
```
1.2 Speech Recognition
```python
from klaam import SpeechRecognition

model = SpeechRecognition()
model.transcribe(wav_file)  # wav_file: path to a .wav file
```
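The wav2vec2 XLSR checkpoints were pretrained on 16 kHz audio, so it can help to inspect a recording before transcribing it. The sketch below uses only the standard library; `describe_wav` and `is_16khz_mono` are hypothetical helpers, not part of klaam, and whether klaam resamples audio internally is not covered here.

```python
import wave


def describe_wav(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM .wav file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, channels, duration


def is_16khz_mono(path):
    """True if the file already matches the 16 kHz mono format assumed here."""
    rate, channels, _ = describe_wav(path)
    return rate == 16000 and channels == 1
```

If `is_16khz_mono('file.wav')` returns `False`, resampling the recording first may be a reasonable precaution.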
1.3 Text To Speech
```python
from klaam import TextToSpeech

prepare_tts_model_path = "../cfgs/FastSpeech2/config/Arabic/preprocess.yaml"
model_config_path = "../cfgs/FastSpeech2/config/Arabic/model.yaml"
train_config_path = "../cfgs/FastSpeech2/config/Arabic/train.yaml"
vocoder_config_path = "../cfgs/FastSpeech2/model_config/hifigan/config.json"
speaker_pre_trained_path = "../data/model_weights/hifigan/generator_universal.pth.tar"

model = TextToSpeech(
    prepare_tts_model_path,
    model_config_path,
    train_config_path,
    vocoder_config_path,
    speaker_pre_trained_path,
)
model.synthesize(sample_text)
```
There are two available recognition models, targeting Modern Standard Arabic (MSA) and the Egyptian dialect (EGY). You can select either one using the lang argument.
```python
from klaam import SpeechRecognition

model = SpeechRecognition(lang='msa')
model.transcribe('file.wav')
```
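To make the choice explicit, a small lookup can tie each language code to the recognition model named in the Models section. This is an illustrative sketch only: `RECOGNITION_MODELS` and `resolve_model` are hypothetical names that are not part of klaam, and the `'egy'` code is an assumption, since only `'msa'` appears in the example above.

```python
# Model names are copied from the Models table; the 'egy' key is an assumption.
RECOGNITION_MODELS = {
    "msa": "wav2vec2-large-xlsr-53-arabic",
    "egy": "wav2vec2-large-xlsr-53-arabic-egyptian",
}


def resolve_model(lang):
    """Return the model name for a language code, or raise on an unknown code."""
    key = lang.lower()
    if key not in RECOGNITION_MODELS:
        supported = ", ".join(sorted(RECOGNITION_MODELS))
        raise ValueError(f"Unsupported lang {lang!r}; expected one of: {supported}")
    return RECOGNITION_MODELS[key]
```

Failing early on an unknown code avoids silently falling back to the wrong dialect model.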
2. Datasets
| Dataset | Description | Link |
|---|---|---|
| MGB-3 | Egyptian Arabic speech recognition in the wild. Every sentence was annotated by four annotators. More than 15 hours have been collected from YouTube. | here [Registration required] |
| ADI-5 | More than 50 hours collected from Aljazeera TV, covering four regional dialects, Egyptian (EGY), Levantine (LAV), Gulf (GLF) and North African (NOR), plus Modern Standard Arabic (MSA). This dataset is part of the MGB-3 challenge. | here [Registration required] |
| Common Voice | Multilingual dataset available on Hugging Face | here |
| Arabic Speech Corpus | Arabic dataset with alignments and transcriptions | here |
3. Models
The project currently supports four models, three of which are available through the transformers library.
| Language | Description | Source |
|---|---|---|
| Egyptian | Speech recognition | wav2vec2-large-xlsr-53-arabic-egyptian |
| Standard Arabic | Speech recognition | wav2vec2-large-xlsr-53-arabic |
| EGY, NOR, LAV, GLF, MSA | Speech classification | wav2vec2-large-xlsr-dialect-classification |
| Standard Arabic | Text-to-Speech | fastspeech2 |
4. Example Notebooks
<table>
<tr> <th><b>Name</b></th> <th><b>Description</b></th> <th><b>Notebook</b></th> </tr>
<tr> <td>Demo</td> <td>Classification, recognition and text-to-speech in a few lines of code.</td> <td><a href="https://colab.research.google.com/github/ARBML/klaam/blob/main/notebooks/demo.ipynb"> <img src="https://colab.research.google.com/assets/colab-badge.svg" > </a></td> </tr>
<tr> <td>Demo with mic</td> <td>Audio recognition and classification with recording.</td> <td><a href="https://colab.research.google.com/github/ARBML/klaam/blob/main/notebooks/demo_with_mic.ipynb"> <img src="https://colab.research.google.com/assets/colab-badge.svg"> </a></td> </tr>
</table>

5. Training
The scripts are a modification of jqueguiner/wav2vec2-sprint.
5.1. Classification
This script trains the dialect classifier on the five classes (EGY, NOR, LAV, GLF, MSA).
```sh
python run_classifier.py \
    --model_name_or_path="facebook/wav2vec2-large-xlsr-53" \
    --output_dir=/path/to/output \
    --cache_dir=/path/to/cache/ \
    --freeze_feature_extractor \
    --num_train_epochs="50" \
    --per_device_train_batch_size="32" \
    --preprocessing_num_workers="1" \
    --learning_rate="3e-5" \
    --warmup_steps="20" \
    --evaluation_strategy="steps" \
    --save_steps="100" \
    --eval_steps="100" \
    --save_total_limit="1" \
    --logging_steps="100" \
    --do_eval \
    --do_train
```
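When launching several runs, the flags above can be assembled from a dictionary instead of edited by hand. The sketch below is one way to do that; `build_command` is a hypothetical helper, and it only builds the argument list without assuming anything about how `run_classifier.py` interprets it.

```python
import shlex
import sys


def build_command(script, options, flags=()):
    """Build an argv list: key/value options become --key=value, flags become --flag."""
    argv = [sys.executable, script]
    argv += [f"--{name}={value}" for name, value in options.items()]
    argv += [f"--{flag}" for flag in flags]
    return argv


options = {
    "model_name_or_path": "facebook/wav2vec2-large-xlsr-53",
    "output_dir": "/path/to/output",
    "num_train_epochs": 50,
    "learning_rate": "3e-5",
}
command = build_command("run_classifier.py", options, flags=("do_train", "do_eval"))

# Print a copy-pasteable shell line; pass `command` to subprocess.run to execute it.
print(shlex.join(command))
```

Keeping the arguments as a list avoids shell quoting problems when a path contains spaces.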
5.2. Recognition
This script trains the recognition model on the Egyptian dialect dataset (MGB-3), starting from the pretrained facebook/wav2vec2-large-xlsr-53 checkpoint.
```sh
python run_mgb3.py \
    --model_name_or_path="facebook/wav2vec2-large-xlsr-53" \
    --output_dir=/path/to/output \
    --cache_dir=/path/to/cache/ \
    --freeze_feature_extractor \
    --num_train_epochs="50" \
    --per_device_train_batch_size="32" \
    --preprocessing_num_workers="1" \
    --learning_rate="3e-5" \
    --warmup_steps="20" \
    --evaluation_strategy="steps" \
    --save_steps="100" \
    --eval_steps="100" \
    --save_total_limit="1" \
    --logging_steps="100" \
    --do_eval \
    --do_train
```
This script can be used for training on the Arabic Common Voice dataset.
```sh
python run_common_voice.py \
    --model_name_or_path="facebook/wav2vec2-large-xlsr-53" \
    --dataset_config_name="ar" \
    --output_dir=/path/to/output/ \
    --cache_dir=/path/to/cache \
    --overwrite_output_dir \
    --num_train_epochs="1" \
    --per_device_train_batch_size="32" \
    --per_device_eval_batch_size="32" \
    --evaluation_strategy="steps" \
    --learning_rate="3e-4" \
    --warmup_steps="500" \
    --fp16 \
    --freeze_feature_extractor \
    --save_steps="10" \
    --eval_steps="10" \
    --save_total_limit="1" \
    --logging_steps="10" \
    --group_by_length \
    --feat_proj_dropout="0.0" \
    --layerdrop="0.1" \
    --gradient_checkpointing \
    --do_train --do_eval \
    --max_train_samples 100 --max_val_samples 100
```
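The interaction between `--learning_rate` and `--warmup_steps` is easier to see numerically. The sketch below assumes linear warmup followed by linear decay to zero, which is the default scheduler of the Hugging Face `Trainer`; whether `run_common_voice.py` keeps that default is an assumption, and `total_steps` is a made-up value for illustration.

```python
def linear_warmup_lr(step, base_lr=3e-4, warmup_steps=500, total_steps=2000):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        # Ramp up proportionally to the number of steps taken so far.
        return base_lr * step / max(1, warmup_steps)
    # After warmup, decay linearly until total_steps is reached.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 250, 500, 1250, 2000):
    print(step, linear_warmup_lr(step))
```

With the sample limits above (100 training samples at a per-device batch size of 32, one epoch), a single-device run takes only about four optimizer steps, so under this assumed schedule it would end while still early in warmup.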
5.3. Text To Speech
We use the PyTorch implementation of FastSpeech2 by ming024.
The procedure is as follows:
- Download the dataset and unzip it.
```
wget http://en.arabicspeechcorpus.com/arabic-speech-corpus.zip
unzip arabic-speech-corpus.zip
```
- Create the data directories and copy the TextGrid alignments into place.
```
mkdir -p raw_data/Arabic/Arabic preprocessed_data/Arabic/TextGrid/Arabic
cp arabic-speech-corpus/textgrid/* preprocessed_data/Arabic/TextGrid/Arabic
```
- Prepare metadata
```python
import os

base_dir = '/content/arabic-speech-corpus'
lines = []
for lab_file in os.listdir(f'{base_dir}/lab'):
    # Each row is "<file name without .lab>|<transcript>".
    with open(f'{base_dir}/lab/{lab_file}', 'r') as f:
        lines.append(lab_file[:-4] + '|' + f.read())

# Write one row per utterance.
with open(f'{base_dir}/metadata.csv', 'w') as f:
    f.write('\n'.join(lines))
```
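Before moving on, it can be worth checking that every row of `metadata.csv` has the `name|transcript` shape produced above. The following is a minimal sketch; `read_metadata` is a hypothetical helper, not part of klaam or FastSpeech2.

```python
def read_metadata(path):
    """Parse 'name|transcript' rows into a dict, raising on malformed or duplicate rows."""
    entries = {}
    with open(path, "r") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.rstrip("\n")
            if not line:
                continue  # tolerate blank lines
            name, sep, transcript = line.partition("|")
            if not sep or not name or not transcript.strip():
                raise ValueError(f"Malformed row at line {line_number}: {line!r}")
            if name in entries:
                raise ValueError(f"Duplicate utterance name at line {line_number}: {name}")
            entries[name] = transcript
    return entries
```

Calling `read_metadata(f'{base_dir}/metadata.csv')` returns one entry per utterance, so its length can be compared against the number of files in the `lab` directory.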
- Clone the FastSpeech2 repository and install the required dependencies.
```bash
git clone --depth 1 https://github.com/zaidalyafeai/FastSpeech2
cd FastSpeech2
pip install -r requirements.txt
```
- Prepare the alignments and preprocessed data.
```
python3 prepare_align.py config/Arabic/preprocess.yaml
python3 preprocess.py config/Arabic/preprocess.yaml
```
- Unzip vocoders.
```
unzip hifigan/generator_LJSpeech.pth.tar.zip -d hifigan
unzip hifigan/generator_universal.pth.tar.zip -d hifigan
```
- Start the training.
```
python3 train.py -p config/Arabic/preprocess.yaml -m config/Arabic/model.yaml -t config/Arabic/train.yaml
```
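A quick pre-flight check can catch a missing file before a long training run starts. The sketch below looks for the config files and vocoder weights named in the steps above, relative to the FastSpeech2 checkout; `missing_paths` is a hypothetical helper, and treating exactly these four paths as required is an assumption of this sketch.

```python
from pathlib import Path

# Paths taken from the commands above, relative to the FastSpeech2 checkout.
REQUIRED_PATHS = [
    "config/Arabic/preprocess.yaml",
    "config/Arabic/model.yaml",
    "config/Arabic/train.yaml",
    "hifigan/generator_universal.pth.tar",
]


def missing_paths(root, required=REQUIRED_PATHS):
    """Return the required paths that do not exist under `root`."""
    root = Path(root)
    return [p for p in required if not (root / p).exists()]
```

Running `missing_paths(".")` from inside the checkout returns an empty list when everything is in place.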
This repository was created by the ARBML team. If you have any suggestions or contributions, feel free to open a pull request.
