<table> <tr> <td align="center" width="50%"> <img src="examples/image-key-visual-m2d.jpg" alt="key_visual_M2D" width="95%"><br> Masked Modeling Duo (M2D) </td> <td align="center" width="50%"> <img src="examples/image-key-vis-m2d-clap.jpg" alt="key_visual_M2D-CLAP" width="73%"><br> M2D-CLAP </td> </tr> </table>

Masked Modeling Duo (M2D) & M2D-CLAP

This repository provides demo implementations of our papers, including "Masked Modeling Duo: Towards a Universal Audio Pre-training Framework", "M2D-CLAP: Exploring General-purpose Audio-Language Representations Beyond CLAP", and "Masked Modeling Duo: Learning Representations by Encouraging Both Networks to Model the Input".

🌟 Looking for the best general-purpose audio model? M2D-CLAP achieves state-of-the-art performance on audio tagging, zero-shot classification, and audio-language tasks — try it instantly in Colab.

Quick Start

| Description | Notebook |
| --- | --- |
| Audio tagging example (M2D) | examples/Colab_M2D_example_Tagging.ipynb |
| Zero-shot ESC-50 classification with M2D-CLAP | examples/Colab_M2D-CLAP_ESC-50_ZS.ipynb |
| Audio feature visualization example with M2D-CLAP | examples/Colab_M2D-CLAP_ESC-50_VizualizeEmbs.ipynb |

The example below uses M2D-CLAP, our recommended model. You can load it and encode audio in just a few lines:

```python
# 🔊 Load model
from examples.portable_m2d import PortableM2D  # portable_m2d: a simple one-file loader
model = PortableM2D('m2d_clap_vit_base-80x1001p16x16p16kpBpTI-2025/checkpoint-30.pth')

# 🎵 Prepare input (three 10-s waveforms, range [-1., 1.])
import torch
batch_audio = 2 * torch.rand((3, 10 * 16000)) - 1.0

# 📐 Encode → frame-level features
frame_level = model(batch_audio)
print(frame_level.shape)  # torch.Size([3, 63, 3840])

# 📦 Aggregate → clip-level features
clip_level = torch.mean(frame_level, dim=1)
print(clip_level.shape)  # torch.Size([3, 3840])
```
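
The shapes printed above can be reproduced with simple arithmetic. The sketch below is an illustration only, not code from the repository. It assumes a 10 ms hop (160 samples at 16 kHz) with one extra frame from centered padding, 80 mel bins, 16x16 patches, and a 768-dimensional ViT-Base embedding whose frequency patches are concatenated per time frame. These values are inferred from the `80x1001p16x16p16k` part of the checkpoint name, and the function name `expected_feature_shape` is hypothetical.

```python
import math

def expected_feature_shape(batch, seconds, sample_rate=16000, hop=160,
                           n_mels=80, patch_freq=16, patch_time=16, embed_dim=768):
    """Estimate (batch, time frames, feature dim) of frame-level features.

    All defaults are assumptions inferred from the checkpoint name,
    not values read from the model itself.
    """
    n_samples = seconds * sample_rate
    n_spec_frames = n_samples // hop + 1             # 1001 spectrogram frames for 10 s
    n_time = math.ceil(n_spec_frames / patch_time)   # patches along time, rounded up
    n_freq = n_mels // patch_freq                    # patches along frequency
    return (batch, n_time, n_freq * embed_dim)       # frequency patches concatenated

shape = expected_feature_shape(batch=3, seconds=10)
print(shape)  # (3, 63, 3840)
```

If your own input length gives a different number of frames than this estimate, trust the model output; the padding behavior here is an assumption.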

Pre-trained/Fine-tuned Weights

AudioSet pre-trained weights

| Description | Recommendation | Weight | Fur-PT Ready | AS2M mAP |
| --- | --- | --- | --- | --- |
| M2D-CLAP_2025 ⭐ | Recommended. Best for CLAP / audio tagging (AT) / sound event detection (SED). | m2d_clap_vit_base-80x1001p16x16p16kpBpTI-2025 | ✅ | 0.490 |
| M2D-CLAP_2024, additionally fine-tuned on AS2M | 2nd best for AT/SED. (Encoder only) | m2d_clap_vit_base-80x1001p16x16-240128_AS-FT_enconly | N/A | 0.485 |
| M2D-AS fine-tuned on AS2M | 3rd best for AT/SED. (Encoder only) | m2d_as_vit_base-80x1001p16x16-240213_AS-FT_enconly | N/A | 0.485 |
| M2D/0.7 fine-tuned on AS2M | 4th best for AT/SED. (Encoder only) | m2d_vit_base-80x1001p16x16-221006-mr7_as_46ab246d | N/A | 0.479 |
| M2D/0.7 | General-purpose transfer learning and further pre-training. | m2d_vit_base-80x608p16x16-221006-mr7 | ✅ | - |
| M2D/0.7 | General-purpose transfer learning. (Encoder only) | m2d_vit_base-80x608p16x16-221006-mr7_enconly | N/A | - |
| M2D/0.7 | General-purpose transfer learning. (Encoder only) | m2d_vit_base-80x608p16x16-220930-mr7_enconly | N/A | - |
| M2D/0.7 (t.f. 40ms) | General-purpose transfer learning and further pre-training with a finer time frame. | m2d_vit_base-80x200p16x4-230529 | ✅ | - |
| M2D-X/0.7 (η = 0.3) | The best ICBHI 2017 model in Section IV-E of the TASLP paper. | m2d_x_icbhi | N/A | - |
| M2D/0.6 | General-purpose transfer learning and further pre-training. | m2d_vit_base-80x608p16x16-221006-mr6 | ✅ | - |
| M2D-CLAP_2024 (Older) | General-purpose transfer learning and further pre-training, especially when application data is closer to the AudioSet ontology. | m2d_clap_vit_base-80x608p16x16-240128 | ✅ | - |
| M2D-AS | General-purpose transfer learning and further pre-training, especially when application data is closer to the AudioSet ontology. | m2d_as_vit_base-80x608p16x16-240213 | ✅ | - |
| MSM-MAE/0.75 | Predecessor to M2D; for reproducibility or comparison. | msm_mae_vit_base-80x608p16x16-220924-mr75 | ✅ | - |

| Description | Recommendation | Weight | Fur-PT Ready | AS2M mAP |
| --- | --- | --- | --- | --- |
| M2D-AS fine-tuned on AS2M@32kHz | Best for audio tagging (AT) / sound event detection (SED) at 32 kHz. | m2d_as_vit_base-80x1001p16x16p32k-240413_AS-FT_enconly | N/A | 0.480 |
| M2D-AS@32kHz | General-purpose transfer learning at 32 kHz. (Encoder only) | m2d_as_vit_base-80x608p16x16p32k-240413_enconly | N/A | - |

LibriSpeech pre-trained weights

| Description | Recommendation | Weight | Fur-PT Ready | AS2M mAP |
| --- | --- | --- | --- | --- |
| M2D-S/0.6 6-s input | Speech transfer learning and further pre-training. | m2d_s_vit_base-80x608p80x2-230220 | ✅ | - |
| M2D-S/0.6 5-s input | Speech transfer learning and further pre-training. | m2d_s_vit_base-80x512p80x2-230301 | ✅ | - |
| M2D-S/0.6 4-s input | Speech transfer learning and further pre-training. | m2d_s_vit_base-80x400p80x2-230201 | ✅ | - |
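
Most weight names appear to encode the input and patch sizes: for example, `80x608p16x16` matches the `--input_size 80x608 --patch_size 16x16` flags used in the pre-training commands of this README. Assuming that convention holds for the names listed above, a small helper can recover these values. The helper `parse_weight_name` is hypothetical and not part of the repository.

```python
import re

# Pattern for "-<mels>x<frames>p<patch_freq>x<patch_time>", e.g. "-80x608p16x16".
# This naming convention is inferred from the weight names, not a documented API.
_SIZE_PATTERN = re.compile(r"-(\d+)x(\d+)p(\d+)x(\d+)")

def parse_weight_name(name):
    """Return the input and patch sizes encoded in a weight name, or None if absent."""
    match = _SIZE_PATTERN.search(name)
    if match is None:
        return None
    n_mels, n_frames, patch_freq, patch_time = map(int, match.groups())
    return {"input_size": (n_mels, n_frames), "patch_size": (patch_freq, patch_time)}

info = parse_weight_name("m2d_s_vit_base-80x608p80x2-230220")
print(info)                               # {'input_size': (80, 608), 'patch_size': (80, 2)}
print(parse_weight_name("m2d_x_icbhi"))   # None
```

Names without a size field, such as `m2d_x_icbhi`, return `None`, so callers should handle that case explicitly.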

Application Resources

👉 Application Guide (alpha) is available. Our guidelines may provide useful information on how to plan further pre-training of your models.

👉 Resources for M2D-CLAP (General-purpose Audio-Language Representation).
👉 Resources for M2D-X medical applications (ICBHI2017/SPRSound), further pre-training examples.
👉 Resources for M2D medical application (CirCor DigiScope heart sound).
👉 Resources for M2D-AS (M2D-X specialized in AudioSet).
👉 Resources for M2D-S (M2D-X specialized in Speech).
👉 Resources for M2D on respiratory sound analysis (OPERA benchmark) — Pre-training and evaluation resources for respiratory sounds using M2D, hosted in the EVAR repository (see our Interspeech 2025 paper).
👉 Resources for M2D on music understanding (MARBLE benchmark) — Integration of M2D with the MARBLE music benchmark, hosted in the EVAR repository (covered in the M2D-CLAP 2025 paper).
👉 MSM-MAE pre-training (predecessor to M2D) — Pre-training guide for MSM-MAE, the model that M2D builds upon. Provided for reproducibility and comparison.

(Figure: a schematic illustration of M2D-X further pre-training.)

1. Setup

This repository is based on the code from facebookresearch/mae, and our changes are applied to those files as a patch.

Download external source files and apply a patch.

```sh
git clone https://github.com/nttcslab/m2d.git
cd m2d
curl -o util/lars.py https://raw.githubusercontent.com/facebookresearch/mae/efb2a8062c206524e35e47d04501ed4f544c0ae8/util/lars.py
curl -o util/lr_decay.py https://raw.githubusercontent.com/facebookresearch/mae/efb2a8062c206524e35e47d04501ed4f544c0ae8/util/lr_decay.py
curl -o util/lr_sched.py https://raw.githubusercontent.com/facebookresearch/mae/efb2a8062c206524e35e47d04501ed4f544c0ae8/util/lr_sched.py
curl -o util/misc.py https://raw.githubusercontent.com/facebookresearch/mae/efb2a8062c206524e35e47d04501ed4f544c0ae8/util/misc.py
curl -o util/analyze_repr.py https://raw.githubusercontent.com/daisukelab/general-learning/master/SSL/analyze_repr.py
curl -o m2d/pos_embed.py https://raw.githubusercontent.com/facebookresearch/mae/efb2a8062c206524e35e47d04501ed4f544c0ae8/util/pos_embed.py
curl -o train_audio.py https://raw.githubusercontent.com/facebookresearch/mae/efb2a8062c206524e35e47d04501ed4f544c0ae8/main_pretrain.py
curl -o speech/train_speech.py https://raw.githubusercontent.com/facebookresearch/mae/efb2a8062c206524e35e47d04501ed4f544c0ae8/main_pretrain.py
curl -o audioset/train_as.py https://raw.githubusercontent.com/facebookresearch/mae/efb2a8062c206524e35e47d04501ed4f544c0ae8/main_pretrain.py
curl -o clap/clap_only.py https://raw.githubusercontent.com/facebookresearch/mae/efb2a8062c206524e35e47d04501ed4f544c0ae8/main_pretrain.py
curl -o clap/train_clap.py https://raw.githubusercontent.com/facebookresearch/mae/efb2a8062c206524e35e47d04501ed4f544c0ae8/main_pretrain.py
curl -o mae_train_audio.py https://raw.githubusercontent.com/facebookresearch/mae/efb2a8062c206524e35e47d04501ed4f544c0ae8/main_pretrain.py
curl -o m2d/engine_pretrain_m2d.py https://raw.githubusercontent.com/facebookresearch/mae/efb2a8062c206524e35e47d04501ed4f544c0ae8/engine_pretrain.py
curl -o m2d/models_mae.py https://raw.githubusercontent.com/facebookresearch/mae/efb2a8062c206524e35e47d04501ed4f544c0ae8/models_mae.py
curl -o m2d/timm_layers_pos_embed.py https://raw.githubusercontent.com/huggingface/pytorch-image-models/e9373b1b925b2546706d78d25294de596bad4bfe/timm/layers/pos_embed.py
patch -p1 < patch_m2d.diff
```

Install the external modules listed in requirements.txt.
```sh
pip install -r requirements.txt
```
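
A failed download can leave a file missing or empty, which typically surfaces later as a confusing patch or import error. The following is a minimal sketch for checking the setup, not a script shipped with the repository. The helper name `missing_files` is hypothetical, and the path list simply mirrors the `curl -o` targets in the setup step.

```python
from pathlib import Path

# Output paths of the curl commands in the setup step.
EXPECTED_FILES = [
    "util/lars.py", "util/lr_decay.py", "util/lr_sched.py", "util/misc.py",
    "util/analyze_repr.py", "m2d/pos_embed.py", "train_audio.py",
    "speech/train_speech.py", "audioset/train_as.py", "clap/clap_only.py",
    "clap/train_clap.py", "mae_train_audio.py", "m2d/engine_pretrain_m2d.py",
    "m2d/models_mae.py", "m2d/timm_layers_pos_embed.py",
]

def missing_files(repo_root, expected=EXPECTED_FILES):
    """Return the expected files that are absent or empty under repo_root."""
    root = Path(repo_root)
    return [p for p in expected
            if not (root / p).is_file() or (root / p).stat().st_size == 0]

if __name__ == "__main__":
    # Run from the root of your copy of the repository.
    missing = missing_files(".")
    print("All files present." if not missing else f"Missing: {missing}")
```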

2. Evaluating M2D

We use EVAR for our evaluation.

2-1. Setup EVAR

EVAR is an evaluation package for audio representations, used in our research papers such as BYOL-A.

The following steps set up EVAR.

In the folder of your copy of the M2D repository, clone the EVAR repository and prepare basic items.

```sh
git clone https://github.com/nttcslab/eval-audio-repr.git evar
cd evar
curl https://raw.githubusercontent.com/daisukelab/general-learning/master/MLP/torch_mlp_clf2.py -o evar/utils/torch_mlp_clf2.py
curl https://raw.githubusercontent.com/daisukelab/sound-clf-pytorch/master/for_evar/sampler.py -o evar/sampler.py
curl https://raw.githubusercontent.com/daisukelab/sound-clf-pytorch/master/for_evar/cnn14_decoupled.py -o evar/cnn14_decoupled.py
cd ..
```

Set up downstream task datasets according to Preparing-datasets.md. The following is an example of setting up the CREMA-D dataset.

```sh
cd evar
python evar/utils/download_cremad.py downloads/cremad
python prepare_wav.py downloads/cremad work/16k/cremad 16000
cd ..
```

2-2. Linear Evaluation

Once you set up EVAR, you can evaluate your models as follows.

To evaluate a model at the absolute path /your/path/to/model.pth:

```sh
cd evar
python lineareval.py config/m2d.yaml cremad weight_file=/your/path/to/model.pth
```

To save GPU memory, set a smaller batch size. This example sets it to 16.

```sh
cd evar
python lineareval.py config/m2d.yaml cremad batch_size=16,weight_file=/your/path/to/model.pth
```
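
Option overrides are passed to `lineareval.py` as a single comma-separated `key=value` string, as in the command above. When scripting many runs, that string can be assembled programmatically. The helper below is a hypothetical sketch (`build_lineareval_command` is not part of EVAR) and only reproduces the command-line shape shown in this section.

```python
def build_lineareval_command(task, weight_file, config="config/m2d.yaml", **overrides):
    """Build the argument list for one linear-evaluation run.

    Extra keyword arguments become comma-separated key=value overrides,
    followed by weight_file, matching the examples in this section.
    """
    options = [f"{key}={value}" for key, value in overrides.items()]
    options.append(f"weight_file={weight_file}")
    return ["python", "lineareval.py", config, task, ",".join(options)]

cmd = build_lineareval_command("cremad", "/your/path/to/model.pth", batch_size=16)
print(" ".join(cmd))
# python lineareval.py config/m2d.yaml cremad batch_size=16,weight_file=/your/path/to/model.pth
```

The resulting list can be passed to `subprocess.run(cmd, cwd="evar")` so that the command runs inside the EVAR folder.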

We used the all_eval.sh script to evaluate on all downstream tasks.

2-3. Fine-tuning

We have fine-tuned our models using the scripts in the util folder.

The following examples fine-tune on each downstream task three times with seed 42. Replace /your/path/to/m2d_vit_base-80x608p16x16-221006-mr7 with your actual model path.

```sh
cd evar
bash <path/to/m2d>/util/ft-as2m.sh /your/path/to/m2d_vit_base-80x608p16x16-221006-mr7 3 42 300  # AudioSet 2M
bash <path/to/m2d>/util/ft-as20k.sh /your/path/to/m2d_vit_base-80x608p16x16-221006-mr7 3 42 300  # AudioSet 20K
bash <path/to/m2d>/util/ft-esc50.sh /your/path/to/m2d_vit_base-80x608p16x16-221006-mr7 3 42 300  # ESC-50
bash <path/to/m2d>/util/ft-spc.sh /your/path/to/m2d_vit_base-80x608p16x16-221006-mr7 3 42 300  # Speech Commands
bash <path/to/m2d>/util/ft-vc1.sh /your/path/to/m2d_vit_base-80x608p16x16-221006-mr7 3 42 300  # VoxCeleb1
```

NOTE: Please set your data path in `util/ft-as2m.sh`

The ft-as2m.sh script requires the path to your AudioSet samples, stored as log-mel spectrograms in .npy files. Update the script with your data path before running it.

3. Pre-training From Scratch

3-1. Prepare pre-training data samples

The pre-trainer (e.g., train_audio.py for audio) loads data from the data folder by default (--data_path), using the list of samples in the CSV file data/files_audioset.csv by default (--csv_main).
Follow the steps in data/README.md.

The following is an example using the FSD50K dataset.

Preprocess .wav files into log-mel spectrogram .npy files. The following converts from a source folder /your/local/fsd50k/FSD50K.dev_audio to a new folder data/fsd50k_lms.
```sh
python wav_to_lms.py /your/local/fsd50k/FSD50K.dev_audio data/fsd50k_lms
```
Create a CSV file that will be used as a list of pre-training samples, containing a single column file_name. The following example creates files_f_s_d_5_0_k.csv.
```sh
echo file_name > data/files_f_s_d_5_0_k.csv
(cd data && find fsd50k_lms/FSD50K.dev_audio -name "*.npy") >> data/files_f_s_d_5_0_k.csv
```

Example of created folder structure:

data/
    files_f_s_d_5_0_k.csv
    fsd50k_lms/
        FSD50K.dev_audio/
            2931.npy
            408195.npy
                :
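
The CSV list described above can also be built in Python, which may be more convenient on systems without `find`. This is a sketch under the same assumptions as the shell example (a single `file_name` column, paths relative to the data folder). The helper `write_file_list` is hypothetical and not part of the repository, and unlike `find` it sorts the entries. The demo runs in a temporary folder with two empty stand-in files.

```python
import csv
import tempfile
from pathlib import Path

def write_file_list(data_root, lms_subdir, csv_name):
    """Write a CSV with a single file_name column listing .npy files under lms_subdir.

    Paths are stored relative to data_root, as in the shell example.
    Returns the number of listed files.
    """
    root = Path(data_root)
    files = sorted(p.relative_to(root).as_posix()
                   for p in (root / lms_subdir).rglob("*.npy"))
    with open(root / csv_name, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["file_name"])
        writer.writerows([name] for name in files)
    return len(files)

# Demo in a temporary folder that mimics the structure shown above.
with tempfile.TemporaryDirectory() as tmp:
    lms = Path(tmp) / "fsd50k_lms" / "FSD50K.dev_audio"
    lms.mkdir(parents=True)
    (lms / "2931.npy").touch()
    (lms / "408195.npy").touch()
    count = write_file_list(tmp, "fsd50k_lms/FSD50K.dev_audio", "files.csv")
    listing = (Path(tmp) / "files.csv").read_text().splitlines()

print(count)    # 2
print(listing)
```

For real data, a call such as `write_file_list("data", "fsd50k_lms/FSD50K.dev_audio", "files_f_s_d_5_0_k.csv")` would produce the list used in the example.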

3-2. Start pre-training

Once your data is ready, start pre-training as follows.

```sh
python train_audio.py --csv_main data/files_f_s_d_5_0_k.csv
```

3-3. Evaluation during and after the training

The training loop automatically evaluates the pre-trained model.

During pre-training, train_audio.py runs a script called quick_eval.sh as a sub-process. You can edit quick_eval.sh for your purposes.
When the pre-training is finished, the final evaluation script all_eval.sh is executed.

3-4. Complete pre-training command lines

The command lines for pre-training full-performance models follow:

```sh
# M2D
OMP_NUM_THREADS=1 torchrun --nproc_per_node=4 -m train_audio --input_size 80x608 --patch_size 16x16 --epochs 300 --batch_size 512 --accum_iter 1 --save_freq 50 --seed 3 --model m2d_vit_base --csv_main data/files_audioset.csv --data_path /path/to/your/data --loss_off 0.
# M2D-AS
OMP_NUM_THREADS=1 torchrun --nproc_per_node=4 -m audioset.train_as --input_size 80x608 --patch_size 16x16 --epochs 300 --batch_size 512 --accum_iter 1 --save_freq 50 --seed 3 --data_path /path/to/your/data --loss_off 1.
```

Note: Replace /path/to/your/data with the path to your LMS data directory. Placing data on fast storage (SSD recommended) significantly speeds up training. If --data_path is omitted, the data/ directory at the repository root is used.
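
The commands above launch 4 processes with `--batch_size 512` and `--accum_iter 1`. Assuming the patched trainer keeps the convention of the original MAE `main_pretrain.py`, where `--batch_size` is per GPU and the effective batch size is the product of batch size, accumulation steps, and number of processes, the numbers work out as sketched below. This is an illustration only, not code from the repository, and both function names are hypothetical.

```python
def effective_batch_size(batch_size, accum_iter, world_size):
    """Effective batch size under the assumed MAE convention:
    per-GPU batch size x accumulation steps x number of processes."""
    return batch_size * accum_iter * world_size

def accum_iter_for(target, batch_size, world_size):
    """Accumulation steps needed to reach a target effective batch size."""
    per_step = batch_size * world_size
    if target % per_step != 0:
        raise ValueError("target must be a multiple of batch_size * world_size")
    return target // per_step

print(effective_batch_size(512, 1, 4))  # 2048
print(accum_iter_for(2048, 512, 2))     # 2
```

Under this assumption, training on 2 GPUs instead of 4 would need `--accum_iter 2` to keep the same effective batch size.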

Example logs are available: example_logs.zip.

The details are explained in Guide_app.md.

For other model variants, see also:

M2D-CLAP pre-training — multi-stage training for audio-language representation
MSM-MAE pre-training — predecessor to M2D

4. Other Pre-trained/fine-tuned Weights

All published pre-trained/fine-tuned weights are available on the releases page.

5. License

See LICENSE.pdf for details.

Citations

If you find our M2D or M2D-CLAP useful in your research, please consider citing our papers.

BibTeX
@article{niizumi2025m2d-clap,
    author  = {Niizumi, Daisuke and Takeuchi, Daiki and Yasuda, Masahiro and Nguyen, Binh Thien and Ohishi, Yasunori and Harada, Noboru},
    journal = {IEEE Access}, 
    title   = {{M2D-CLAP: Exploring General-purpose Audio-Language Representations Beyond CLAP}}, 
    year    = {2025},
    volume  = {13},
    pages   = {163313-163330},
    doi={10.1109/ACCESS.2025.3611348}}

@article{niizumi2024m2dx,
    title   = {{Masked Modeling Duo: Towards a Universal Audio Pre-training Framework}},
    author  = {Daisuke Niizumi and Daiki Takeuchi and Yasunori Ohishi and Noboru Harada and Kunio Kashino},
    journal = {IEEE/ACM Trans. Audio, Speech, Language Process.},
    year    = {2024},
    volume  = {32},
    pages   = {2391-2406},
    url     = {https://ieeexplore.ieee.org/document/10502167},
    doi     = {10.1109/TASLP.2024.3389636}}

@inproceedings{niizumi2024m2d-clap,
    title   = {{M2D-CLAP: Masked Modeling Duo Meets CLAP for Learning General-purpose Audio-Language Representation}},
    author  = {Daisuke Niizumi and Daiki Takeuchi and Yasunori Ohishi and Noboru Harada and Masahiro Yasuda and Shunsuke Tsubaki and Keisuke Imoto},
    booktitle={Interspeech},
    year    = {2024},
    pages   = {57--61},
    doi     = {10.21437/Interspeech.2024-29}}

@inproceedings{niizumi2023m2d,
    title   = {{Masked Modeling Duo: Learning Representations by Encouraging Both Networks to Model the Input}},
    author  = {Daisuke Niizumi and Daiki Takeuchi and Yasunori Ohishi and Noboru Harada and Kunio Kashino},
    booktitle={ICASSP}, 
    year    = {2023},
    url     = {https://ieeexplore.ieee.org/document/10097236},
    doi     = {10.1109/ICASSP49357.2023.10097236}}

@inproceedings{niizumi2023m2d4speech,
    title   = {{Masked Modeling Duo for Speech: Specializing General-Purpose Audio Representation to Speech using Denoising Distillation}},
    author  = {Daisuke Niizumi and Daiki Takeuchi and Yasunori Ohishi and Noboru Harada and Kunio Kashino},
    year    = {2023},
    booktitle={Interspeech},
    pages   = {1294--1298},
    doi     = {10.21437/Interspeech.2023-221}}

@inproceedings{niizumi2024embc,
    title   = {{Exploring Pre-trained General-purpose Audio Representations for Heart Murmur Detection}},
    author  = {Niizumi, Daisuke and Takeuchi, Daiki and Ohishi, Yasunori and Harada, Noboru and Kashino, Kunio},
    booktitle={EMBC},
    year    = {2024},
    pages   = {1-4},
    doi     = {10.1109/EMBC53108.2024.10782479}}

Acknowledgements

Our code is based on the MAE PyTorch/GPU re-implementation of the paper Masked Autoencoders Are Scalable Vision Learners.
We use nnAudio (KinWaiCheuk/nnAudio) for converting raw audio into log-mel spectrograms.

We appreciate these publicly available implementations and all the modules our experiments heavily depend on!
