M2d

Masked Modeling Duo: Towards a Universal Audio Pre-training Framework

From nttcslab · Updated June 7, 2026

This repository provides demo implementations of our papers "[Masked Modeling Duo: Towards a Universal Audio Pre-training Framework](https://ieeexplore.ieee.org/document/10502167)", "[M2D-CLAP: Exploring General-purpose Audio-Language Representations Beyond CLAP](https://ieeexplore.ieee.org/document/11168481)", "[Masked Modeling Duo: Learning Representations by Encouraging Both Networks to Model the Input](https://arxiv.org/abs/2210.14648)", and so on. The project is written primarily in Jupyter Notebook, is distributed under a license listed as "Other", and was first published in 2023. Key topics include: audio, masked-autoencoder, masked-modeling-duo, self-supervised-learning.

Latest release: v0.5.0 - M2D-CLAP (2025) Weights
September 18, 2025

<table> <tr> <td align="center" width="50%"> <img src="examples/image-key-visual-m2d.jpg" alt="key_visual_M2D" width="95%"><br> Masked Modeling Duo (M2D) </td> <td align="center" width="50%"> <img src="examples/image-key-vis-m2d-clap.jpg" alt="key_visual_M2D-CLAP" width="73%"><br> M2D-CLAP </td> </tr> </table>

Masked Modeling Duo (M2D) & M2D-CLAP

🌟 Looking for the best general-purpose audio model? M2D-CLAP achieves state-of-the-art performance on audio tagging, zero-shot classification, and audio-language tasks. Try it instantly in Colab.

Quick Start

| Description | Notebook |
|---|---|
| Audio tagging example (M2D) | Open in Colab: examples/Colab_M2D_example_Tagging.ipynb |
| Zero-shot ESC-50 classification with M2D-CLAP | Open in Colab: examples/Colab_M2D-CLAP_ESC-50_ZS.ipynb |
| Audio feature visualization example with M2D-CLAP | Open in Colab: examples/Colab_M2D-CLAP_ESC-50_VizualizeEmbs.ipynb |

The example below uses M2D-CLAP, our recommended model. You can load it and encode audio in just a few lines:

python
# Load model
from examples.portable_m2d import PortableM2D  # portable_m2d: a simple one-file loader
model = PortableM2D('m2d_clap_vit_base-80x1001p16x16p16kpBpTI-2025/checkpoint-30.pth')

# Prepare input (three 10-s waveforms, range [-1., 1.])
import torch
batch_audio = 2 * torch.rand((3, 10 * 16000)) - 1.0

# Encode → frame-level features
frame_level = model(batch_audio)
print(frame_level.shape)  # torch.Size([3, 63, 3840])

# Aggregate → clip-level features
clip_level = torch.mean(frame_level, dim=1)
print(clip_level.shape)  # torch.Size([3, 3840])
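The quick-start input is a batch of waveforms in the range [-1, 1], and `10 * 16000` samples for 10 s implies a 16 kHz sampling rate. As a minimal sketch of preparing real audio in that form, the following standard-library-only helper reads a 16-bit mono PCM WAV file and scales it to floats. The helper name `load_wav_as_floats` is hypothetical and not part of this repository, and resampling is not handled.

```python
import array
import io
import struct
import sys
import wave


def load_wav_as_floats(source, expected_rate=16000):
    """Read a 16-bit mono PCM WAV and return samples scaled to [-1.0, 1.0).

    `source` is a file path or a binary file-like object.
    Hypothetical helper for illustration; not part of the M2D repository.
    """
    with wave.open(source, "rb") as wf:
        if wf.getsampwidth() != 2 or wf.getnchannels() != 1:
            raise ValueError("this sketch only handles 16-bit mono PCM")
        if wf.getframerate() != expected_rate:
            raise ValueError(f"expected {expected_rate} Hz, got {wf.getframerate()} Hz")
        raw = wf.readframes(wf.getnframes())
    samples = array.array("h", raw)  # signed 16-bit integers
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    return [s / 32768.0 for s in samples]


# Usage: build a tiny 16 kHz WAV in memory and load it back.
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    wf.writeframes(struct.pack("<4h", 0, 16384, -32768, 32767))
buf.seek(0)
waveform = load_wav_as_floats(buf)
print(waveform[:3])  # [0.0, 0.5, -1.0]
```

A list like this could then be wrapped with `torch.tensor(waveform).unsqueeze(0)` to form a batch of one before passing it to the model.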

Pre-trained/Fine-tuned Weights

AudioSet pre-trained weights

| Description | Recommendation | Weight | Further pre-training ready | AS2M mAP |
|---|---|---|---|---|
| M2D-CLAP_2025 ⭐ | Recommended. Best for CLAP / audio tagging (AT) / sound event detection (SED). | m2d_clap_vit_base-80x1001p16x16p16kpBpTI-2025 | ✅ | 0.490 |
| M2D-CLAP_2024, additionally fine-tuned on AS2M | 2nd best for AT/SED. (Encoder only) | m2d_clap_vit_base-80x1001p16x16-240128_AS-FT_enconly | N/A | 0.485 |
| M2D-AS fine-tuned on AS2M | 3rd best for AT/SED. (Encoder only) | m2d_as_vit_base-80x1001p16x16-240213_AS-FT_enconly | N/A | 0.485 |
| M2D/0.7 fine-tuned on AS2M | 4th best for AT/SED. (Encoder only) | m2d_vit_base-80x1001p16x16-221006-mr7_as_46ab246d | N/A | 0.479 |
| M2D/0.7 | General-purpose transfer learning and further pre-training. | m2d_vit_base-80x608p16x16-221006-mr7 | ✅ | - |
| M2D/0.7 | General-purpose transfer learning. (Encoder only) | m2d_vit_base-80x608p16x16-221006-mr7_enconly | N/A | - |
| M2D/0.7 | General-purpose transfer learning. (Encoder only) | m2d_vit_base-80x608p16x16-220930-mr7_enconly | N/A | - |
| M2D/0.7 (t.f. 40ms) | General-purpose transfer learning and further pre-training w/ finer time frame. | m2d_vit_base-80x200p16x4-230529 | ✅ | - |
| M2D-X/0.7 (η = 0.3) | The best ICBHI 2017 model in Section IV-E of the TASLP paper. | m2d_x_icbhi | N/A | - |
| M2D/0.6 | General-purpose transfer learning and further pre-training. | m2d_vit_base-80x608p16x16-221006-mr6 | ✅ | - |
| M2D-CLAP_2024 (Older) | General-purpose transfer learning and further pre-training, especially when application data is closer to the AudioSet ontology. | m2d_clap_vit_base-80x608p16x16-240128 | ✅ | - |
| M2D-AS | General-purpose transfer learning and further pre-training, especially when application data is closer to the AudioSet ontology. | m2d_as_vit_base-80x608p16x16-240213 | ✅ | - |
| MSM-MAE/0.75 | Predecessor to M2D; for reproducibility or comparison. | msm_mae_vit_base-80x608p16x16-220924-mr75 | ✅ | - |

| Description | Recommendation | Weight | Further pre-training ready | AS2M mAP |
|---|---|---|---|---|
| M2D-AS fine-tuned on AS2M@32kHz | Best for audio tagging (AT) / sound event detection (SED) at 32 kHz. | m2d_as_vit_base-80x1001p16x16p32k-240413_AS-FT_enconly | N/A | 0.480 |
| M2D-AS@32kHz | General-purpose transfer learning at 32 kHz. (Encoder only) | m2d_as_vit_base-80x608p16x16p32k-240413_enconly | N/A | - |

LibriSpeech pre-trained weights

| Description | Recommendation | Weight | Further pre-training ready | AS2M mAP |
|---|---|---|---|---|
| M2D-S/0.6 6-s input | Speech transfer learning and further pre-training. | m2d_s_vit_base-80x608p80x2-230220 | ✅ | - |
| M2D-S/0.6 5-s input | Speech transfer learning and further pre-training. | m2d_s_vit_base-80x512p80x2-230301 | ✅ | - |
| M2D-S/0.6 4-s input | Speech transfer learning and further pre-training. | m2d_s_vit_base-80x400p80x2-230201 | ✅ | - |
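
The weight names above appear to encode the input and patch sizes in the same form as the `--input_size` and `--patch_size` options used for pre-training (for example, `80x608p16x16`). Assuming that reading of the naming pattern, which is inferred here rather than documented, a small sketch can recover the patch grid from a name. The function `parse_weight_name` is hypothetical.

```python
import re

# Assumed pattern: -<freq bins>x<time frames>p<patch freq>x<patch time>,
# e.g. "-80x608p16x16". Inferred from the weight names, not documented.
_SIZE_RE = re.compile(r"-(\d+)x(\d+)p(\d+)x(\d+)")


def parse_weight_name(name):
    """Return input/patch sizes parsed from a weight name, or None if absent."""
    m = _SIZE_RE.search(name)
    if m is None:
        return None
    n_mels, n_frames, p_freq, p_time = map(int, m.groups())
    return {
        "input_size": (n_mels, n_frames),
        "patch_size": (p_freq, p_time),
        # Whole patches that fit along each axis (floor division).
        "freq_patches": n_mels // p_freq,
        "time_patches": n_frames // p_time,
    }


for name in [
    "m2d_vit_base-80x608p16x16-221006-mr7",
    "m2d_s_vit_base-80x608p80x2-230220",
    "m2d_x_icbhi",  # carries no size information, so the result is None
]:
    print(name, "->", parse_weight_name(name))
```

The floor division only counts whole patches; how the model treats a partial patch at the end of an input is not covered by this sketch.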

Application Resources

👉 Application Guide (alpha) is available. Our guidelines may provide useful information on how to plan further pre-training of your models.

<figure> <a href="app/Guide_app.md"><img src="examples/image-AppGuideChart.png" alt="A guide chart" width="30%"></a> </figure>

A schematic illustration of M2D-X further pre-training:

<figure> <img src="examples/image-M2D-further-PT.svg" alt="A schematic illustration of M2D-X further pre-training" width="40%"> </figure>

1. Setup

The repository is based on the code from facebookresearch/mae; our changes are applied to those files as a patch.

  1. Download external source files and apply a patch.

    sh
    git clone https://github.com/nttcslab/m2d.git
    cd m2d
    curl -o util/lars.py https://raw.githubusercontent.com/facebookresearch/mae/efb2a8062c206524e35e47d04501ed4f544c0ae8/util/lars.py
    curl -o util/lr_decay.py https://raw.githubusercontent.com/facebookresearch/mae/efb2a8062c206524e35e47d04501ed4f544c0ae8/util/lr_decay.py
    curl -o util/lr_sched.py https://raw.githubusercontent.com/facebookresearch/mae/efb2a8062c206524e35e47d04501ed4f544c0ae8/util/lr_sched.py
    curl -o util/misc.py https://raw.githubusercontent.com/facebookresearch/mae/efb2a8062c206524e35e47d04501ed4f544c0ae8/util/misc.py
    curl -o util/analyze_repr.py https://raw.githubusercontent.com/daisukelab/general-learning/master/SSL/analyze_repr.py
    curl -o m2d/pos_embed.py https://raw.githubusercontent.com/facebookresearch/mae/efb2a8062c206524e35e47d04501ed4f544c0ae8/util/pos_embed.py
    curl -o train_audio.py https://raw.githubusercontent.com/facebookresearch/mae/efb2a8062c206524e35e47d04501ed4f544c0ae8/main_pretrain.py
    curl -o speech/train_speech.py https://raw.githubusercontent.com/facebookresearch/mae/efb2a8062c206524e35e47d04501ed4f544c0ae8/main_pretrain.py
    curl -o audioset/train_as.py https://raw.githubusercontent.com/facebookresearch/mae/efb2a8062c206524e35e47d04501ed4f544c0ae8/main_pretrain.py
    curl -o clap/clap_only.py https://raw.githubusercontent.com/facebookresearch/mae/efb2a8062c206524e35e47d04501ed4f544c0ae8/main_pretrain.py
    curl -o clap/train_clap.py https://raw.githubusercontent.com/facebookresearch/mae/efb2a8062c206524e35e47d04501ed4f544c0ae8/main_pretrain.py
    curl -o mae_train_audio.py https://raw.githubusercontent.com/facebookresearch/mae/efb2a8062c206524e35e47d04501ed4f544c0ae8/main_pretrain.py
    curl -o m2d/engine_pretrain_m2d.py https://raw.githubusercontent.com/facebookresearch/mae/efb2a8062c206524e35e47d04501ed4f544c0ae8/engine_pretrain.py
    curl -o m2d/models_mae.py https://raw.githubusercontent.com/facebookresearch/mae/efb2a8062c206524e35e47d04501ed4f544c0ae8/models_mae.py
    curl -o m2d/timm_layers_pos_embed.py https://raw.githubusercontent.com/huggingface/pytorch-image-models/e9373b1b925b2546706d78d25294de596bad4bfe/timm/layers/pos_embed.py
    patch -p1 < patch_m2d.diff
  2. Install the external modules listed in requirements.txt.

    sh
    pip install -r requirements.txt
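
After the download step, it can be useful to confirm that every file the `curl` commands are meant to create is in place before patching or training. The following is a minimal sketch under that assumption; the list mirrors the `-o` destinations in step 1, and the helper name `missing_setup_files` is hypothetical.

```python
from pathlib import Path

# Destinations of the `curl -o` commands in the setup step.
EXPECTED_FILES = [
    "util/lars.py", "util/lr_decay.py", "util/lr_sched.py", "util/misc.py",
    "util/analyze_repr.py", "m2d/pos_embed.py", "train_audio.py",
    "speech/train_speech.py", "audioset/train_as.py", "clap/clap_only.py",
    "clap/train_clap.py", "mae_train_audio.py", "m2d/engine_pretrain_m2d.py",
    "m2d/models_mae.py", "m2d/timm_layers_pos_embed.py",
]


def missing_setup_files(repo_root="."):
    """Return the expected files that are absent or empty under repo_root."""
    root = Path(repo_root)
    return [
        rel for rel in EXPECTED_FILES
        if not (root / rel).is_file() or (root / rel).stat().st_size == 0
    ]


if __name__ == "__main__":
    missing = missing_setup_files(".")
    print("All files present." if not missing else f"Missing: {missing}")
```

Checking for empty files as well as absent ones catches the case where a download failed but `curl` still created the output file.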

2. Evaluating M2D

We use EVAR for our evaluation.

2-1. Setup EVAR

EVAR is an evaluation package for audio representations used by our research papers such as BYOL-A.

The following steps set up EVAR.

  1. In the folder of your copy of the M2D repository, clone the EVAR repository and prepare basic items.

    sh
    git clone https://github.com/nttcslab/eval-audio-repr.git evar
    cd evar
    curl https://raw.githubusercontent.com/daisukelab/general-learning/master/MLP/torch_mlp_clf2.py -o evar/utils/torch_mlp_clf2.py
    curl https://raw.githubusercontent.com/daisukelab/sound-clf-pytorch/master/for_evar/sampler.py -o evar/sampler.py
    curl https://raw.githubusercontent.com/daisukelab/sound-clf-pytorch/master/for_evar/cnn14_decoupled.py -o evar/cnn14_decoupled.py
    cd ..
  2. Set up downstream task datasets according to Preparing-datasets.md. The following is an example of setting up the CREMA-D dataset.

    sh
    cd evar
    python evar/utils/download_cremad.py downloads/cremad
    python prepare_wav.py downloads/cremad work/16k/cremad 16000
    cd ..

2-2. Linear Evaluation

Once you set up EVAR, you can evaluate your models as follows.

  • To evaluate a model at the absolute path /your/path/to/model.pth:

    sh
    cd evar
    python lineareval.py config/m2d.yaml cremad weight_file=/your/path/to/model.pth
  • To save GPU memory, set a smaller batch size as follows. This example sets it to 16.

    sh
    cd evar
    python lineareval.py config/m2d.yaml cremad batch_size=16,weight_file=/your/path/to/model.pth
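
The two commands above differ only in the comma-separated `key=value` overrides passed as the last argument. As a rough sketch of scripting such runs, the following helper assembles that argument list. `build_lineareval_cmd` is a hypothetical name, and only the `batch_size` and `weight_file` keys shown above are known to be accepted.

```python
import shlex


def build_lineareval_cmd(task, weight_file, config="config/m2d.yaml", **overrides):
    """Build the argument list for lineareval.py in the form shown above.

    Extra keyword arguments become `key=value` overrides, joined by commas
    and placed before weight_file, mirroring the batch_size example.
    """
    options = [f"{key}={value}" for key, value in overrides.items()]
    options.append(f"weight_file={weight_file}")
    return ["python", "lineareval.py", config, task, ",".join(options)]


cmd = build_lineareval_cmd("cremad", "/your/path/to/model.pth", batch_size=16)
print(shlex.join(cmd))
# python lineareval.py config/m2d.yaml cremad batch_size=16,weight_file=/your/path/to/model.pth
```

The list can be passed to `subprocess.run(cmd, cwd="evar", check=True)` to run it from the evar folder, matching the `cd evar` step.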

We used the all_eval.sh script to evaluate on all downstream tasks.

2-3. Fine-tuning

We have fine-tuned our models using the scripts in the util folder.

The following examples fine-tune on each downstream task three times with seed 42. Replace /your/path/to/m2d_vit_base-80x608p16x16-221006-mr7 with your actual model path.

sh
cd evar
bash <path/to/m2d>/util/ft-as2m.sh /your/path/to/m2d_vit_base-80x608p16x16-221006-mr7 3 42 300   # AudioSet 2M
bash <path/to/m2d>/util/ft-as0k.sh /your/path/to/m2d_vit_base-80x608p16x16-221006-mr7 3 42 300   # AudioSet 20K
bash <path/to/m2d>/util/ft-esc50.sh /your/path/to/m2d_vit_base-80x608p16x16-221006-mr7 3 42 300  # ESC-50
bash <path/to/m2d>/util/ft-spc.sh /your/path/to/m2d_vit_base-80x608p16x16-221006-mr7 3 42 300    # Speech Commands
bash <path/to/m2d>/util/ft-vc1.sh /your/path/to/m2d_vit_base-80x608p16x16-221006-mr7 3 42 300    # VoxCeleb1

NOTE: ft-as2m.sh requires the path to your AudioSet log-mel spectrogram samples (.npy files). Set your data path in util/ft-as2m.sh before running.

3. Pre-training From Scratch

3-1. Prepare pre-training data samples

The pre-trainer (e.g., train_audio.py for audio) loads data from the data folder by default (--data_path) and reads the list of samples from the CSV file data/files_audioset.csv by default (--csv_main).
Follow the steps in data/README.md to prepare them.

The following is an example using the FSD50K dataset.

  1. Preprocess .wav files into log-mel spectrogram .npy files. The following converts from a source folder /your/local/fsd50k/FSD50K.dev_audio to a new folder data/fsd50k_lms.

    sh
    python wav_to_lms.py /your/local/fsd50k/FSD50K.dev_audio data/fsd50k_lms
  2. Create a CSV file that will be used as a list of pre-training samples, containing a single column file_name. The following example creates files_f_s_d_5_0_k.csv.

    sh
    echo file_name > data/files_f_s_d_5_0_k.csv
    (cd data && find fsd50k_lms/FSD50K.dev_audio -name "*.npy") >> data/files_f_s_d_5_0_k.csv

Example of created folder structure:

data/
    files_f_s_d_5_0_k.csv
    fsd50k_lms/
        FSD50K.dev_audio/
            2931.npy
            408195.npy
                :
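
For environments without `find`, the CSV step can be approximated in Python. This is a sketch only: it assumes, as in the shell example, that the `file_name` column holds paths relative to the data folder, and the helper name `write_sample_list` is hypothetical. Unlike `find`, it sorts the entries so the list is reproducible.

```python
import csv
import tempfile
from pathlib import Path


def write_sample_list(data_root, lms_subdir, csv_name):
    """Write a one-column CSV (file_name) listing .npy files under lms_subdir.

    Paths are stored relative to data_root, like `cd data && find ...`.
    Returns the number of files listed.
    """
    data_root = Path(data_root)
    files = sorted(
        p.relative_to(data_root).as_posix()
        for p in (data_root / lms_subdir).rglob("*.npy")
    )
    with open(data_root / csv_name, "w", newline="") as f:
        writer = csv.writer(f, lineterminator="\n")
        writer.writerow(["file_name"])
        writer.writerows([name] for name in files)
    return len(files)


# Usage on a throwaway folder that mimics the structure shown above.
with tempfile.TemporaryDirectory() as tmp:
    lms = Path(tmp) / "fsd50k_lms" / "FSD50K.dev_audio"
    lms.mkdir(parents=True)
    for stem in ("2931", "408195"):
        (lms / f"{stem}.npy").touch()  # empty stand-ins for real .npy files
    count = write_sample_list(tmp, "fsd50k_lms", "files_f_s_d_5_0_k.csv")
    csv_text = (Path(tmp) / "files_f_s_d_5_0_k.csv").read_text()
    print(count)
    print(csv_text)
```

On a real setup, the call would be `write_sample_list("data", "fsd50k_lms/FSD50K.dev_audio", "files_f_s_d_5_0_k.csv")`.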

3-2. Start pre-training

Once your data is ready, start pre-training as follows.

sh
python train_audio.py --csv_main data/files_f_s_d_5_0_k.csv

3-3. Evaluation during and after the training

The training loop automatically evaluates the pre-trained model.

  • During pre-training, train_audio.py runs a script called quick_eval.sh as a sub-process. You can edit quick_eval.sh for your purposes.
  • When the pre-training is finished, the final evaluation script all_eval.sh is executed.

3-4. Complete pre-training command lines

The command lines for pre-training full-performance models follow:

sh
# M2D
OMP_NUM_THREADS=1 torchrun --nproc_per_node=4 -m train_audio --input_size 80x608 --patch_size 16x16 --epochs 300 --batch_size 512 --accum_iter 1 --save_freq 50 --seed 3 --model m2d_vit_base --csv_main data/files_audioset.csv --data_path /path/to/your/data --loss_off 0.

# M2D-AS
OMP_NUM_THREADS=1 torchrun --nproc_per_node=4 -m audioset.train_as --input_size 80x608 --patch_size 16x16 --epochs 300 --batch_size 512 --accum_iter 1 --save_freq 50 --seed 3 --data_path /path/to/your/data --loss_off 1.

Note: Replace /path/to/your/data with the path to your LMS data directory. Placing data on fast storage (SSD recommended) significantly speeds up training. If --data_path is omitted, the data/ directory at the repository root is used.
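
The `--batch_size` value in the commands above is easier to interpret together with the process count. In `main_pretrain.py` from facebookresearch/mae, from which `train_audio.py` is derived, the batch size is per GPU and the effective batch size is `batch_size * accum_iter * world_size`. Assuming the M2D patch keeps that convention (not verified here), the arithmetic can be sketched as follows; both helper names are hypothetical.

```python
def effective_batch_size(batch_size, accum_iter, n_procs):
    """Effective batch size under the MAE convention (per-GPU batch size)."""
    return batch_size * accum_iter * n_procs


def accum_iter_for(target, batch_size, n_procs):
    """Gradient-accumulation steps needed to reach `target` samples per update."""
    per_step = batch_size * n_procs
    if target % per_step != 0:
        raise ValueError("target must be a multiple of batch_size * n_procs")
    return target // per_step


# Settings from the commands above: --nproc_per_node=4 --batch_size 512 --accum_iter 1
full = effective_batch_size(512, 1, 4)
print(full)  # 2048

# Keeping the same effective size on 2 GPUs with a per-GPU batch of 256:
print(accum_iter_for(full, 256, 2))  # 4
```

Under the same assumption, a smaller machine could trade GPUs for `--accum_iter` steps without changing the number of samples per optimizer update.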

Example logs are available: example_logs.zip.

The details are explained in Guide_app.md.

For other model variants, see also:

4. Other Pre-trained/fine-tuned Weights

All published pre-trained/fine-tuned weights are available on the releases page.

5. License

See LICENSE.pdf for details.

Citations

If you find our M2D or M2D-CLAP useful in your research, please consider citing our papers.

BibTeX
@article{niizumi2025m2d-clap,
    title     = {{M2D-CLAP: Exploring General-purpose Audio-Language Representations Beyond CLAP}},
    author    = {Niizumi, Daisuke and Takeuchi, Daiki and Yasuda, Masahiro and Nguyen, Binh Thien and Ohishi, Yasunori and Harada, Noboru},
    journal   = {IEEE Access},
    year      = {2025},
    volume    = {13},
    pages     = {163313-163330},
    doi       = {10.1109/ACCESS.2025.3611348}}

@article{niizumi2024m2dx,
    title     = {{Masked Modeling Duo: Towards a Universal Audio Pre-training Framework}},
    author    = {Daisuke Niizumi and Daiki Takeuchi and Yasunori Ohishi and Noboru Harada and Kunio Kashino},
    journal   = {IEEE/ACM Trans. Audio, Speech, Language Process.},
    year      = {2024},
    volume    = {32},
    pages     = {2391-2406},
    url       = {https://ieeexplore.ieee.org/document/10502167},
    doi       = {10.1109/TASLP.2024.3389636}}

@inproceedings{niizumi2024m2d-clap,
    title     = {{M2D-CLAP: Masked Modeling Duo Meets CLAP for Learning General-purpose Audio-Language Representation}},
    author    = {Daisuke Niizumi and Daiki Takeuchi and Yasunori Ohishi and Noboru Harada and Masahiro Yasuda and Shunsuke Tsubaki and Keisuke Imoto},
    booktitle = {Interspeech},
    year      = {2024},
    pages     = {57--61},
    doi       = {10.21437/Interspeech.2024-29}}

@inproceedings{niizumi2023m2d,
    title     = {{Masked Modeling Duo: Learning Representations by Encouraging Both Networks to Model the Input}},
    author    = {Daisuke Niizumi and Daiki Takeuchi and Yasunori Ohishi and Noboru Harada and Kunio Kashino},
    booktitle = {ICASSP},
    year      = {2023},
    url       = {https://ieeexplore.ieee.org/document/10097236},
    doi       = {10.1109/ICASSP49357.2023.10097236}}

@inproceedings{niizumi2023m2d4speech,
    title     = {{Masked Modeling Duo for Speech: Specializing General-Purpose Audio Representation to Speech using Denoising Distillation}},
    author    = {Daisuke Niizumi and Daiki Takeuchi and Yasunori Ohishi and Noboru Harada and Kunio Kashino},
    booktitle = {Interspeech},
    year      = {2023},
    pages     = {1294--1298},
    doi       = {10.21437/Interspeech.2023-221}}

@inproceedings{niizumi2024embc,
    title     = {{Exploring Pre-trained General-purpose Audio Representations for Heart Murmur Detection}},
    author    = {Niizumi, Daisuke and Takeuchi, Daiki and Ohishi, Yasunori and Harada, Noboru and Kashino, Kunio},
    booktitle = {EMBC},
    year      = {2024},
    pages     = {1-4},
    doi       = {10.1109/EMBC53108.2024.10782479}}

Acknowledgements

We appreciate these publicly available implementations and all the modules our experiments heavily depend on!


This article is auto-generated from nttcslab/m2d via the GitHub API. Last fetched: 6/19/2026