GitPedia

Msm mae

Masked Spectrogram Modeling using Masked Autoencoders for Learning General-purpose Audio Representations

From nttcslabยทUpdated May 9, 2026ยทView on GitHubยท

> ๐Ÿšจ **Important Notice: A newer, better implementation is available.** ๐Ÿšจ > > ๐ŸŽ‰ The successor to this repository, **[Masked Modeling Duo (M2D)](https://github.com/nttcslab/m2d)**, is now available. > **If you are starting a new project, please use M2D instead of this repository.** > > - ๐Ÿ† **M2D pre-trained weights significantly outperform those provided here.** See the results below. > - ๐Ÿ“ฆ **The M2D repository also distributes MSM-MAE pre-trained weights**, and they significantly outperform ... The project is written primarily in Jupyter Notebook, distributed under the Other license, first published in 2022. Key topics include: audio, deep-learning, masked-autoencoder.

Latest release: v0.0.1โ€” Initial release

key_visual

Masked Spectrogram Modeling using Masked Autoencoders (MSM-MAE)

๐Ÿšจ Important Notice: A newer, better implementation is available. ๐Ÿšจ

๐ŸŽ‰ The successor to this repository, Masked Modeling Duo (M2D), is now available.
If you are starting a new project, please use M2D instead of this repository.

  • ๐Ÿ† M2D pre-trained weights significantly outperform those provided here. See the results below.
  • ๐Ÿ“ฆ The M2D repository also distributes MSM-MAE pre-trained weights, and they significantly outperform the weights in this repositoryโ€”even as MSM-MAE models.
  • ๐Ÿ” M2D supports training MSM-MAE models as well as M2D models.
  • ๐Ÿ—‚๏ธ This repository remains available as the reference implementation accompanying the original MSM-MAE paper.

๐Ÿ‘‰ Go to the M2D repository: https://github.com/nttcslab/m2d ๐Ÿ‘ˆ

๐Ÿค” Why M2D?

The table below compares EVAR benchmark results between MSM-MAE (this repository) and M2D models. M2D consistently and substantially outperforms MSM-MAE across all tasks.

new results

M2D improves upon MSM-MAE in two key ways:

  • ๐Ÿง  Instead of reconstructing masked patches in input space, M2D computes the loss in feature space using a momentum encoderโ€”leading to richer representations.
  • ๐Ÿ“ˆ M2D achieves higher scores across all downstream tasks evaluated in our papers.

For training MSM-MAE-style models, M2D's codebase supports MSM-MAE training as well, so there is no reason to use this older implementation for new work. ๐Ÿ™Œ


About This Repository (Reference / Archive)

This repository is the original implementation of Masked Spectrogram Modeling using Masked Autoencoders (MSM-MAE), a self-supervised learning method for general-purpose audio representation. It includes:

  • Training code that can pre-train models with arbitrary audio files.
  • Evaluation code to test models under two benchmarks, HEAR 2021 and EVAR.
  • Visualization examples and a notebook.
  • Pre-trained weights.

If you find MSM-MAE useful in your research, please use the following BibTeX entry for citation.

BibTeX
@InProceedings{niizumi2022masked, title = {Masked Spectrogram Modeling using Masked Autoencoders for Learning General-purpose Audio Representation}, author = {Niizumi, Daisuke and Takeuchi, Daiki and Ohishi, Yasunori and Harada, Noboru and Kashino, Kunio}, booktitle = {HEAR: Holistic Evaluation of Audio Representations (NeurIPS 2021 Competition)}, pages = {1--24}, year = {2022}, editor = {Turian, Joseph and Schuller, Bjรถrn W. and Herremans, Dorien and Kirchoff, Katrin and Perera, Paola Garcia and Esling, Philippe}, volume = {166}, series = {Proceedings of Machine Learning Research}, month = {13--14 Dec}, publisher = {PMLR}, pdf = {https://proceedings.mlr.press/v166/niizumi22a/niizumi22a.pdf}, url = {https://proceedings.mlr.press/v166/niizumi22a.html} }

History

  • UPDATE (Apr, 2023): Masked Modeling Duo (M2D) is linked to this repository.
  • UPDATE (Jan, 2023): PMLR paper is out. Replaced BibTeX and URL links with the PMLR website. Also fixed a bug.
  • UPDATE (Dec, 2022): Added Vizualization & Audio example notebook. Now we can listen ๐Ÿ‘‚ to how the reconstruction results sound!?
  • UPDATE (Nov, 2022): Extended runtime inference 'encode_lms()' to output features for each layer.

1. Getting Started

The repository relies on the codes from facebookresearch/mae, and we patch our changes on these files.

  1. Download external source files from facebookresearch/mae, and apply a patches.
sh
curl -o util/lars.py https://raw.githubusercontent.com/facebookresearch/mae/6a2ba402291005b003a70e99f7c87d1a2c376b0d/util/lars.py curl -o util/lr_decay.py https://raw.githubusercontent.com/facebookresearch/mae/6a2ba402291005b003a70e99f7c87d1a2c376b0d/util/lr_decay.py curl -o util/lr_sched.py https://raw.githubusercontent.com/facebookresearch/mae/6a2ba402291005b003a70e99f7c87d1a2c376b0d/util/lr_sched.py curl -o util/misc.py https://raw.githubusercontent.com/facebookresearch/mae/6a2ba402291005b003a70e99f7c87d1a2c376b0d/util/misc.py curl -o msm_mae/pos_embed.py https://raw.githubusercontent.com/facebookresearch/mae/6a2ba402291005b003a70e99f7c87d1a2c376b0d/util/pos_embed.py curl -o main_pretrain.py https://raw.githubusercontent.com/facebookresearch/mae/6a2ba402291005b003a70e99f7c87d1a2c376b0d/main_pretrain.py curl -o msm_mae/engine_pretrain.py https://raw.githubusercontent.com/facebookresearch/mae/6a2ba402291005b003a70e99f7c87d1a2c376b0d/engine_pretrain.py curl -o msm_mae/models_mae.py https://raw.githubusercontent.com/facebookresearch/mae/6a2ba402291005b003a70e99f7c87d1a2c376b0d/models_mae.py cd msm_mae patch -p1 < patch_msm_mae.diff cd .. patch -p1 < patch_main.diff
  1. If you need a clean environment, the following anaconda example creates a new environment named ar:
sh
conda create -n ar python==3.8 conda activate ar
  1. Install external modules listed on requirements.txt.

1-1. Quick example

We have a utility runtime model, RuntimeMAE, which helps you to load a pre-trained model and encode your audios.

python
from msm_mae.runtime import RuntimeMAE device = torch.device('cuda') # Prepare your batch of audios. This is a dummy example of three 10s waves. batch_audio = 2 * torch.rand((3, 10 * 16000)) - 1.0 # input range = [-1., 1] batch_audio = batch_audio.to(device) # Create a model with pretrained weights. runtime = RuntimeMAE(weight_file='80x512p16x16_paper/checkpoint-100.pth') runtime = runtime.to(device) # Encode raw audio into features. `encode()` will process automatically: # 1. Convert the input `batch_audio` to log-mel spectrograms (LMS). # 2. Normalize the batch LMS with mean and std calculated from the batch. # 3. Encode to feature. frame_level = runtime.encode(batch_audio) # This example ends up with frame-level 3840-d feature vectors for 63 time frames. # The `frame_level` will have a size of torch.Size([3, 63, 3840]). print(frame_level.shape) # You can get clip-level features by taking average of time franes. # The `clip_level` will have a size of torch.Size([3, 3840]) clip_level = torch.mean(frame_level, dim=1) print(clip_level.shape)

To get the best features, you can normalize your audio with normalization statistics of your entire input data and use them in your pipeline.

python
# Calculate statistics in advance. This is an example with 10 random waves. means, stds = [], [] for _ in range(10): lms = runtime.to_feature(torch.rand((10 * 16000)).to(device)) means.append(lms.mean()) stds.append(lms.std()) dataset_mean, dataset_std = torch.mean(torch.stack(means)), torch.mean(torch.stack(stds)) # These can be numbers [-5.4919195, 5.0389895], for example. # The followings are an example pipeline. # Convert your batch audios into LMS. batch_lms = runtime.to_feature(batch_audio) # Normalize them. batch_lms = (batch_lms - dataset_mean) / (dataset_std + torch.finfo().eps) # Encode them to feame-level features. frame_level = model.encode_lms(batch_lms) # Calculate clip-level features if needed. clip_level = torch.mean(frame_level, dim=1)

To get features per layer, you can add return_layers=True.

python
# Getting features per layer. y = rt.encode_lms(torch.rand(10, 1, 80, 900), return_layers=True) print(len(y), y[0].shape) # --> 12 torch.Size([10, 57, 3840]) # As normal, getting finale-layer features. y = rt.encode_lms(torch.rand(10, 1, 80, 900), return_layers=False) print(y.shape) # --> torch.Size([10, 57, 3840])

2. Evaluating MSM-MAE

2-1. Evaluating on HEAR 2021 NeurIPS Challenge Tasks

We evaluate our models on our paper using hear-eval-kit from on HEAR 2021 NeurIPS Challenge as follows.

NOTE: The folder hear has all the files we need to evaluate models on hear-eval-kit.

  1. Install hear-eval-kit as follows:

    sh
    pip install heareval
  2. Download your copy of downstream dataset for 16 kHz. See HEAR NeurIPS 2021 Datasets@zenodo for the detail.

  3. To evaluate our models, we need a local package which loads and runs our models. The followings shows an example for a model named 80x208p16x16_mymodel.

    sh
    cd hear/hear_msm cp sample.py 80x208p16x16_mymodel.py ** Edit the 80x208p16x16_mymodel.py here, so that the value of `model_path` points to your model with an absolute path. ** cd .. pip install -e .
  4. We are on the folder hear. Run the hear-eval-kit with your pre-trained model.

    sh
    python -m heareval.embeddings.runner hear_msm.80x208p16x16_mymodel --tasks-dir /your_copy/hear/tasks CUBLAS_WORKSPACE_CONFIG=:4096:8 python -m heareval.predictions.runner embeddings/hear_msm.80x208p16x16_mymodel/*

2-2. Evaluating on EVAR

EVAR is an evaluation package for audio representations used by our research papers such as BYOL-A and Composing General Audio Representation by Fusing Multilayer Features of a Pre-trained Model.
It supports the following downstream tasks: ESC-50, US8K, FSD50K, SPCV1/V2, VoxForge, VoxCeleb1, CREMA-D, GTZAN, NSynth instrument family, Pitch Audio Dataset (Surge synthesizer).

The following steps setup MSM-MAE on the EVAR.

  1. Clone EVAR repository and prepare basic items.

    sh
    git clone https://github.com/nttcslab/eval-audio-repr.git evar cd evar curl https://raw.githubusercontent.com/daisukelab/general-learning/master/MLP/torch_mlp_clf2.py -o evar/utils/torch_mlp_clf2.py curl https://raw.githubusercontent.com/daisukelab/sound-clf-pytorch/master/for_evar/sampler.py -o evar/sampler.py curl https://raw.githubusercontent.com/daisukelab/sound-clf-pytorch/master/for_evar/cnn14_decoupled.py -o evar/cnn14_decoupled.py cd ..
  2. Install MSM-MAE files on the cloned evar folder.

    sh
    ln -s ../../to_evar/ar_msm_mae.py evar/evar ln -s ../../to_evar/msm_mae.yaml evar/config cd evar sed 's/import evar\.ar_byola/import evar\.ar_byola, evar\.ar_msm_mae/' -i lineareval.py cd ..
  3. Setup downstream task datasets according to Preparing-datasets.md. The following is an example for setting up CREMA-D dataset.

    sh
    cd evar python evar/utils/download_cremad.py downloads/cremad python prepare_wav.py downloads/cremad work/16k/cremad 16000 cd ..

Once setup is complete, you can evaluate your models as follows.

  • For evaluating a model with an absolute path /your/path/to/model.pth.

    sh
    cd evar python lineareval.py config/msm_mae.yaml cremad weight_file=/your/path/to/model.pth
  • If you want to save GPU memory, set a fewer batch size as follows. This example sets it as 16.

    sh
    cd evar python lineareval.py config/msm_mae.yaml cremad batch_size=16,weight_file=/your/path/to/model.pth

3. Training From Scratch

First, you need to prepare pre-training data samples; then, you can pre-train your model.
The following is an example to pre-train a model on FSD50K dataset.

  1. Preprocess .wav files into log-mel spectrogram .npy files. The following converts from a source folder /your/local/fsd50k/FSD50K.dev_audio to a new folder lms_fsd50kdev.

    sh
    python wav_to_lms.py /your/local/fsd50k/FSD50K.dev_audio lms_fsd50kdev
  2. Create a CSV file which is used as a list of pre-training samples containing a single column file_name. The following example creates trainingfiles.csv.

    sh
    echo file_name > trainingfiles.csv (cd lms_fsd50kdev && find . -name "*.npy") >> trainingfiles.csv
  3. Make some copies of samples for visualization. The following example makes 3 symbolic links. All the samples under the folder vis_samples will be used to visualize reconstruction results of a checkpoint during training.

    sh
    mkdir -p lms_fsd50kdev/vis_samples ln -s ../1240.npy lms_fsd50kdev/vis_samples/1240.npy ln -s ../106237.npy lms_fsd50kdev/vis_samples/106237.npy ln -s ../119949.npy lms_fsd50kdev/vis_samples/119949.npy
  4. Once your data is ready, start pre-training as follows.

    sh
    python main_pretrain.py 80x208p16x16_your_test --input_size 80x208 --data_path lms_fsd50kdev --dataset trainingfiles --model=mae_vit_base_patch16x16

3-1. About run folders

The training loop creates a folder called the run folder to store artifacts of your training run, such as a log, visualizations, and checkpoints.
In the example above, the 80x208p16x16_your_test is the run folder name which has a format below:

<input_size>p<patch_size>_<your_free_string>
 - input_size: Two integers concatenated with `x`.
 - patch_size: Two integers concatenated with `x`.
 - your_free_string: Optional.

NOTE: When a model is evaluated by EVAR or hear-eval-kit, the input and patch size parameters written in the run name are used.

3-2. Evalidation during training

The training loop takes two actions for evaluating checkpoints during training: visualization and calling an external command quick_eval.sh (EVAR by default).

  • While training, main_pretrain.py will call a script named quick_eval.sh for every 20 epochs by default. You can edit quick_eval.sh for your purpose.

  • The training loop also visualizes reconstruction results using checkpoints. You can find them under the sub-folder of the run folder, such as 80x208p16x16_tttt/19. The folder contains raw results in .npy and image files in .png as the following examples.

    <table><tbody><tr> <td><img src="misc/recon_126AbihZt28.png" alt="recon_126AbihZt28" width="250"/></td> <td><img src="misc/recon_A1TNpj9hHW0.png" alt="recon_A1TNpj9hHW0" width="250"/></td> <td><img src="misc/recon_v1Xn9GONUDE.png" alt="recon_v1Xn9GONUDE" width="250"/></td> </tr></tbody></table>

4. Visualization & Audio Examples

๐Ÿ‘‰ Visualization example notebook and Vizualization & Audio example notebook are available. ๐Ÿ‘ˆ

The Visualization example notebook shows how to visualize reconstruction results as well as attention maps.

  • Download AudioSetWav16k_examples.zip that contains example wave samples from the releases and unzip the zip file in the misc folder beforehand.

Here are reconstruction examples:

recon512

Here are attention map examples:

attn512

In addition, Vizualization & Audio example notebook shows how we can invert these log-mel spectrograms to audios using librosa.feature.inverse.mel_to_audio. Two examples from the notebook follow:

5. Pre-trained Weights and Network Structure Details

Three pre-trained weights are published on the releases,
and the followings are their EVAR task results:

weightesc50us8kspcv2vc1voxforgecremadgtzannsynthsurge
80x512p16x16_paper89.1%84.3%93.1%56.1%97.5%70.1%79.3%76.9%37.5%
80x512p16x16_042588.9%84.0%92.4%56.5%96.8%67.6%81.4%76.1%37.4%
80x208p16x16_042287.1%82.2%91.1%49.7%95.6%66.9%75.9%76.5%37.8%
  • 80x512p16x16_paper: This weight is the pre-trained weight used to calculate the numbers in the paper, trained on our old unpolished code. Cannot be used for visualizations due to a minor difference in the decoder structure.
  • 80x512p16x16_0425: This weight is a pre-trained weight that is trained with the current code to check reproducibility.
  • 80x208p16x16_0422: This weight is a pre-trained weight that is trained with the current code to check reproducibility.

FYI: You can check a notebook that summarizes EVAR task results.

5-1. Network Structure Details

Our ViT network implementation has three minor differences from the original MAE implementation, due to historical reason. (We have already been testing based on pengzhiliang/MAE-pytorch in late 2021, before the MAE implementation became available.)

  • Our implementation follows BEiT, where the k bias is always zero in the Attention calculation. Though it is explained to be equivalent in terms of calculation results, we keep this implementation in our network for better reproducibility.
  • [CLS] token is removable by the option --no_cls_token.
  • The decoder uses 1d positional embedding. You can switch to the 2d version by the option --dec_pos_2d.

6. License

See LICENSE for details.

Acknowledgements

We appreciate these publicly available implementations and all the modules our experiments heavily depend on!

References

Contributors

Showing top 2 contributors by commit count.

View all contributors on GitHub โ†’

This article is auto-generated from nttcslab/msm-mae via the GitHub API.Last fetched: 6/19/2026