
<div align="center">  <h2><a href="https://arxiv.org/abs/2211.03295">MogaNet: Multi-order Gated Aggregation Network (ICLR 2024)</a></h2>

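Siyuan Li<sup>*,1,2</sup>, Zedong Wang<sup>*,1</sup>, Zicheng Liu<sup>1,2</sup>, Cheng Tan<sup>1,2</sup>, Haitao Lin<sup>1,2</sup>, Di Wu<sup>1,2</sup>, Zhiyuan Chen<sup>1</sup>, Jiangbin Zheng<sup>1,2</sup>, Stan Z. Li<sup>†,1</sup>

<sup>1</sup>Westlake University, <sup>2</sup>Zhejiang University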

</div> <a href="https://arxiv.org/abs/2211.03295" alt="arXiv"> <img src="https://img.shields.io/badge/arXiv-2211.03295-b31b1b.svg?style=flat" /></a> <a href="https://github.com/Westlake-AI/MogaNet/blob/main/LICENSE" alt="license"> <img src="https://img.shields.io/badge/license-Apache--2.0-%23B7A800" /></a> <a href="https://colab.research.google.com/github/Westlake-AI/MogaNet/blob/main/demo.ipynb" alt="Colab"> <img src="https://colab.research.google.com/assets/colab-badge.svg" /></a> <a href="https://huggingface.co/MogaNet" alt="Huggingface"> <img src="https://img.shields.io/badge/huggingface-MogaNet-blueviolet" /></a> <img src="https://user-images.githubusercontent.com/44519745/202308950-00708e25-9ac7-48f0-af12-224d927ac1ae.jpg" width=100% height=100% class="center">

We propose MogaNet, a new family of efficient ConvNets designed through the lens of multi-order game-theoretic interaction, to pursue informative context mining with favorable complexity-performance trade-offs. It scales well and attains competitive results among state-of-the-art models with more efficient use of model parameters on ImageNet and a range of standard vision benchmarks, including COCO object detection, ADE20K semantic segmentation, 2D and 3D human pose estimation, and video prediction.

This repository contains the PyTorch implementation of MogaNet (ICLR 2024).

<details> <summary>Table of Contents</summary> <ol> <li><a href="#catalog">Catalog</a></li> <li><a href="#image-classification">Image Classification</a></li> <li><a href="#license">License</a></li> <li><a href="#acknowledgement">Acknowledgement</a></li> <li><a href="#citation">Citation</a></li> </ol> </details>

Catalog

We plan to release implementations of MogaNet in a few months; please watch this repository for the latest release. Currently, this repo is a reimplementation based on our official implementations in OpenMixup, and we are still cleaning up the experimental results and code. Models are released on GitHub / Baidu Cloud / Hugging Face.

ImageNet-1K Training and Validation Code with timm [code] [models] [Hugging Face 🤗]
ImageNet-1K Training and Validation Code in OpenMixup / MMPretrain (TODO)
Downstream Transfer to Object Detection and Instance Segmentation on COCO [code] [models] [demo]
Downstream Transfer to Semantic Segmentation on ADE20K [code] [models] [demo]
Downstream Transfer to 2D Human Pose Estimation on COCO [code] (baselines supported) [models] [demo]
Downstream Transfer to 3D Human Pose Estimation (baselines supported) [code] [models]
Downstream Transfer to Video Prediction on MMNIST Variants [code] (baselines supported)
Image Classification on Google Colab and Notebook Demo [demo]

Image Classification

1. Installation

Please check INSTALL.md for installation instructions.

2. Training and Validation

See TRAINING.md for ImageNet-1K training and validation instructions, or refer to our OpenMixup implementations. We have released the models pre-trained with OpenMixup in moganet-in1k-weights. We have also reproduced the ImageNet results with this repo and released args.yaml / summary.csv / model.pth.tar in moganet-in1k-weights. The weights of a trained model can be extracted from its checkpoint with code.
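As an illustration of what extracting the parameters can look like, here is a minimal sketch. It assumes the checkpoint is a timm-style dictionary that stores the weights under `state_dict` (and optionally `state_dict_ema`); the function name `extract_weights` is hypothetical and is not the repository's own script.

```python
from collections import OrderedDict


def extract_weights(checkpoint, prefer_ema=False):
    """Return only the model weights from a training checkpoint dict.

    Assumption: weights live under "state_dict" (and optionally
    "state_dict_ema"). If neither key exists, the object is treated
    as a bare state dict.
    """
    use_ema = prefer_ema and "state_dict_ema" in checkpoint
    key = "state_dict_ema" if use_ema else "state_dict"
    state_dict = checkpoint.get(key, checkpoint)

    cleaned = OrderedDict()
    for name, tensor in state_dict.items():
        # Strip the "module." prefix added by DataParallel / DDP wrappers.
        if name.startswith("module."):
            name = name[len("module."):]
        cleaned[name] = tensor
    return cleaned
```

With PyTorch installed, the input would typically come from `torch.load("model.pth.tar", map_location="cpu")`, and the result can be written back with `torch.save`. Recent PyTorch versions may require `weights_only=False` for checkpoints that also store non-tensor objects; only do that for files you trust.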

Here is a notebook demo of MogaNet that runs through the steps to perform inference with MogaNet for image classification.

3. ImageNet-1K Trained Models

Model	Resolution	Params (M)	FLOPs (G)	Top-1 | Top-5 (%)	Script	Download
MogaNet-XT	224x224	2.97	0.80	76.5 | 93.4	args | script	model | log
MogaNet-XT	256x256	2.97	1.04	77.2 | 93.8	args | script	model | log
MogaNet-T	224x224	5.20	1.10	79.0 | 94.6	args | script	model | log
MogaNet-T	256x256	5.20	1.44	79.6 | 94.9	args | script	model | log
MogaNet-T*	256x256	5.20	1.44	80.0 | 95.0	config | script	model | log
MogaNet-S	224x224	25.3	4.97	83.4 | 96.9	args | script	model | log
MogaNet-B	224x224	43.9	9.93	84.3 | 97.0	args | script	model | log
MogaNet-L	224x224	82.5	15.9	84.7 | 97.1	args | script	model | log
MogaNet-XL	224x224	180.8	34.5	85.1 | 97.4	args | script	model | log
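For readers unfamiliar with the accuracy columns, the following is a minimal sketch of how top-k accuracy is commonly defined: a sample counts as correct if its true label is among the k classes with the highest scores. The function name and the toy scores are illustrative assumptions, not the evaluation code used for the table.

```python
def topk_accuracy(scores, labels, k=1):
    """Percentage of samples whose true label is among the k highest scores.

    scores: list of per-class score lists, one list per sample.
    labels: list of integer class indices, one per sample.
    """
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(labels)


# Toy example with 3 samples and 3 classes (k=2 stands in for top-5).
toy_scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
toy_labels = [1, 1, 0]
top1 = topk_accuracy(toy_scores, toy_labels, k=1)
top2 = topk_accuracy(toy_scores, toy_labels, k=2)
```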

4. Analysis Tools

(1) The code to count MACs of MogaNet variants.

python get_flops.py --model moganet_tiny
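To make the MACs figure concrete, here is a minimal sketch of the standard multiply-accumulate count for a single 2D convolution layer. It is a textbook formula under the stated assumptions (bias ignored, output size given directly) and is not taken from `get_flops.py`; the function name is hypothetical.

```python
def conv2d_macs(c_in, c_out, kernel_size, out_h, out_w, groups=1):
    """Multiply-accumulate operations of one Conv2d layer (bias ignored).

    Each output element needs (c_in / groups) * kh * kw multiply-adds,
    and there are c_out * out_h * out_w output elements.
    """
    kh, kw = kernel_size
    return out_h * out_w * c_out * (c_in // groups) * kh * kw


# A dense 3x3 convolution versus its depthwise counterpart on a 56x56 map.
dense = conv2d_macs(64, 64, (3, 3), 56, 56)
depthwise = conv2d_macs(64, 64, (3, 3), 56, 56, groups=64)
```

The depthwise variant needs 64 times fewer MACs here, which is why depthwise convolutions are common in efficient ConvNets.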

(2) The code to visualize Grad-CAM activation maps (or variants of Grad-CAM) of MogaNet and other popular architectures.

python cam_image.py --use_cuda --image_path /path/to/image.JPEG --model moganet_tiny --method gradcam
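As background for the `gradcam` method, this is a minimal pure-Python sketch of the basic Grad-CAM computation: each feature channel is weighted by the spatial mean of its gradient, the weighted channels are summed, and negative values are clipped. It assumes the activations and gradients of the target layer have already been collected as nested lists; it does not reproduce `cam_image.py`.

```python
def grad_cam(activations, gradients):
    """Basic Grad-CAM map from one layer's activations and gradients.

    Both inputs are nested lists shaped [channels][height][width].
    Returns a [height][width] map scaled to [0, 1] when it is non-zero.
    """
    # Channel weights: global average of the gradients over space.
    weights = [
        sum(sum(row) for row in grad) / (len(grad) * len(grad[0]))
        for grad in gradients
    ]
    height, width = len(activations[0]), len(activations[0][0])
    # Weighted sum over channels followed by ReLU.
    cam = [
        [
            max(0.0, sum(w * act[i][j] for w, act in zip(weights, activations)))
            for j in range(width)
        ]
        for i in range(height)
    ]
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam
```

In practice the map is then upsampled to the input resolution and overlaid on the image.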

(<a href="#top">back to top</a>)

5. Downstream Tasks

<details> <summary>Object Detection and Instance Segmentation on COCO</summary> <li><a href="https://github.com/Westlake-AI/MogaNet/tree/main/detection">MogaNet + Mask R-CNN</a></li>

Method	Backbone	Pretrain	Params	FLOPs	Lr schd	box mAP	mask mAP	Config	Download
Mask R-CNN	MogaNet-XT	ImageNet-1K	22.8M	185.4G	1x	40.7	37.6	config	log / model
Mask R-CNN	MogaNet-T	ImageNet-1K	25.0M	191.7G	1x	42.6	39.1	config	log / model
Mask R-CNN	MogaNet-S	ImageNet-1K	45.0M	271.6G	1x	46.6	42.2	config	log / model
Mask R-CNN	MogaNet-B	ImageNet-1K	63.4M	373.1G	1x	49.0	43.8	config	log / model
Mask R-CNN	MogaNet-L	ImageNet-1K	102.1M	495.3G	1x	49.4	44.2	config	log / model
Mask R-CNN	MogaNet-T	ImageNet-1K	25.0M	191.7G	MS 3x	45.3	40.7	config	log / model
Mask R-CNN	MogaNet-S	ImageNet-1K	45.0M	271.6G	MS 3x	48.5	43.1	config	log / model
Mask R-CNN	MogaNet-B	ImageNet-1K	63.4M	373.1G	MS 3x	50.3	44.4	config	log / model
Mask R-CNN	MogaNet-L	ImageNet-1K	102.1M	495.3G	MS 3x	50.6	44.6	config	log / model

<li><a href="https://github.com/Westlake-AI/MogaNet/tree/main/detection">MogaNet + RetinaNet</a></li>

Method	Backbone	Pretrain	Params	FLOPs	Lr schd	box mAP	Config	Download
RetinaNet	MogaNet-XT	ImageNet-1K	12.1M	167.2G	1x	39.7	config	log / model
RetinaNet	MogaNet-T	ImageNet-1K	14.4M	173.4G	1x	41.4	config	log / model
RetinaNet	MogaNet-S	ImageNet-1K	35.1M	253.0G	1x	45.8	config	log / model
RetinaNet	MogaNet-B	ImageNet-1K	53.5M	354.5G	1x	47.7	config	log / model
RetinaNet	MogaNet-L	ImageNet-1K	92.4M	476.8G	1x	48.7	config	log / model

<li><a href="https://github.com/Westlake-AI/MogaNet/tree/main/detection">MogaNet + Cascade Mask R-CNN</a></li>

Method	Backbone	Pretrain	Params	FLOPs	Lr schd	box mAP	mask mAP	Config	Download
Cascade Mask R-CNN	MogaNet-S	ImageNet-1K	77.9M	405.4G	MS 3x	51.4	44.9	config	log / model
Cascade Mask R-CNN	MogaNet-S	ImageNet-1K	82.8M	750.2G	GIOU+MS 3x	51.7	45.1	config	log / model
Cascade Mask R-CNN	MogaNet-B	ImageNet-1K	101.2M	851.6G	GIOU+MS 3x	52.6	46.0	config	log / model
Cascade Mask R-CNN	MogaNet-L	ImageNet-1K	139.9M	973.8G	GIOU+MS 3x	53.3	46.1	config	-

</details> <details> <summary>Semantic Segmentation on ADE20K</summary> <li><a href="https://github.com/Westlake-AI/MogaNet/tree/main/segmentation">MogaNet + Semantic FPN</a></li>

Method	Backbone	Pretrain	Params	FLOPs	Iters	mIoU	mAcc	Config	Download
Semantic FPN	MogaNet-XT	ImageNet-1K	6.9M	101.4G	80K	40.3	52.4	config	log / model
Semantic FPN	MogaNet-T	ImageNet-1K	9.1M	107.8G	80K	43.1	55.4	config	log / model
Semantic FPN	MogaNet-S	ImageNet-1K	29.1M	189.7G	80K	47.7	59.8	config	log / model
Semantic FPN	MogaNet-B	ImageNet-1K	47.5M	293.6G	80K	49.3	61.6	config	log / model
Semantic FPN	MogaNet-L	ImageNet-1K	86.2M	418.7G	80K	50.2	63.0	config	log / model

<li><a href="https://github.com/Westlake-AI/MogaNet/tree/main/segmentation">MogaNet + UperNet</a></li>

Method	Backbone	Pretrain	Params	FLOPs	Iters	mIoU	mAcc	Config	Download
UperNet	MogaNet-XT	ImageNet-1K	30.4M	855.7G	160K	42.2	55.1	config	log / model
UperNet	MogaNet-T	ImageNet-1K	33.1M	862.4G	160K	43.7	57.1	config	log / model
UperNet	MogaNet-S	ImageNet-1K	55.3M	946.4G	160K	49.2	61.6	config	log / model
UperNet	MogaNet-B	ImageNet-1K	73.7M	1050.4G	160K	50.1	63.4	config	log / model
UperNet	MogaNet-L	ImageNet-1K	113.2M	1176.1G	160K	50.9	63.5	config	log / model
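For reference, the mIoU and mAcc columns can be read with the following minimal sketch, which computes both from a class confusion matrix. It uses the usual definitions (per-class IoU = TP / (TP + FP + FN), per-class accuracy = TP / (TP + FN), each averaged over classes). Skipping classes without ground-truth pixels is an assumption of this sketch; evaluation toolkits may handle that case differently.

```python
def miou_macc(confusion):
    """Mean IoU and mean class accuracy (both in percent).

    confusion[i][j] is the number of pixels whose ground-truth class is i
    and whose predicted class is j.
    """
    num_classes = len(confusion)
    ious, accs = [], []
    for c in range(num_classes):
        tp = confusion[c][c]
        gt_total = sum(confusion[c])                                   # TP + FN
        pred_total = sum(confusion[r][c] for r in range(num_classes))  # TP + FP
        if gt_total == 0:
            continue  # sketch assumption: skip classes absent from the ground truth
        union = gt_total + pred_total - tp
        ious.append(tp / union)
        accs.append(tp / gt_total)
    return 100.0 * sum(ious) / len(ious), 100.0 * sum(accs) / len(accs)


# Two-class toy example.
miou, macc = miou_macc([[3, 1], [1, 5]])
```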

</details> <details> <summary>2D Human Pose Estimation on COCO</summary> <li><a href="https://github.com/Westlake-AI/MogaNet/tree/main/pose_estimation">MogaNet + Top-Down</a></li>

Backbone	Input Size	Params	FLOPs	AP	AP<sup>50</sup>	AP<sup>75</sup>	AR	AR<sup>M</sup>	AR<sup>L</sup>	Config	Download
MogaNet-XT	256x192	5.6M	1.8G	72.1	89.7	80.1	77.7	73.6	83.6	config	log | model
MogaNet-XT	384x288	5.6M	4.2G	74.7	90.1	81.3	79.9	75.9	85.9	config	log | model
MogaNet-T	256x192	8.1M	2.2G	73.2	90.1	81.0	78.8	74.9	84.4	config	log | model
MogaNet-T	384x288	8.1M	4.9G	75.7	90.6	82.6	80.9	76.8	86.7	config	log | model
MogaNet-S	256x192	29.0M	6.0G	74.9	90.7	82.8	80.1	75.7	86.3	config	log | model
MogaNet-S	384x288	29.0M	13.5G	76.4	91.0	83.3	81.4	77.1	87.7	config	log | model
MogaNet-B	256x192	47.4M	10.9G	75.3	90.9	83.3	80.7	76.4	87.1	config	log | model
MogaNet-B	384x288	47.4M	24.4G	77.3	91.4	84.0	82.2	77.9	88.5	config	log | model

</details> <details> <summary>Video Prediction on Moving MNIST</summary>

Architecture	Setting	Params	FLOPs	FPS	MSE	MAE	SSIM	PSNR	Download
IncepU (SimVPv1)	200 epoch	58.0M	19.4G	209	32.15	89.05	0.9268	21.84	model | log
gSTA (SimVPv2)	200 epoch	46.8M	16.5G	282	26.69	77.19	0.9402	22.78	model | log
ViT	200 epoch	46.1M	16.9G	290	35.15	95.87	0.9139	21.67	model | log
Swin Transformer	200 epoch	46.1M	16.4G	294	29.70	84.05	0.9331	22.22	model | log
Uniformer	200 epoch	44.8M	16.5G	296	30.38	85.87	0.9308	22.13	model | log
MLP-Mixer	200 epoch	38.2M	14.7G	334	29.52	83.36	0.9338	22.22	model | log
ConvMixer	200 epoch	3.9M	5.5G	658	32.09	88.93	0.9259	21.93	model | log
Poolformer	200 epoch	37.1M	14.1G	341	31.79	88.48	0.9271	22.03	model | log
ConvNeXt	200 epoch	37.3M	14.1G	344	26.94	77.23	0.9397	22.74	model | log
VAN	200 epoch	44.5M	16.0G	288	26.10	76.11	0.9417	22.89	model | log
HorNet	200 epoch	45.7M	16.3G	287	29.64	83.26	0.9331	22.26	model | log
MogaNet	200 epoch	46.8M	16.5G	255	25.57	75.19	0.9429	22.99	model | log
IncepU (SimVPv1)	2000 epoch	58.0M	19.4G	209	21.15	64.15	0.9536	23.99	model | log
gSTA (SimVPv2)	2000 epoch	46.8M	16.5G	282	15.05	49.80	0.9675	25.97	model | log
ViT	2000 epoch	46.1M	16.9G	290	19.74	61.65	0.9539	24.59	model | log
Swin Transformer	2000 epoch	46.1M	16.4G	294	19.11	59.84	0.9584	24.53	model | log
Uniformer	2000 epoch	44.8M	16.5G	296	18.01	57.52	0.9609	24.92	model | log
MLP-Mixer	2000 epoch	38.2M	14.7G	334	18.85	59.86	0.9589	24.58	model | log
ConvMixer	2000 epoch	3.9M	5.5G	658	22.30	67.37	0.9507	23.73	model | log
Poolformer	2000 epoch	37.1M	14.1G	341	20.96	64.31	0.9539	24.15	model | log
ConvNeXt	2000 epoch	37.3M	14.1G	344	17.58	55.76	0.9617	25.06	model | log
VAN	2000 epoch	44.5M	16.0G	288	16.21	53.57	0.9646	25.49	model | log
HorNet	2000 epoch	45.7M	16.3G	287	17.40	55.70	0.9624	25.14	model | log
MogaNet	2000 epoch	46.8M	16.5G	255	15.67	51.84	0.9661	25.70	model | log

</details> <details> <summary>Video Prediction on Moving FMNIST</summary>

Architecture	Setting	Params	FLOPs	FPS	MSE	MAE	SSIM	PSNR	Download
IncepU (SimVPv1)	200 epoch	58.0M	19.4G	209	30.77	113.94	0.8740	21.81	model | log
gSTA (SimVPv2)	200 epoch	46.8M	16.5G	282	25.86	101.22	0.8933	22.61	model | log
ViT	200 epoch	46.1M	16.9G	290	31.05	115.59	0.8712	21.83	model | log
Swin Transformer	200 epoch	46.1M	16.4G	294	28.66	108.93	0.8815	22.08	model | log
Uniformer	200 epoch	44.8M	16.5G	296	29.56	111.72	0.8779	21.97	model | log
MLP-Mixer	200 epoch	38.2M	14.7G	334	28.83	109.51	0.8803	22.01	model | log
ConvMixer	200 epoch	3.9M	5.5G	658	31.21	115.74	0.8709	21.71	model | log
Poolformer	200 epoch	37.1M	14.1G	341	30.02	113.07	0.8750	21.95	model | log
ConvNeXt	200 epoch	37.3M	14.1G	344	26.41	102.56	0.8908	22.49	model | log
VAN	200 epoch	44.5M	16.0G	288	31.39	116.28	0.8703	22.82	model | log
HorNet	200 epoch	45.7M	16.3G	287	29.19	110.17	0.8796	22.03	model | log
MogaNet	200 epoch	46.8M	16.5G	255	25.14	99.69	0.8960	22.73	model | log

</details>

License

This project is released under the Apache 2.0 license.

Acknowledgement

Our implementation is mainly based on the following codebases. We sincerely thank the authors for their excellent work.

pytorch-image-models (timm): PyTorch image models, scripts, pretrained weights.
PoolFormer: Official PyTorch implementation of MetaFormer.
ConvNeXt: Official PyTorch implementation of ConvNeXt.
OpenMixup: Open-source toolbox for visual representation learning.
MMDetection: OpenMMLab Detection Toolbox and Benchmark.
MMSegmentation: OpenMMLab Semantic Segmentation Toolbox and Benchmark.
MMPose: OpenMMLab Pose Estimation Toolbox and Benchmark.
MMHuman3D: OpenMMLab 3D Human Parametric Model Toolbox and Benchmark.
OpenSTL: A Comprehensive Benchmark of Spatio-Temporal Predictive Learning.

Citation

If you find this repository helpful, please consider citing:

@inproceedings{iclr2024MogaNet,
  title={MogaNet: Multi-order Gated Aggregation Network},
  author={Siyuan Li and Zedong Wang and Zicheng Liu and Cheng Tan and Haitao Lin and Di Wu and Zhiyuan Chen and Jiangbin Zheng and Stan Z. Li},
  booktitle={International Conference on Learning Representations},
  year={2024}
}
