<div align="center"> <h1>More Than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models</h1>

<a href="https://github.com/HongkLin/" target="_blank" rel="noopener noreferrer">Hongkai Lin</a>,
<a href="https://dk-liang.github.io/" target="_blank" rel="noopener noreferrer">Dingkang Liang</a>,
Mingyang Du,
<a href="https://lmd0311.github.io/" target="_blank" rel="noopener noreferrer">Xin Zhou</a>,
<a href="https://scholar.google.com/citations?user=UeltiQ4AAAAJ&hl=en" target="_blank" rel="noopener noreferrer">Xiang Bai</a><sup>†</sup>

Huazhong University of Science and Technology

($\dagger$) Corresponding author.

</div>

We present MERGE, a simple unified diffusion model for image generation and depth estimation. Its core is a set of streamlined converters that tap the rich visual priors stored in generative image models. The generative image model stays fixed; only the pluggable converters are fine-tuned on synthetic data, which adds strong zero-shot depth estimation capability.

📢 News

[21/Oct/2025] The training and inference code is now available!
[18/Sep/2025] MERGE is accepted to NeurIPS 2025! 🥳🥳🥳

🛠️ Setup

This installation was tested on: Ubuntu 20.04 LTS, Python 3.9.21, CUDA 11.8, NVIDIA H20-80GB.

Clone the repository (requires git):

git clone https://github.com/HongkLin/MERGE
cd MERGE

Install dependencies (requires conda):

conda create -n merge python=3.9.21 -y
conda activate merge
conda install pytorch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
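
After installation, a quick check can confirm that the environment matches the versions pinned above. The following is a minimal sketch and not part of the repository; the helper name `describe_environment` is hypothetical, and it only reports versions rather than enforcing them.

```python
import importlib
import sys


def describe_environment():
    """Report the Python and PyTorch versions, tolerating a missing torch install."""
    info = {"python": ".".join(str(part) for part in sys.version_info[:3])}
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # torch is not installed in this interpreter; report that instead of failing.
        info["torch"] = None
        return info
    info["torch"] = torch.__version__
    info["torch_cuda_build"] = torch.version.cuda  # CUDA version torch was built against
    info["cuda_available"] = torch.cuda.is_available()
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If the setup above succeeded, the report should show a 2.3.1 build of torch compiled against CUDA 11.8.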

🔥 Training

Follow Marigold to prepare the depth training data (Hypersim and Virtual KITTI 2). The default dataset structure is as follows:

datasets/
    hypersim/
        test/
        train/
            ai_001_001/
            ...
            ai_055_010/
        val/
    vkitti/
        depth/
            Scene01/
            ...
            Scene20/
        rgb/
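
Before launching training, it can help to confirm that the layout above is in place. The sketch below is an illustration rather than repository code: the helper `missing_dataset_dirs` is hypothetical, and it checks only the directories named explicitly in the tree, not the individual scenes elided with `...`.

```python
from pathlib import Path

# Directories taken from the tree above; elided scene folders are not enumerated.
EXPECTED_DIRS = [
    "hypersim/test",
    "hypersim/train",
    "hypersim/val",
    "vkitti/depth",
    "vkitti/rgb",
]


def missing_dataset_dirs(root):
    """Return the expected sub-directories that do not exist under root."""
    root = Path(root)
    return [rel for rel in EXPECTED_DIRS if not (root / rel).is_dir()]


if __name__ == "__main__":
    missing = missing_dataset_dirs("datasets")
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("Dataset layout looks complete.")
```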

Download the pre-trained PixArt-α and FLUX.1 [dev] weights, then set pretrained_model_name_or_path to the location of the downloaded weights.
Run the training command! 🚀

conda activate merge

# Training MERGE-B model
bash train_scripts/train_merge_b_depth.sh

# Training MERGE-L model
bash train_scripts/train_merge_l_depth.sh

🕹️ Inference

Place your images in a directory, for example ./data, where we have prepared several examples.
Run the inference command:

# for MERGE-B
python inference_merge_base_depth.py --pretrained_model_path PATH/PixArt-XL-2-512x512 --model_weights PATH/merge_base_depth --image_path ./data/demo_1.png

# for MERGE-L
python inference_merge_large_depth.py --pretrained_model_path PATH/FLUX.1-dev --model_weights PATH/merge_large_depth --image_path ./data/demo_1.png
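
The commands above process one image each. To run over a whole directory, one option is to generate a command per image, as in the sketch below. This is an assumption-laden illustration: the helpers `list_images` and `batch_commands` are hypothetical, the set of accepted file extensions is a guess, and only the flags shown in the commands above are used.

```python
from pathlib import Path

# Assumption: which file extensions count as input images.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def list_images(directory):
    """Return image files directly inside directory, sorted by path."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def batch_commands(directory, script, pretrained_model_path, model_weights):
    """Build one inference command (as an argv list) per image in directory."""
    return [
        [
            "python", script,
            "--pretrained_model_path", pretrained_model_path,
            "--model_weights", model_weights,
            "--image_path", str(image),
        ]
        for image in list_images(directory)
    ]


if __name__ == "__main__":
    if Path("data").is_dir():
        for cmd in batch_commands(
            "data",
            "inference_merge_base_depth.py",
            "PATH/PixArt-XL-2-512x512",
            "PATH/merge_base_depth",
        ):
            print(" ".join(cmd))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root. Note that this reloads the model for every image, so it favors simplicity over speed.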

Choose your model

Below are the released models and their corresponding configurations:

| CHECKPOINT_DIR | PRETRAINED_MODEL | TASK_NAME |
| --- | --- | --- |
| `merge-base-depth-v1` | PixArt-XL-2-512x512 | depth |
| `merge-large-depth-v1` | FLUX.1-dev | depth |
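
When scripting around these checkpoints, the table can be mirrored as a small lookup so that a mistyped checkpoint name fails early. The sketch below is not repository code; `ReleasedModel` and `get_released_model` are hypothetical names, and the entries simply copy the rows above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReleasedModel:
    checkpoint_dir: str
    pretrained_model: str
    task_name: str


# Rows copied from the table above, keyed by checkpoint directory.
RELEASED_MODELS = {
    model.checkpoint_dir: model
    for model in (
        ReleasedModel("merge-base-depth-v1", "PixArt-XL-2-512x512", "depth"),
        ReleasedModel("merge-large-depth-v1", "FLUX.1-dev", "depth"),
    )
}


def get_released_model(checkpoint_dir):
    """Look up a released checkpoint, listing the valid names if it is unknown."""
    try:
        return RELEASED_MODELS[checkpoint_dir]
    except KeyError:
        valid = ", ".join(sorted(RELEASED_MODELS))
        raise ValueError(
            f"unknown checkpoint {checkpoint_dir!r}; expected one of: {valid}"
        ) from None


if __name__ == "__main__":
    print(get_released_model("merge-large-depth-v1"))
```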

⚖️ Main Results

Zero-shot Depth Estimation Results

Zero-shot Normal Estimation Results

📖BibTeX

If you find this repository useful in your research, please consider giving it a star ⭐ and citing our paper:

@inproceedings{lin2025merge,
      title={More Than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models}, 
      author={Lin, Hongkai and Liang, Dingkang and Du, Mingyang and Zhou, Xin and Bai, Xiang},
      booktitle={Advances in Neural Information Processing Systems},
      year={2025},
}

🤗Acknowledgements

Thanks to Diffusers for their wonderful technical support and awesome collaboration!
Thanks to Hugging Face for sponsoring the nice demo!
Thanks to DiT for their wonderful work and codebase!
Thanks to PixArt-α for their wonderful work and codebase!
Thanks to FLUX and Marigold for their wonderful work!