GitPedia

MindPipe

A powerful model compression framework for LLMs and LVLMs, adapted for NVIDIA GPUs and Huawei Ascend NPUs.

From MAC-AutoML · Updated June 16, 2026

**A Unified Compression & Evaluation Framework for LLMs and VLMs** The project is written primarily in Python and was first published in 2026. It has 1,010 stars and 24 forks on GitHub. Key topics include automatic-compression, compression, deployment, evaluation, and huawei-ascend-npus.

<div align="center">

🧠 MindPipe

A Unified Compression & Evaluation Framework for LLMs and VLMs

Python 3.10+
PyTorch
Hugging Face
License
NPU Ready

English | 中文

<p align="center"> <em>One CLI. 11 quantization methods. 7 pruning methods. GPU & NPU. Text & Vision.</em> </p>

Quantize · Prune · Evaluate · Reproduce

</div>

✨ Why MindPipe?

Most compression tools only handle one technique on one type of model.
MindPipe unifies them all under a single, reproducible pipeline.

<table> <tr> <td width="50%">

🎯 One Entrypoint, All Methods

A single main.py drives quantization, pruning, combined workflows, and evaluation, so there are no separate scripts to juggle.

🔀 GPU + NPU

First-class support for both CUDA GPUs and Ascend NPUs through a shared device abstraction.

📊 Integrated Evaluation

Perplexity (PPL), lm-eval-harness zero-shot tasks, and VLMEvalKit multimodal benchmarks are all built in.

</td> <td width="50%">

🧩 Modular & Extensible

Clean registry-based architecture makes adding new algorithms straightforward.

🔬 Reproducibility First

JSON artifacts, batch scripts, and per-run metrics ensure every result is traceable.

👁️ Vision-Language Native

VLMs are first-class citizens, not an afterthought, with dedicated multimodal evaluation.

</td> </tr> </table>

🚀 Quick Start

bash
# 1. Setup
conda activate mindpipe
git submodule update --init --recursive
pip install -r requirements.txt

# 2. Quantize a model (AWQ W4A16)
CUDA_VISIBLE_DEVICES=0 python main.py \
  --quantization awq \
  --model_path /path/to/model \
  --device_map auto \
  --dtype float16 \
  --calibration_dataset pileval \
  --calibration_samples 128 \
  --sequence_length 2048 \
  --weight_bits 4 \
  --group_size 128 \
  --eval_ppl true \
  --output_dir ./results/awq

# 3. Prune a model (Wanda 50% sparsity)
CUDA_VISIBLE_DEVICES=0 python main.py \
  --pruning wanda \
  --model_path /path/to/model \
  --device_map auto \
  --dtype float16 \
  --calibration_dataset c4 \
  --calibration_samples 128 \
  --sparsity_ratio 0.5 \
  --eval_ppl true \
  --output_dir ./results/wanda

# 4. Recover a FlatQuant + pruning workflow with Compression LoRA
MODEL_PATH=/path/to/model \
GPU_ID=0,1 \
bash scripts/finetuning/flatquant_lora_auto_gpu.sh

<details> <summary><b>📋 More Examples (Click to Expand)</b></summary>

Full-Precision Evaluation

bash
CUDA_VISIBLE_DEVICES=0 python main.py \
  --model_path /path/to/model \
  --device_map auto \
  --dtype float16 \
  --attn_implementation sdpa \
  --evaluation_dataset wikitext2 \
  --sequence_length 2048 \
  --batch_size 1 \
  --max_eval_chunks 64 \
  --eval_ppl true \
  --eval_zero_shot true \
  --zero_shot_tasks boolq piqa rte winogrande arc_easy arc_challenge openbookqa \
  --zero_shot_num_fewshot 0 \
  --zero_shot_batch_size 1 \
  --output_dir ./results/fp_eval

GPTQ Quantization

bash
CUDA_VISIBLE_DEVICES=0 python main.py \
  --quantization gptq \
  --model_path /path/to/model \
  --device_map auto \
  --dtype float16 \
  --attn_implementation sdpa \
  --calibration_dataset pileval \
  --evaluation_dataset wikitext2 \
  --calibration_samples 128 \
  --sequence_length 2048 \
  --weight_bits 4 \
  --activation_bits 16 \
  --group_size 128 \
  --weight_group_size 128 \
  --eval_ppl true \
  --output_dir ./results/gptq

Pruning + Quantization Pipeline

bash
CUDA_VISIBLE_DEVICES=0,1 python main.py \
  --pruning wanda_sp \
  --quantization gptq \
  --execution_order pruning_then_quantization \
  --model_path /path/to/model \
  --device_map auto \
  --dtype float16 \
  --attn_implementation sdpa \
  --calibration_dataset c4 \
  --calibration_samples 128 \
  --sequence_length 2048 \
  --sparsity_ratio 0.2 \
  --weight_bits 4 \
  --group_size 128 \
  --eval_ppl true \
  --output_dir ./results/workflow

VLM Multimodal Evaluation

bash
CUDA_VISIBLE_DEVICES=0 python main.py \
  --model_path /path/to/vlm \
  --device_map auto \
  --dtype float16 \
  --attn_implementation sdpa \
  --eval_ppl false \
  --eval_zero_shot false \
  --eval_vlm true \
  --vlm_datasets OCRBench TextVQA_VAL ChartQA_TEST InfoVQA_VAL \
  --vlm_mode all \
  --vlm_api_nproc 1 \
  --vlm_eval_kit_root /path/to/VLMEvalKit \
  --output_dir ./results/vlm_eval
</details>
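
The --sequence_length and --max_eval_chunks options in the evaluation examples suggest that perplexity is computed over fixed-length chunks of the evaluation set. The following is a minimal sketch of that arithmetic, not MindPipe's evaluation code. It assumes per-token log-probabilities are already available; the function name `perplexity` and the choice to drop a trailing partial chunk are assumptions made for illustration.

```python
import math
from typing import List, Optional


def perplexity(token_log_probs: List[float],
               sequence_length: int = 2048,
               max_eval_chunks: Optional[int] = None) -> float:
    """Return exp(-mean log p) over the tokens of the evaluated chunks.

    Assumption: only full chunks of `sequence_length` tokens are scored,
    and `max_eval_chunks` caps how many of them are used.
    """
    last_start = len(token_log_probs) - sequence_length
    chunks = [token_log_probs[i:i + sequence_length]
              for i in range(0, last_start + 1, sequence_length)]
    if max_eval_chunks is not None:
        chunks = chunks[:max_eval_chunks]
    if not chunks:
        raise ValueError("not enough tokens for a single chunk")
    total = sum(sum(chunk) for chunk in chunks)   # summed natural-log probabilities
    count = sum(len(chunk) for chunk in chunks)   # number of scored tokens
    return math.exp(-total / count)
```

A model that assigns probability 0.25 to every token has a perplexity of 4 under this definition, which is a quick sanity check for any PPL implementation.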

📦 Supported Algorithms

Use the method identifiers in the CLI column as command-line values. Display
names such as QA-LoRA, LLM-Pruner, and Wanda-SP are descriptive; the actual CLI
values are qalora, llm_pruner, and wanda_sp.
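
The split between display names and CLI identifiers fits naturally with a registry keyed by the CLI value. The sketch below shows one way such a registry could look. It is an assumption for illustration, not MindPipe's actual code: `QUANTIZATION_REGISTRY`, `register_quantization`, `resolve`, and the two placeholder classes are hypothetical names.

```python
from typing import Callable, Dict

# Hypothetical registry: CLI identifier -> class implementing the method.
QUANTIZATION_REGISTRY: Dict[str, type] = {}


def register_quantization(cli_name: str) -> Callable[[type], type]:
    """Register a class under the identifier users pass on the command line."""
    def decorator(cls: type) -> type:
        if cli_name in QUANTIZATION_REGISTRY:
            raise ValueError(f"duplicate method identifier: {cli_name}")
        QUANTIZATION_REGISTRY[cli_name] = cls
        return cls
    return decorator


@register_quantization("qalora")
class QALoRA:
    display_name = "QA-LoRA"  # descriptive only; never used for lookup


@register_quantization("awq")
class AWQ:
    display_name = "AWQ"


def resolve(cli_name: str) -> type:
    """Look up a method by CLI identifier, listing valid choices on failure."""
    try:
        return QUANTIZATION_REGISTRY[cli_name]
    except KeyError:
        choices = ", ".join(sorted(QUANTIZATION_REGISTRY))
        raise SystemExit(f"unknown method '{cli_name}'; choose from: {choices}")
```

With this layout, `resolve("qalora")` returns the class, while `resolve("QA-LoRA")` exits with the list of valid identifiers. Adding a method then amounts to adding one decorated class.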

Quantization (11 Methods)

| Method | CLI | Family | Technique | NPU |
|---|---|---|---|---|
| AWQ | awq | PTQ | Weight-only with activation-aware scaling | ✅ |
| GPTQ | gptq | PTQ | Weight-only GPTQ quantization | ✅ |
| MQuant | mquant | PTQ | Multimodal GPTQ/AWQ for language & visual branches | ⏳ |
| OmniQuant | omniquant | PTQ | Learnable weight & activation transformation | ✅ |
| QuaRot | quarot | PTQ | Rotation-based W/A/KV quantization | ⏳ |
| SmoothQuant | smoothquant | PTQ | Activation smoothing for W/A quantization | ✅ |
| SpinQuant | spinquant | PTQ | Rotation-based W/A/KV with SpinQuant hooks | ⏳ |
| FlatQuant | flatquant | QAT | Trainable transformations | ✅ |
| QLoRA | qlora | QAT | Low-bit fake-quant adapter training | ✅ |
| QA-LoRA | qalora | QAT | Group-pooled adapter training | 🔶 |
| SplitQuant | splitquant | QAT | SplitQuant-style trainable transformations | ✅ |
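
To make --weight_bits and --group_size concrete, here is a minimal sketch of group-wise round-to-nearest (RTN) fake quantization in plain Python. RTN is only the simplest baseline: AWQ, GPTQ, and the other methods in the table add scaling, error compensation, or learned transformations on top of it. The function names are hypothetical and the code is not taken from MindPipe.

```python
from typing import List


def quantize_group_rtn(weights: List[float], bits: int) -> List[float]:
    """Asymmetric round-to-nearest fake quantization of one weight group."""
    levels = (1 << bits) - 1            # e.g. 15 integer steps for 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0   # fall back to 1.0 for a constant group
    dequantized = []
    for w in weights:
        code = round((w - lo) / scale)      # integer code in [0, levels]
        code = max(0, min(levels, code))    # clamp against rounding overshoot
        dequantized.append(code * scale + lo)
    return dequantized


def quantize_row_rtn(row: List[float], bits: int = 4,
                     group_size: int = 128) -> List[float]:
    """Quantize a weight row group by group; each group gets its own scale."""
    out: List[float] = []
    for start in range(0, len(row), group_size):
        out.extend(quantize_group_rtn(row[start:start + group_size], bits))
    return out
```

Smaller groups mean more scales to store but a tighter range per group, so the per-weight rounding error (at most half a quantization step) shrinks.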

Pruning (7 Methods)

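| Method | CLI | Type | Calibration | NPU |
|---|---|---|---|---|
| ALPS | alps | Unstructured / n:m | c4 | ✅ |
| FLAP | flap | Structured | wikitext2 | ✅ |
| LLM-Pruner | llm_pruner | Structured | c4 | ✅ |
| ShortGPT | shortgpt | Layer pruning | pg19 | ✅ |
| SparseGPT | sparsegpt | Unstructured / n:m | c4 | ✅ |
| Wanda | wanda | Unstructured / n:m | c4 | ✅ |
| Wanda-SP | wanda_sp | Structured | c4 | ✅ |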
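
The n:m entries in the table refer to semi-structured sparsity patterns such as 2:4 and 4:8. As a rough illustration, the sketch below applies an n:m mask by plain weight magnitude, under the common convention that at most n weights stay nonzero in every group of m consecutive weights. This is a simplification: ALPS, SparseGPT, and Wanda each use their own calibration-based criteria rather than raw magnitude, and `nm_mask` is a hypothetical helper, not a MindPipe function.

```python
from typing import List


def nm_mask(row: List[float], n: int = 2, m: int = 4) -> List[float]:
    """Keep the n largest-magnitude weights in each group of m; zero the rest."""
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    pruned: List[float] = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        # Positions of the n entries with the largest magnitude in this group.
        ranked = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(ranked[:n])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


weights = [0.1, -0.9, 0.5, 0.2, 1.0, 0.0, -0.3, 0.4]
print(nm_mask(weights))  # [0.0, -0.9, 0.5, 0.0, 1.0, 0.0, 0.0, 0.4]
```

Both 2:4 and 4:8 zero out half of the weights, so they correspond to a sparsity ratio of 0.5.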

Finetuning

| Method | CLI | Scope | NPU |
|---|---|---|---|
| Compression LoRA | compression_lora | FlatQuant + fixed-mask pruning recovery | ⏳ |

Compression LoRA currently supports FlatQuant combined with fixed-shape pruning
masks, such as Wanda, SparseGPT, and ALPS. Structured pruning requires
pseudo-pruning mode.

✅ Ready | ⏳ In Progress | 🔶 CUDA Only


🏗️ Architecture

┌─────────────────────────────────────────────────────────────────┐
│                         main.py (CLI)                           │
├─────────────────────────────────────────────────────────────────┤
│                    workflow/ (Config + Executor)                │
├──────────────┬──────────────┬──────────────┬────────────────────┤
│ Quantization │   Pruning    │  Finetuning  │     Evaluation     │
│ ┌──────────┐ │ ┌──────────┐ │ ┌──────────┐ │ ┌────────────────┐ │
│ │ PTQ (7)  │ │ │Structured│ │ │Compress. │ │ │ PPL            │ │
│ │ QAT (4)  │ │ │Unstruct. │ │ │LoRA      │ │ │ Zero-shot      │ │
│ └──────────┘ │ │LayerPrune│ │ └──────────┘ │ │ VLM Eval       │ │
│              │ └──────────┘ │              │ └────────────────┘ │
├──────────────┴──────────────┴──────────────┴────────────────────┤
│              algorithm/common/ (Shared Infrastructure)          │
│     Model Loading · Data · Device (GPU/NPU) · IO · Metrics      │
└─────────────────────────────────────────────────────────────────┘
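
The diagram describes a workflow layer that turns CLI options into an ordered list of stages. A minimal sketch of that idea follows, assuming the `--execution_order pruning_then_quantization` value from the pipeline example. The reverse value `quantization_then_pruning`, the stage functions, and `build_stages` are hypothetical and only illustrate the ordering logic.

```python
from typing import Callable, List, Optional

Stage = Callable[[dict], dict]


def prune(state: dict) -> dict:
    # Placeholder stage: records that pruning ran.
    return {**state, "history": state["history"] + ["pruning"]}


def quantize(state: dict) -> dict:
    # Placeholder stage: records that quantization ran.
    return {**state, "history": state["history"] + ["quantization"]}


def evaluate(state: dict) -> dict:
    # Placeholder stage: records that evaluation ran.
    return {**state, "history": state["history"] + ["evaluation"]}


def build_stages(pruning: Optional[str], quantization: Optional[str],
                 execution_order: str = "pruning_then_quantization") -> List[Stage]:
    """Order the requested compression stages, then append evaluation."""
    stages: List[Stage] = []
    if pruning:
        stages.append(prune)
    if quantization:
        stages.append(quantize)
    if execution_order == "quantization_then_pruning":  # assumed counterpart value
        stages.reverse()
    elif execution_order != "pruning_then_quantization":
        raise ValueError(f"unknown execution order: {execution_order}")
    return stages + [evaluate]


def run(stages: List[Stage], state: dict) -> dict:
    """Thread the state through each stage in order."""
    for stage in stages:
        state = stage(state)
    return state
```

Keeping each stage a plain function from state to state is one way to let a single entrypoint drive any combination of methods.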

Repository Layout

MindPipe/
├── main.py                         # Unified CLI entrypoint
├── algorithm/
│   ├── common/                     # Shared model, data, device, IO utilities
│   ├── quantization/
│   │   ├── ptq/                    # AWQ, GPTQ, MQuant, OmniQuant, QuaRot, SmoothQuant, SpinQuant
│   │   └── qat/                    # FlatQuant, QLoRA, QA-LoRA, SplitQuant
│   ├── finetuning/                 # Compression LoRA recovery finetuning
│   └── pruning/
│       ├── structured/             # FLAP, LLM-Pruner, ShortGPT, Wanda-SP
│       └── unstructured/           # ALPS, SparseGPT, Wanda
├── workflow/                       # CLI config builder and stage executor
├── evaluation/                     # PPL, lm-eval-harness, and VLMEvalKit runners
├── configs/                        # Shared and algorithm-specific configs
├── scripts/                        # Batch and reproducibility scripts
└── third_party/                    # Optional external evaluation tools

🤖 Model Coverage

<table> <tr><td>

| Model Family | Text | Vision |
|---|---|---|
| LLaMA-2 / LLaMA-3 | ✅ | - |
| Qwen2.5 | ✅ | - |
| Qwen3 | ✅ | - |
| Qwen3.5 / Qwen3.6 dense | ✅ | ✅ |

</td><td>

| Model Family | Text | Vision |
|---|---|---|
| Qwen2-VL | ✅ | ✅ |
| Qwen2.5-VL | ✅ | ✅ |
| Qwen3-VL | ✅ | ✅ |
| MiniCPM-V | ✅ | ✅ |
| LLaVA / InternVL | ✅ | 🔶 |

</td></tr> </table>

Note: Model support is algorithm-dependent. Check algorithm/quantization/*/*/method.py or algorithm/pruning/*/*/method.py for exact coverage.
MoE variants are method-dependent and are not covered by Compression LoRA yet.


📖 Configuration Reference

<details> <summary><b>⚙️ Common Arguments</b></summary>

| Argument | Default | Description |
|---|---|---|
| --model_path | Required | Local or Hugging Face model path |
| --device | auto | Logical device used by runtime helpers |
| --device_map | None | Required for pruning/quantization (auto recommended) |
| --dtype | bfloat16 | auto, float16, or bfloat16 |
| --attn_implementation | flash_attention_2 | flash_attention_2, sdpa, or eager |
| --calibration_dataset | Method default | wikitext2, c4, pileval, pg19, or bookcorpus |
| --evaluation_dataset | wikitext2 | Dataset used for PPL evaluation |
| --calibration_samples | 128 | Number of calibration samples |
| --sequence_length | 2048 | Sequence length for calibration and evaluation |
| --batch_size | 1 | PPL batch size |
| --max_eval_chunks | 64 | Optional cap for PPL chunks |
| --eval_ppl | false | Enable perplexity evaluation |
| --eval_zero_shot | false | Enable lm-eval-harness tasks |
| --eval_vlm | false | Enable VLMEvalKit evaluation |

</details> <details> <summary><b>🔢 Quantization Arguments</b></summary>

| Argument | Default | Description |
|---|---|---|
| --quantization | None | One of the registered quantization methods |
| --weight_bits | 4 | Weight quantization bit width |
| --activation_bits | 16 | Activation quantization bit width |
| --query_bits | 16 | Query activation bit width |
| --key_bits | 16 | Key cache bit width |
| --value_bits | 16 | Value cache bit width |
| --group_size | 128 | Default group size |
| --weight_group_size | None | Overrides weight group size |
| --activation_group_size | None | Overrides activation group size |
| --kv_group_size | None | Overrides KV group size |
| --weight_method | gptq | Weight method for methods supporting GPTQ/RTN |

</details> <details> <summary><b>✂️ Pruning Arguments</b></summary>

| Argument | Default | Description |
|---|---|---|
| --pruning | None | One of the registered pruning methods |
| --sparsity_ratio | 0.5 | Target sparsity ratio |
| --structure_pattern | unstructured | unstructured, 2:4, or 4:8 |
| --block_size | 128 | Block size for supported pruning methods |
| --damp_percent | 0.01 | Hessian damping ratio for second-order methods |

</details>
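
The argument tables imply two small conventions: boolean flags take explicit true/false values (as in --eval_ppl true), and the per-tensor group-size options default to None so that they fall back to --group_size. The sketch below shows how such behavior could be written with argparse. It covers only a subset of the documented flags, and `str2bool`, `build_parser`, and `resolve_group_sizes` are hypothetical helpers rather than MindPipe's actual parser.

```python
import argparse
from typing import Dict


def str2bool(value: str) -> bool:
    """Parse the 'true'/'false' strings used in the CLI examples."""
    lowered = value.lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected true or false, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for an illustrative subset of the documented flags."""
    parser = argparse.ArgumentParser(description="Illustrative flag subset")
    parser.add_argument("--model_path", required=True)
    parser.add_argument("--weight_bits", type=int, default=4)
    parser.add_argument("--group_size", type=int, default=128)
    parser.add_argument("--weight_group_size", type=int, default=None)
    parser.add_argument("--activation_group_size", type=int, default=None)
    parser.add_argument("--kv_group_size", type=int, default=None)
    parser.add_argument("--eval_ppl", type=str2bool, default=False)
    return parser


def resolve_group_sizes(args: argparse.Namespace) -> Dict[str, int]:
    """Assumed rule: an unset override (None) inherits --group_size."""
    def pick(override):
        return args.group_size if override is None else override
    return {
        "weight": pick(args.weight_group_size),
        "activation": pick(args.activation_group_size),
        "kv": pick(args.kv_group_size),
    }
```

Under this reading, passing `--group_size 64 --weight_group_size 128` would quantize weights in groups of 128 while activations and the KV cache use groups of 64.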

🔧 Installation

Prerequisites

  • Python 3.10+
  • PyTorch 2.0+
  • CUDA 11.8+ (for GPU) or Ascend CANN (for NPU)

Setup

bash
# Clone the repository
git clone https://github.com/your-org/MindPipe.git
cd MindPipe

# Create environment
conda create -n mindpipe python=3.10 -y
conda activate mindpipe

# Install dependencies
git submodule update --init --recursive
pip install -r requirements.txt

Optional: VLMEvalKit

For multimodal evaluation, initialize the VLMEvalKit submodule or set VLMEVALKIT_ROOT:

bash
git submodule update --init third_party/VLMEvalKit
# or
export VLMEVALKIT_ROOT=/path/to/existing/VLMEvalKit

📈 Reproducibility

The scripts/repro/ directory contains ready-to-use benchmark launchers:

bash
# Dry run (print commands without executing)
DRY_RUN=true bash scripts/repro/run_qlora_adapted_models_text_suite.sh

# Filter specific models
MODEL_FILTER=qwen3 bash scripts/repro/run_mquantpp_awq_vlm_serial_suite.sh

Available scripts include:

  • run_qlora_adapted_models_text_suite.sh
  • run_qalora_adapted_models_text_suite.sh
  • run_mquantpp_awq_vlm_serial_suite.sh
  • run_qwen2_5_vl_gptq_vlm_suite.sh
  • run_qwen3_vl_2b_gptq_suite.sh

Compression LoRA finetuning launchers are grouped under scripts/finetuning/:

  • flatquant_lora_auto_gpu.sh automatically dispatches by model config.
  • llm/flatquant_lora_llm_gpu.sh is for text-only LLMs.
  • vlm/flatquant_lora_vlm_gpu.sh is for MiniCPM-V, Qwen2.5-VL, and Qwen3-VL.
  • qwen3_5/flatquant_lora_qwen3_5_gpu.sh is for dense Qwen3.5/Qwen3.6 VLMs.

📂 Output Structure

results/
├── <model>/<algorithm>/<run_spec>/
│   ├── metrics.json        # Evaluation results & run metadata
│   └── artifacts.json      # Algorithm details, calibration settings, checkpoints
└── <model>/<execution_order>/<algorithm1>__<algorithm2>/<run_spec>/
    └── metrics.json
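
Because every run writes a metrics.json under a predictable directory, results can be gathered with a short script. The following sketch collects all metrics files below a results root. It treats each file as opaque JSON, since the exact keys inside metrics.json are not documented here, and `collect_metrics` is a hypothetical helper.

```python
import json
from pathlib import Path
from typing import Dict


def collect_metrics(results_root: str) -> Dict[str, dict]:
    """Map each run directory (relative to the root) to its parsed metrics.json."""
    root = Path(results_root)
    runs: Dict[str, dict] = {}
    for metrics_file in sorted(root.rglob("metrics.json")):
        run_dir = metrics_file.parent.relative_to(root).as_posix()
        with metrics_file.open(encoding="utf-8") as handle:
            runs[run_dir] = json.load(handle)
    return runs
```

The relative run directory doubles as a key that encodes the model, the algorithm (or execution order and algorithm pair), and the run specification.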

⚠️ Known Limitations

| Limitation | Status |
|---|---|
| QuaRot / SpinQuant not NPU-ready | ⏳ |
| MQuant GPU-only | ⏳ |
| QA-LoRA CUDA-only, no AutoGPTQ export | 🔶 |
| QLoRA W2/W3 use fake-quant fallback on NPU | ℹ️ |
| Compression LoRA currently requires FlatQuant + fixed-mask pruning | ℹ️ |
| Qwen3.5-MoE / Qwen3.6-35B-A3B Compression LoRA is not supported yet | ⏳ |
| Custom runtime wrapper reload is method-dependent | ℹ️ |

📜 Citation & Acknowledgements

MindPipe builds upon the following outstanding research. Please cite the original papers when using their methods:

<details> <summary><b>Click to see referenced works</b></summary>
  • AWQ: Activation-aware Weight Quantization
  • GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
  • QuaRot: Outlier-Free Quantization via Rotations
  • SpinQuant: Rotation-Based Quantization
  • FlatQuant: Flatness-Aware Quantization
  • SmoothQuant: Accurate and Efficient Post-Training Quantization
  • OmniQuant: Omnidirectionally Calibrated Quantization
  • SplitQuant: Split Quantization
  • QLoRA: Efficient Finetuning of Quantized LLMs
  • QA-LoRA: Quantization-Aware Low-Rank Adaptation
  • Wanda: Pruning by Weights and Activations
  • SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot
  • FLAP: Fluctuation-based Adaptive Structured Pruning
  • ShortGPT: Layers in LLMs are More Redundant Than You Expect
  • LLM-Pruner: On the Structural Pruning of Large Language Models
  • ALPS: Adaptive Layer-wise Pruning and Sparsification
</details>
<div align="center">

⭐ If you find MindPipe useful, please consider giving it a star!

Built with ❤️ for the model compression community

</div>


This article is auto-generated from MAC-AutoML/MindPipe via the GitHub API. Last fetched: 6/18/2026