MindPipe
A powerful model compression framework for LLMs and LVLMs, adapted for NVIDIA GPUs and Huawei Ascend NPUs.
**A Unified Compression & Evaluation Framework for LLMs and VLMs**

The project is written primarily in Python and was first published in 2026. It has 1,010 stars and 24 forks on GitHub. Key topics include automatic-compression, compression, deployment, evaluation, and huawei-ascend-npus.
<p align="center"> <em>One CLI. 11 quantization methods. 7 pruning methods. GPU & NPU. Text & Vision.</em> </p>

Quantize · Prune · Evaluate · Reproduce

</div>

✨ Why MindPipe?
<table> <tr> <td width="50%">

Most compression tools only handle one technique on one type of model.
MindPipe unifies them all under a single, reproducible pipeline.

🎯 One Entrypoint, All Methods
A single main.py drives quantization, pruning, combined workflows, and evaluation, with no juggling of scripts.

GPU + NPU
First-class support for both CUDA GPUs and Ascend NPUs through a shared device abstraction.

Integrated Evaluation
PPL, lm-eval-harness zero-shot, and VLMEvalKit multimodal benchmarks are all built in.

</td> <td width="50%">

🧩 Modular & Extensible
A clean registry-based architecture makes adding new algorithms straightforward.

🔬 Reproducibility First
JSON artifacts, batch scripts, and per-run metrics make every result traceable.

Vision-Language Native
VLMs are not an afterthought: they are first-class citizens with dedicated multimodal evaluation.

</td> </tr> </table>

Quick Start
```bash
# 1. Setup
conda activate mindpipe
git submodule update --init --recursive
pip install -r requirements.txt

# 2. Quantize a model (AWQ W4A16)
CUDA_VISIBLE_DEVICES=0 python main.py \
  --quantization awq \
  --model_path /path/to/model \
  --device_map auto \
  --dtype float16 \
  --calibration_dataset pileval \
  --calibration_samples 128 \
  --sequence_length 2048 \
  --weight_bits 4 \
  --group_size 128 \
  --eval_ppl true \
  --output_dir ./results/awq

# 3. Prune a model (Wanda 50% sparsity)
CUDA_VISIBLE_DEVICES=0 python main.py \
  --pruning wanda \
  --model_path /path/to/model \
  --device_map auto \
  --dtype float16 \
  --calibration_dataset c4 \
  --calibration_samples 128 \
  --sparsity_ratio 0.5 \
  --eval_ppl true \
  --output_dir ./results/wanda

# 4. Recover a FlatQuant + pruning workflow with Compression LoRA
MODEL_PATH=/path/to/model \
GPU_ID=0,1 \
bash scripts/finetuning/flatquant_lora_auto_gpu.sh
```

<details>
<summary><b>More Examples (Click to Expand)</b></summary>
Full-Precision Evaluation
```bash
CUDA_VISIBLE_DEVICES=0 python main.py \
  --model_path /path/to/model \
  --device_map auto \
  --dtype float16 \
  --attn_implementation sdpa \
  --evaluation_dataset wikitext2 \
  --sequence_length 2048 \
  --batch_size 1 \
  --max_eval_chunks 64 \
  --eval_ppl true \
  --eval_zero_shot true \
  --zero_shot_tasks boolq piqa rte winogrande arc_easy arc_challenge openbookqa \
  --zero_shot_num_fewshot 0 \
  --zero_shot_batch_size 1 \
  --output_dir ./results/fp_eval
```
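Perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below illustrates how `--sequence_length` and `--max_eval_chunks` could bound such an evaluation. The non-overlapping chunking scheme and the `nll_fn` callback are assumptions made for illustration; MindPipe's own evaluator may differ.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple


def chunk_tokens(token_ids: Sequence[int], sequence_length: int,
                 max_eval_chunks: Optional[int] = None) -> List[List[int]]:
    """Split a token stream into non-overlapping windows (assumed scheme)."""
    n_chunks = len(token_ids) // sequence_length
    if max_eval_chunks is not None:
        n_chunks = min(n_chunks, max_eval_chunks)
    return [list(token_ids[i * sequence_length:(i + 1) * sequence_length])
            for i in range(n_chunks)]


def perplexity(chunks: List[List[int]],
               nll_fn: Callable[[List[int]], Tuple[float, int]]) -> float:
    """Return exp(total NLL / total predicted tokens).

    nll_fn is a hypothetical callback that returns, for one chunk,
    the summed negative log-likelihood (in nats) and the token count.
    """
    total_nll, total_tokens = 0.0, 0
    for chunk in chunks:
        nll, n_tokens = nll_fn(chunk)
        total_nll += nll
        total_tokens += n_tokens
    return math.exp(total_nll / total_tokens)
```

As a sanity check, a model that assigns uniform probability over a vocabulary of size V has a perplexity of V.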
GPTQ Quantization
```bash
CUDA_VISIBLE_DEVICES=0 python main.py \
  --quantization gptq \
  --model_path /path/to/model \
  --device_map auto \
  --dtype float16 \
  --attn_implementation sdpa \
  --calibration_dataset pileval \
  --evaluation_dataset wikitext2 \
  --calibration_samples 128 \
  --sequence_length 2048 \
  --weight_bits 4 \
  --activation_bits 16 \
  --group_size 128 \
  --weight_group_size 128 \
  --eval_ppl true \
  --output_dir ./results/gptq
```
Pruning + Quantization Pipeline
```bash
CUDA_VISIBLE_DEVICES=0,1 python main.py \
  --pruning wanda_sp \
  --quantization gptq \
  --execution_order pruning_then_quantization \
  --model_path /path/to/model \
  --device_map auto \
  --dtype float16 \
  --attn_implementation sdpa \
  --calibration_dataset c4 \
  --calibration_samples 128 \
  --sequence_length 2048 \
  --sparsity_ratio 0.2 \
  --weight_bits 4 \
  --group_size 128 \
  --eval_ppl true \
  --output_dir ./results/workflow
```
VLM Multimodal Evaluation
```bash
CUDA_VISIBLE_DEVICES=0 python main.py \
  --model_path /path/to/vlm \
  --device_map auto \
  --dtype float16 \
  --attn_implementation sdpa \
  --eval_ppl false \
  --eval_zero_shot false \
  --eval_vlm true \
  --vlm_datasets OCRBench TextVQA_VAL ChartQA_TEST InfoVQA_VAL \
  --vlm_mode all \
  --vlm_api_nproc 1 \
  --vlm_eval_kit_root /path/to/VLMEvalKit \
  --output_dir ./results/vlm_eval
```

</details>
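The commands above all follow a regular `--name value` shape, so launching runs from Python is mostly a matter of building an argument list. The following is a minimal sketch, assuming booleans are passed as the literal strings `true` and `false` (as in the examples) and multi-valued flags as space-separated tokens. `build_command` is a hypothetical helper, not part of MindPipe.

```python
import shlex
from typing import Any, Dict, List


def build_command(options: Dict[str, Any], entrypoint: str = "main.py") -> List[str]:
    """Turn an options dict into an argv list for the CLI entrypoint."""
    argv = ["python", entrypoint]
    for name, value in options.items():
        if value is None:          # unset options are simply omitted
            continue
        argv.append(f"--{name}")
        if isinstance(value, bool):            # check bool before other types
            argv.append("true" if value else "false")
        elif isinstance(value, (list, tuple)):  # e.g. --zero_shot_tasks a b c
            argv.extend(str(v) for v in value)
        else:
            argv.append(str(value))
    return argv


awq_cmd = build_command({
    "quantization": "awq",
    "model_path": "/path/to/model",
    "device_map": "auto",
    "weight_bits": 4,
    "group_size": 128,
    "eval_ppl": True,
    "output_dir": "./results/awq",
})
print(shlex.join(awq_cmd))
# To execute, one could pass awq_cmd to subprocess.run(...) with
# CUDA_VISIBLE_DEVICES set in the environment.
```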
📦 Supported Algorithms
Use the method identifiers in the CLI column as command-line values. Display
names such as QA-LoRA, LLM-Pruner, and Wanda-SP are descriptive; the actual CLI
values are `qalora`, `llm_pruner`, and `wanda_sp`.
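Because hyphens are handled inconsistently (QA-LoRA drops its hyphen, while LLM-Pruner and Wanda-SP turn it into an underscore), an explicit lookup is safer than a string transformation. A small sketch, with `to_cli_id` as a hypothetical helper; the lowercase fallback reflects the pattern visible in the tables below, not a documented rule.

```python
# Display names whose CLI value is not simply the lowercased name.
IRREGULAR_CLI_IDS = {
    "QA-LoRA": "qalora",
    "LLM-Pruner": "llm_pruner",
    "Wanda-SP": "wanda_sp",
}


def to_cli_id(display_name: str) -> str:
    """Map a display name to its CLI value, falling back to lowercase."""
    return IRREGULAR_CLI_IDS.get(display_name, display_name.lower())


for name in ("AWQ", "QA-LoRA", "LLM-Pruner", "Wanda-SP", "SparseGPT"):
    print(f"{name} -> {to_cli_id(name)}")
```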
Quantization (11 Methods)
| Method | CLI | Family | Technique | NPU |
|---|---|---|---|---|
| AWQ | awq | PTQ | Weight-only with activation-aware scaling | ✅ |
| GPTQ | gptq | PTQ | Weight-only GPTQ quantization | ✅ |
| MQuant | mquant | PTQ | Multimodal GPTQ/AWQ for language & visual branches | ⏳ |
| OmniQuant | omniquant | PTQ | Learnable weight & activation transformation | ✅ |
| QuaRot | quarot | PTQ | Rotation-based W/A/KV quantization | ⏳ |
| SmoothQuant | smoothquant | PTQ | Activation smoothing for W/A quantization | ✅ |
| SpinQuant | spinquant | PTQ | Rotation-based W/A/KV with SpinQuant hooks | ⏳ |
| FlatQuant | flatquant | QAT | Trainable transformations | ✅ |
| QLoRA | qlora | QAT | Low-bit fake-quant adapter training | ✅ |
| QA-LoRA | qalora | QAT | Group-pooled adapter training | 🔶 |
| SplitQuant | splitquant | QAT | SplitQuant-style trainable transformations | ✅ |
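To make settings such as `--weight_bits 4 --group_size 128` concrete, the sketch below applies plain round-to-nearest (RTN) asymmetric quantization to a weight row, one group at a time. It is a generic illustration only: it is not AWQ or GPTQ, which add activation-aware scaling and error compensation on top of this basic step, and the function names are hypothetical.

```python
from typing import List, Tuple


def quantize_group(weights: List[float], bits: int = 4) -> Tuple[List[int], float, int]:
    """Asymmetric RTN quantization of one group: (codes, scale, zero_point)."""
    qmax = (1 << bits) - 1                # 15 for 4-bit codes
    lo = min(min(weights), 0.0)           # make the range include zero
    hi = max(max(weights), 0.0)
    scale = (hi - lo) / qmax or 1.0       # all-zero group: any scale works
    zero_point = round(-lo / scale)
    codes = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes: List[int], scale: float, zero_point: int) -> List[float]:
    """Map integer codes back to approximate floating-point weights."""
    return [(c - zero_point) * scale for c in codes]


def fake_quantize(row: List[float], bits: int = 4, group_size: int = 128) -> List[float]:
    """Quantize then dequantize a row group by group ("fake quant")."""
    out: List[float] = []
    for start in range(0, len(row), group_size):
        codes, scale, zero_point = quantize_group(row[start:start + group_size], bits)
        out.extend(dequantize_group(codes, scale, zero_point))
    return out
```

Smaller groups give each scale a narrower range to cover, which lowers rounding error at the cost of storing more scales and zero points.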
Pruning (7 Methods)
| Method | CLI | Type | Calibration | NPU |
|---|---|---|---|---|
| ALPS | alps | Unstructured / n:m | c4 | ✅ |
| FLAP | flap | Structured | wikitext2 | ✅ |
| LLM-Pruner | llm_pruner | Structured | c4 | ✅ |
| ShortGPT | shortgpt | Layer pruning | pg19 | ✅ |
| SparseGPT | sparsegpt | Unstructured / n:m | c4 | ✅ |
| Wanda | wanda | Unstructured / n:m | c4 | ✅ |
| Wanda-SP | wanda_sp | Structured | c4 | ✅ |
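An n:m pattern keeps n weights in every consecutive block of m, so 2:4 and 4:8 both give 50% sparsity with a regular layout. The sketch below builds such a mask from per-weight importance scores. It uses plain weight magnitude as the score for simplicity; the listed methods use their own criteria (Wanda, for example, also factors in activation norms), and `nm_mask` is a hypothetical name.

```python
from typing import List, Sequence


def nm_mask(scores: Sequence[float], n: int = 2, m: int = 4) -> List[int]:
    """Return a 0/1 mask keeping the n highest scores in each block of m."""
    if len(scores) % m != 0:
        raise ValueError("row length must be a multiple of m")
    mask = [0] * len(scores)
    for start in range(0, len(scores), m):
        block = range(start, start + m)
        # Indices of the n largest scores inside this block.
        for i in sorted(block, key=lambda i: scores[i], reverse=True)[:n]:
            mask[i] = 1
    return mask


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.0]
mask = nm_mask([abs(w) for w in weights], n=2, m=4)
pruned = [w * keep for w, keep in zip(weights, mask)]
print(mask)  # [1, 0, 0, 1, 0, 1, 1, 0]
```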
Finetuning
| Method | CLI | Scope | NPU |
|---|---|---|---|
| Compression LoRA | compression_lora | FlatQuant + fixed-mask pruning recovery | ⏳ |
Compression LoRA currently supports FlatQuant combined with fixed-shape pruning
masks, such as Wanda, SparseGPT, and ALPS. Structured pruning requires
pseudo-pruning mode.
✅ Ready | ⏳ In Progress | 🔶 CUDA Only
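The Compression LoRA constraints described above can be expressed as a small pre-flight check. This is a sketch under assumptions: the fixed-mask set contains only the three methods the text names, the structured set is taken from the repository layout, and both the function and the `pseudo_pruning` parameter are hypothetical rather than actual MindPipe options.

```python
from typing import Optional

FIXED_MASK_PRUNERS = {"wanda", "sparsegpt", "alps"}
STRUCTURED_PRUNERS = {"flap", "llm_pruner", "shortgpt", "wanda_sp"}


def check_compression_lora(quantization: str, pruning: str,
                           pseudo_pruning: bool = False) -> Optional[str]:
    """Return None if the combination looks supported, else a reason."""
    if quantization != "flatquant":
        return "Compression LoRA currently requires FlatQuant"
    if pruning in FIXED_MASK_PRUNERS:
        return None
    if pruning in STRUCTURED_PRUNERS:
        if pseudo_pruning:
            return None
        return "structured pruning requires pseudo-pruning mode"
    return f"unknown pruning method: {pruning}"


print(check_compression_lora("flatquant", "wanda"))      # None
print(check_compression_lora("flatquant", "wanda_sp"))   # reason string
```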
Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│                          main.py (CLI)                          │
├─────────────────────────────────────────────────────────────────┤
│                  workflow/ (Config + Executor)                  │
├──────────────┬──────────────┬──────────────┬────────────────────┤
│ Quantization │   Pruning    │  Finetuning  │     Evaluation     │
│ ┌──────────┐ │ ┌──────────┐ │ ┌──────────┐ │ ┌────────────────┐ │
│ │ PTQ (7)  │ │ │Structured│ │ │Compress. │ │ │ PPL            │ │
│ │ QAT (4)  │ │ │Unstruct. │ │ │LoRA      │ │ │ Zero-shot      │ │
│ └──────────┘ │ │LayerPrune│ │ └──────────┘ │ │ VLM Eval       │ │
│              │ └──────────┘ │              │ └────────────────┘ │
├──────────────┴──────────────┴──────────────┴────────────────────┤
│            algorithm/common/ (Shared Infrastructure)            │
│     Model Loading · Data · Device (GPU/NPU) · IO · Metrics      │
└─────────────────────────────────────────────────────────────────┘
```
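The README describes the architecture as registry-based. The sketch below shows the general pattern such a design usually follows: methods register themselves under a CLI name, and the executor looks them up by that name. All names here (`register`, `QUANTIZATION_REGISTRY`, `toy_rtn`, `run`) are hypothetical and are not taken from the MindPipe codebase.

```python
from typing import Any, Callable, Dict

QUANTIZATION_REGISTRY: Dict[str, Callable[..., Any]] = {}


def register(registry: Dict[str, Callable[..., Any]], name: str):
    """Decorator that stores a method implementation under a CLI name."""
    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        if name in registry:
            raise KeyError(f"{name!r} is already registered")
        registry[name] = fn
        return fn
    return decorator


@register(QUANTIZATION_REGISTRY, "toy_rtn")
def toy_rtn(model: Any, weight_bits: int = 4) -> Dict[str, Any]:
    """Placeholder standing in for a real method implementation."""
    return {"model": model, "weight_bits": weight_bits}


def run(registry: Dict[str, Callable[..., Any]], name: str, model: Any, **kwargs: Any) -> Any:
    """Dispatch to the registered method, with a helpful error if missing."""
    if name not in registry:
        raise ValueError(f"unknown method {name!r}; choose from {sorted(registry)}")
    return registry[name](model, **kwargs)


print(run(QUANTIZATION_REGISTRY, "toy_rtn", "my-model", weight_bits=3))
```

With this pattern, adding an algorithm means adding one decorated function; the dispatch code does not change.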
Repository Layout
```
MindPipe/
├── main.py                 # Unified CLI entrypoint
├── algorithm/
│   ├── common/             # Shared model, data, device, IO utilities
│   ├── quantization/
│   │   ├── ptq/            # AWQ, GPTQ, MQuant, OmniQuant, QuaRot, SmoothQuant, SpinQuant
│   │   └── qat/            # FlatQuant, QLoRA, QA-LoRA, SplitQuant
│   ├── finetuning/         # Compression LoRA recovery finetuning
│   └── pruning/
│       ├── structured/     # FLAP, LLM-Pruner, ShortGPT, Wanda-SP
│       └── unstructured/   # ALPS, SparseGPT, Wanda
├── workflow/               # CLI config builder and stage executor
├── evaluation/             # PPL, lm-eval-harness, and VLMEvalKit runners
├── configs/                # Shared and algorithm-specific configs
├── scripts/                # Batch and reproducibility scripts
└── third_party/            # Optional external evaluation tools
```
🤖 Model Coverage
<table> <tr><td>

| Model Family | Text | Vision |
|---|---|---|
| LLaMA-2 / LLaMA-3 | β | β |
| Qwen2.5 | β | β |
| Qwen3 | β | β |
| Qwen3.5 / Qwen3.6 dense | β | β |
| Model Family | Text | Vision |
|---|---|---|
| Qwen2-VL | β | β |
| Qwen2.5-VL | β | β |
| Qwen3-VL | β | β |
| MiniCPM-V | β | β |
| LLaVA / InternVL | β | πΆ |
Note: Model support is algorithm-dependent. Check
`algorithm/quantization/*/*/method.py` or `algorithm/pruning/*/*/method.py` for exact coverage.
MoE variants are method-dependent and are not covered by Compression LoRA yet.
Configuration Reference
<details>
<summary><b>⚙️ Common Arguments</b></summary>

| Argument | Default | Description |
|---|---|---|
| `--model_path` | Required | Local or Hugging Face model path |
| `--device` | auto | Logical device used by runtime helpers |
| `--device_map` | None | Required for pruning/quantization (auto recommended) |
| `--dtype` | bfloat16 | auto, float16, or bfloat16 |
| `--attn_implementation` | flash_attention_2 | flash_attention_2, sdpa, or eager |
| `--calibration_dataset` | Method default | wikitext2, c4, pileval, pg19, or bookcorpus |
| `--evaluation_dataset` | wikitext2 | Dataset used for PPL evaluation |
| `--calibration_samples` | 128 | Number of calibration samples |
| `--sequence_length` | 2048 | Sequence length for calibration and evaluation |
| `--batch_size` | 1 | PPL batch size |
| `--max_eval_chunks` | 64 | Optional cap for PPL chunks |
| `--eval_ppl` | false | Enable perplexity evaluation |
| `--eval_zero_shot` | false | Enable lm-eval-harness tasks |
| `--eval_vlm` | false | Enable VLMEvalKit evaluation |

| Argument | Default | Description |
|---|---|---|
| `--quantization` | None | One of the registered quantization methods |
| `--weight_bits` | 4 | Weight quantization bit width |
| `--activation_bits` | 16 | Activation quantization bit width |
| `--query_bits` | 16 | Query activation bit width |
| `--key_bits` | 16 | Key cache bit width |
| `--value_bits` | 16 | Value cache bit width |
| `--group_size` | 128 | Default group size |
| `--weight_group_size` | None | Overrides weight group size |
| `--activation_group_size` | None | Overrides activation group size |
| `--kv_group_size` | None | Overrides KV group size |
| `--weight_method` | gptq | Weight method for methods supporting GPTQ/RTN |

| Argument | Default | Description |
|---|---|---|
| `--pruning` | None | One of the registered pruning methods |
| `--sparsity_ratio` | 0.5 | Target sparsity ratio |
| `--structure_pattern` | unstructured | unstructured, 2:4, or 4:8 |
| `--block_size` | 128 | Block size for supported pruning methods |
| `--damp_percent` | 0.01 | Hessian damping ratio for second-order methods |

</details>
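The three `*_group_size` arguments default to None and are described as overrides of `--group_size`. One plausible reading, sketched below, is that each unset override falls back to the shared default. This fallback rule is an assumption drawn from the table wording, and `resolve_group_sizes` is a hypothetical helper.

```python
from typing import Dict, Optional


def resolve_group_sizes(group_size: int = 128,
                        weight_group_size: Optional[int] = None,
                        activation_group_size: Optional[int] = None,
                        kv_group_size: Optional[int] = None) -> Dict[str, int]:
    """Return the effective group size per tensor kind."""
    def pick(override: Optional[int]) -> int:
        # An unset override (None) inherits the shared --group_size value.
        return group_size if override is None else override

    return {
        "weight": pick(weight_group_size),
        "activation": pick(activation_group_size),
        "kv": pick(kv_group_size),
    }


print(resolve_group_sizes(group_size=128, weight_group_size=64))
# {'weight': 64, 'activation': 128, 'kv': 128}
```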
🔧 Installation
Prerequisites
- Python 3.10+
- PyTorch 2.0+
- CUDA 11.8+ (for GPU) or Ascend CANN (for NPU)
Setup
```bash
# Clone the repository
git clone https://github.com/your-org/MindPipe.git
cd MindPipe

# Create environment
conda create -n mindpipe python=3.10 -y
conda activate mindpipe

# Install dependencies
git submodule update --init --recursive
pip install -r requirements.txt
```
Optional: VLMEvalKit
For multimodal evaluation, initialize the VLMEvalKit submodule or set VLMEVALKIT_ROOT:
```bash
git submodule update --init third_party/VLMEvalKit
# or
export VLMEVALKIT_ROOT=/path/to/existing/VLMEvalKit
```
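With a CLI flag, an environment variable, and a submodule path all able to point at VLMEvalKit, a launcher has to pick one. The sketch below shows one possible lookup order: `--vlm_eval_kit_root` first, then `VLMEVALKIT_ROOT`, then `third_party/VLMEvalKit`. That precedence is an assumption for illustration, not documented MindPipe behavior, and `resolve_vlmevalkit_root` is a hypothetical helper.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_vlmevalkit_root(cli_value: Optional[str] = None,
                            repo_root: str = ".",
                            env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the first candidate location that exists as a directory."""
    env = os.environ if env is None else env
    candidates = [
        cli_value,                                        # --vlm_eval_kit_root
        env.get("VLMEVALKIT_ROOT"),                       # environment variable
        Path(repo_root) / "third_party" / "VLMEvalKit",   # git submodule
    ]
    for candidate in candidates:
        if candidate and Path(candidate).is_dir():
            return Path(candidate)
    raise FileNotFoundError(
        "VLMEvalKit not found; initialize the submodule or set VLMEVALKIT_ROOT"
    )
```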
Reproducibility
The scripts/repro/ directory contains ready-to-use benchmark launchers:
```bash
# Dry run (print commands without executing)
DRY_RUN=true bash scripts/repro/run_qlora_adapted_models_text_suite.sh

# Filter specific models
MODEL_FILTER=qwen3 bash scripts/repro/run_mquantpp_awq_vlm_serial_suite.sh
```
Available scripts include:
- `run_qlora_adapted_models_text_suite.sh`
- `run_qalora_adapted_models_text_suite.sh`
- `run_mquantpp_awq_vlm_serial_suite.sh`
- `run_qwen2_5_vl_gptq_vlm_suite.sh`
- `run_qwen3_vl_2b_gptq_suite.sh`
Compression LoRA finetuning launchers are grouped under scripts/finetuning/:
- `flatquant_lora_auto_gpu.sh` automatically dispatches by model config.
- `llm/flatquant_lora_llm_gpu.sh` is for text-only LLMs.
- `vlm/flatquant_lora_vlm_gpu.sh` is for MiniCPM-V, Qwen2.5-VL, and Qwen3-VL.
- `qwen3_5/flatquant_lora_qwen3_5_gpu.sh` is for dense Qwen3.5/Qwen3.6 VLMs.
Output Structure
```
results/
├── <model>/<algorithm>/<run_spec>/
│   ├── metrics.json     # Evaluation results & run metadata
│   └── artifacts.json   # Algorithm details, calibration settings, checkpoints
└── <model>/<execution_order>/<algorithm1>__<algorithm2>/<run_spec>/
    └── metrics.json
```
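Because every run writes a `metrics.json` at a predictable depth, results can be gathered with a recursive glob. The sketch below builds the two directory shapes shown above and collects all metrics files under a results root. The helper names are hypothetical, and the contents of `metrics.json` are treated as opaque JSON since the schema is not documented here.

```python
import json
from pathlib import Path
from typing import Any, Dict


def run_dir(root: str, model: str, algorithm: str, run_spec: str) -> Path:
    """Directory for a single-algorithm run."""
    return Path(root) / model / algorithm / run_spec


def workflow_dir(root: str, model: str, execution_order: str,
                 first: str, second: str, run_spec: str) -> Path:
    """Directory for a combined run; the two stages are joined by '__'."""
    return Path(root) / model / execution_order / f"{first}__{second}" / run_spec


def collect_metrics(root: str) -> Dict[str, Any]:
    """Map each run directory (relative to root) to its parsed metrics.json."""
    base = Path(root)
    return {
        path.parent.relative_to(base).as_posix(): json.loads(path.read_text())
        for path in sorted(base.rglob("metrics.json"))
    }
```

For example, `collect_metrics("./results")` returns one entry per run, keyed by a path such as `<model>/<algorithm>/<run_spec>`.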
⚠️ Known Limitations
| Limitation | Status |
|---|---|
| QuaRot / SpinQuant not NPU-ready | ⏳ |
| MQuant GPU-only | ⏳ |
| QA-LoRA CUDA-only, no AutoGPTQ export | 🔶 |
| QLoRA W2/W3 use fake-quant fallback on NPU | ℹ️ |
| Compression LoRA currently requires FlatQuant + fixed-mask pruning | ℹ️ |
| Qwen3.5-MoE / Qwen3.6-35B-A3B Compression LoRA is not supported yet | ⏳ |
| Custom runtime wrapper reload is method-dependent | ℹ️ |
Citation & Acknowledgements
MindPipe builds upon the following outstanding research. Please cite the original papers when using their methods:
<details>
<summary><b>Click to see referenced works</b></summary>

- AWQ: Activation-aware Weight Quantization
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
- QuaRot: Outlier-Free Quantization via Rotations
- SpinQuant: Rotation-Based Quantization
- FlatQuant: Flatness-Aware Quantization
- SmoothQuant: Accurate and Efficient Post-Training Quantization
- OmniQuant: Omnidirectionally Calibrated Quantization
- SplitQuant: Split Quantization
- QLoRA: Efficient Finetuning of Quantized LLMs
- QA-LoRA: Quantization-Aware Low-Rank Adaptation
- Wanda: Pruning by Weights and Activations
- SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot
- FLAP: Fluctuation-based Adaptive Structured Pruning
- ShortGPT: Layers in LLMs are More Redundant Than You Expect
- LLM-Pruner: On the Structural Pruning of Large Language Models
- ALPS: Adaptive Layer-wise Pruning and Sparsification

</details>
<div align="center">
⭐ If you find MindPipe useful, please consider giving it a star!
Built with ❤️ for the model compression community
</div>
