AmusementClub/vs-mlrt

Efficient CPU/GPU ML Runtimes for VapourSynth (with built-in support for waifu2x, DPIR, RealESRGANv2/v3, Real-CUGAN, RIFE, SCUNet, ArtCNN and more!)

30 Releases
Latest: 2d ago

v16.1.test1 (Latest, Pre-release)
github-actions[bot] · June 25, 2026

📦 TRT

  • Upgraded to TensorRT [11.1](https://docs.nvidia.com/deeplearning/tensorrt/11.1.0/getting-started/release-notes-11/11.1.0.html).

📦 General

  • Upgraded to CUDA 13.3.0.
  • Full Changelog: https://github.com/AmusementClub/vs-mlrt/compare/v16.test1...v16.1.test1

v16.test1 (Pre-release)
github-actions[bot] · June 6, 2026

📦 vsmlrt.py

  • __Breaking__: For the `TRT` backend, fp16 inference of built-in models requires either the [`onnxconverter-common`](https://pypi.org/project/onnxconverter-common/) or the [`nvidia-modelopt`](https://github.com/nvidia/Model-Optimizer#install) Python package to be installed. bf16 inference is not supported yet.
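
A script can check up front whether either package is importable before requesting fp16 on the `TRT` backend. The sketch below is only an illustration and is not taken from vsmlrt.py: the import names `onnxconverter_common` and `modelopt` are assumptions about how the two PyPI packages expose themselves, and `available_fp16_converters` is a hypothetical helper.

```python
from importlib.util import find_spec
from typing import Callable, Optional

# Assumed import names for the two PyPI packages named in the release note.
FP16_CONVERTER_MODULES = {
    "onnxconverter-common": "onnxconverter_common",
    "nvidia-modelopt": "modelopt",
}


def available_fp16_converters(
    finder: Callable[[str], Optional[object]] = find_spec,
) -> list[str]:
    """Return the PyPI names of the fp16 converter packages that can be imported."""
    return [
        package
        for package, module in FP16_CONVERTER_MODULES.items()
        if finder(module) is not None
    ]


if __name__ == "__main__":
    found = available_fp16_converters()
    if found:
        print("fp16 conversion packages found:", ", ".join(found))
    else:
        print("install onnxconverter-common or nvidia-modelopt for TRT fp16 inference")
```

The `finder` argument exists only so the check can be exercised without installing either package.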

📦 TRT

  • Upgraded to TensorRT [11.0](https://docs.nvidia.com/deeplearning/tensorrt/11.0.0/getting-started/release-notes-11/11.0.0.html).

📦 TRT-RTX

  • Upgraded to TensorRT-RTX [1.5](https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/getting-started/release-notes-1/1.5.html).

📦 General

  • Upgraded to CUDA 13.2.1.
  • Full Changelog: https://github.com/AmusementClub/vs-mlrt/compare/v15.16...v16.test1

v15.16: latest TensorRT libraries
github-actions[bot] · March 26, 2026

📦 TRT

  • Upgraded to TensorRT [10.16.0](https://docs.nvidia.com/deeplearning/tensorrt/10.16.0/getting-started/release-notes-10/10.16.0.html).

📦 TRT-RTX

  • Upgraded to TensorRT-RTX [1.4](https://docs.nvidia.com/deeplearning/tensorrt-rtx/1.4/getting-started/release-notes-1/1.4.html).

📦 General

  • Upgraded to CUDA 13.2.0.
  • Full Changelog: https://github.com/AmusementClub/vs-mlrt/compare/v15.15...v15.16

v15.15: latest TensorRT libraries
github-actions[bot] · February 5, 2026

📦 TRT

  • Upgraded to TensorRT [10.15.1](https://docs.nvidia.com/deeplearning/tensorrt/10.15.1/getting-started/release-notes-10/10.15.1.html).

📦 General

  • Upgraded to CUDA 13.1.1 and cuDNN [9.19.0](https://docs.nvidia.com/deeplearning/cudnn/backend/v9.19.0/release-notes.html#cudnn-9-19-0).

📦 vsmlrt.py

  • Added support for ArtCNN v1.5.0 models.
  • Added bf16 I/O support for the `MIGX` backend.
  • Full Changelog: https://github.com/AmusementClub/vs-mlrt/compare/v15.14...v15.15

v15.14.rtx: latest TensorRT-RTX libraries (Pre-release)
github-actions[bot] · November 11, 2025

📦 TRT-RTX

  • Upgraded to TensorRT-RTX [1.2](https://docs.nvidia.com/deeplearning/tensorrt-rtx/v1.2/getting-started/release-notes.html#tensorrt-rtx-1-2).
  • Added engine validity check for debugging invalid engines.
  • Full Changelog: https://github.com/AmusementClub/vs-mlrt/compare/v15.14...v15.14.rtx

v15.14: latest TensorRT libraries
github-actions[bot] · November 8, 2025

📦 TRT

  • Upgraded to TensorRT [10.14.1](https://docs.nvidia.com/deeplearning/tensorrt/10.14.1/getting-started/release-notes.html#tensorrt-10-14-1).

📦 General

  • Upgraded to CUDA 13.0.2 and cuDNN [9.13.0](https://docs.nvidia.com/deeplearning/cudnn/backend/v9.13.0/release-notes.html#cudnn-9-13-0).

📦 ORT

  • Upgraded to ONNX Runtime 1.23.0 ([`ecb26fb`](https://github.com/microsoft/onnxruntime/tree/ecb26fb7754d7c9edf24b1844ea807180a2e3e23)).

📦 NCNN_VK

  • Upgraded to the latest ncnn ([`86efe80`](https://github.com/Tencent/ncnn/tree/86efe80b50408bfeca79761edcb3fa4b4e513331)) to fix hangs with NVIDIA 565 or later drivers.
  • Added support for fp16 I/O, similar to other existing supported backends.

📦 vsmlrt.py

  • Added support for ArtCNN R16F96 Chroma model.
  • Added `output_format` parameter to non-cuda ort backends.
  • Added fp16 I/O support for the `TRT_RTX` backend.
  • Added optional support for fp16 conversion using [TensorRT model optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) for `TRT_RTX`.
  • Attempt to regenerate the engine after the failure of engine compilation for `TRT`, `MIGX` and `TRT_RTX`.
  • Remove extraneous plugin check by @Rukario in https://github.com/AmusementClub/vs-mlrt/pull/135
  • Improve `TRT_RTX` in handling fp16 conversion and standalone usage by @abihf in https://github.com/AmusementClub/vs-mlrt/pull/140
  • fix: use correct path for checking alter engine size by @shssoichiro in https://github.com/AmusementClub/vs-mlrt/pull/144
  • + 1 more

v15.13.ncnn (Pre-release)
github-actions[bot] · September 26, 2025

📦 NCNN_VK

  • Upgraded to the latest ncnn ([`86efe80`](https://github.com/Tencent/ncnn/tree/86efe80b50408bfeca79761edcb3fa4b4e513331)) to fix hangs with NVIDIA 565 or later drivers.
  • Added support for fp16 I/O, similar to other existing supported backends.

📦 vsmlrt.py

  • Added support for ArtCNN [v1.4.0](https://github.com/Artoriuz/ArtCNN/releases/tag/v1.4.0) models.

📦 Known issue

  • Using the `NCNN_VK(fp16=True)` backend on the ArtCNN R8F64 chroma model may exhibit chroma shift with irregular input resolutions.
  • Full Changelog: https://github.com/AmusementClub/vs-mlrt/compare/v15.13.cu13...v15.13.ncnn

v15.13.cu13: latest TensorRT libraries (Pre-release)
github-actions[bot] · September 6, 2025

📦 TRT

  • Upgraded to TensorRT [10.13.3](https://docs.nvidia.com/deeplearning/tensorrt/10.13.3/getting-started/release-notes.html#tensorrt-10-13-3).

📦 General

  • Upgraded to CUDA 13.0.1 and cuDNN [9.13.0](https://docs.nvidia.com/deeplearning/cudnn/backend/v9.13.0/release-notes.html#cudnn-9-13-0).

📦 vsmlrt.py

  • Attempt to regenerate the engine after the failure of engine compilation for `TRT`, `MIGX` and `TRT_RTX`.

📦 ORT

  • Upgraded to ONNX Runtime 1.23.0 ([`ecb26fb`](https://github.com/microsoft/onnxruntime/tree/ecb26fb7754d7c9edf24b1844ea807180a2e3e23)).
  • Full Changelog: https://github.com/AmusementClub/vs-mlrt/compare/v15.13.ort...v15.13.cu13

v15.13.ort: latest ONNX Runtime libraries (Pre-release)
github-actions[bot] · August 31, 2025

📦 ORT

  • Upgraded to ONNX Runtime 1.23.0 ([`4754a1d`](https://github.com/microsoft/onnxruntime/tree/4754a1d64e5920a715b0396906f339e6c15742a0)) and added support for Nvidia RTX 50-series GPUs.
  • Support for attention operations in ONNX Runtime for LLMs is disabled.
  • Support for 900- and 10-series GPUs is dropped from `ORT_CUDA`.

📦 General

  • Upgraded to cuDNN [9.12.0](https://docs.nvidia.com/deeplearning/cudnn/backend/v9.12.0/release-notes.html#cudnn-9-12-0).

📦 vsmlrt.py

  • Added optional support for fp16 conversion using [TensorRT model optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) for `TRT_RTX`.

📦 Community contributions

  • `TRT_RTX` improvements by @abihf in https://github.com/AmusementClub/vs-mlrt/pull/140

📦 Known issues

  • fp16 inference for RIFE v2 and SAFA models and fp32/fp16 inference for some SwinIR models do not currently work in `TRT_RTX`.
  • Any old cuDNN v8 installation should be removed; otherwise, DLL loading may not work.
  • Full Changelog: https://github.com/AmusementClub/vs-mlrt/compare/v15.13.rtx...v15.13.ort

v15.13.rtx: experimental TensorRT-RTX backend (Pre-release)
github-actions[bot] · August 15, 2025

📦 TRT-RTX

  • Upgraded to TensorRT-RTX [1.1](https://docs.nvidia.com/deeplearning/tensorrt-rtx/v1.1/getting-started/release-notes.html#tensorrt-rtx-1-1).

📦 vsmlrt.py

  • Added support for ArtCNN R16F96 Chroma model.
  • Added `output_format` parameter to non-cuda ort backends.
  • Added fp16 I/O support for the `TRT_RTX` backend.

📦 Community contributions

  • Remove extraneous plugin check for RIFE by @Rukario in https://github.com/AmusementClub/vs-mlrt/pull/135

📦 Known issues

  • fp16 inference for RIFE v2 and SAFA models is currently not supported in the `TRT_RTX` backend.
  • Full Changelog: https://github.com/AmusementClub/vs-mlrt/compare/v15.13...v15.13.rtx

v15.13: latest TensorRT libraries
github-actions[bot] · July 24, 2025

📦 TRT

  • Upgraded to TensorRT [10.13.0](https://docs.nvidia.com/deeplearning/tensorrt/10.13.0/getting-started/release-notes.html#tensorrt-10-13-0) and CUDA 12.9.1.

📦 vsmlrt.py

  • Fix input name.
  • Fix error handling for `Expr`.

📦 TRT-RTX

  • Added support for dynamic shapes.
  • Full Changelog: https://github.com/AmusementClub/vs-mlrt/compare/v15.12...v15.13

v15.12: latest TensorRT libraries
github-actions[bot] · June 13, 2025

📦 TRT

  • Upgraded to TensorRT [10.12.0](https://docs.nvidia.com/deeplearning/tensorrt/10.12.0/getting-started/release-notes.html#tensorrt-10-12-0).

📦 vsmlrt.py

  • Added support for the [SAFA](https://github.com/hzwer/Practical-RIFE/tree/9aff2a278b1fb5085e137b4f4b748e518bf7ab26?tab=readme-ov-file#video-enhancement) v0.5 models.
  • Prioritize the use of `akarin.Expr`.
  • Fix tile size check in `SAFA()`.

📦 misc

  • Fix tile size check in `vsort` and `vsov`.
  • Added __experimental__ support for the [TensorRT-RTX](https://developer.nvidia.com/tensorrt-rtx) library. This `TRT_RTX` backend is under development; see the pre-releases with the `.rtx` suffix.
  • Full Changelog: https://github.com/AmusementClub/vs-mlrt/compare/v15.11...v15.12

v15.11.rtx: experimental TensorRT-RTX backend (Pre-release)
github-actions[bot] · June 12, 2025

📦 Known issues

  • Dynamic shape is not supported.
  • For the vsmlrt.py wrapper, fp16 processing currently requires the [onnxconverter-common](https://pypi.org/project/onnxconverter-common/) package, and fp16 input/output is not supported.
  • Full Changelog: https://github.com/AmusementClub/vs-mlrt/compare/v15.11...v15.11.rtx

v15.11: latest TensorRT libraries
github-actions[bot] · May 15, 2025

📦 TRT

  • Upgraded to TensorRT [10.11.0](https://docs.nvidia.com/deeplearning/tensorrt/10.11.0/getting-started/release-notes.html#tensorrt-10-11-0).
  • Full Changelog: https://github.com/AmusementClub/vs-mlrt/compare/v15.10...v15.11

v15.10: latest TensorRT libraries
github-actions[bot] · May 1, 2025

📦 TRT

  • Upgraded to TensorRT [10.10.0](https://docs.nvidia.com/deeplearning/tensorrt/10.10.0/getting-started/release-notes.html#tensorrt-10-10-0).

📦 NCNN_VK

  • Added `DeviceProperties()` to query information about a Vulkan device.

📦 General

  • Upgraded to CUDA 12.9.0.
  • Full Changelog: https://github.com/AmusementClub/vs-mlrt/compare/v15.9...v15.10

v15.9: latest TensorRT libraries
github-actions[bot] · March 5, 2025

📦 TRT

  • Upgraded to TensorRT [10.9.0](https://docs.nvidia.com/deeplearning/tensorrt/10.9.0/getting-started/release-notes.html#tensorrt-10-9-0).

📦 vsmlrt.py

  • Better overlap defaults for SCUNet and DPIR by @damster101 in https://github.com/AmusementClub/vs-mlrt/pull/126
  • Full Changelog: https://github.com/AmusementClub/vs-mlrt/compare/v15.8...v15.9

v15.8: Blackwell support, latest TensorRT and MIGraphX libraries
github-actions[bot] · January 24, 2025

📦 TRT

  • Upgraded to TensorRT [10.8.0](https://docs.nvidia.com/deeplearning/tensorrt/10.8.0/getting-started/release-notes.html#tensorrt-10-8-0), which adds support for Blackwell GPUs including RTX 50-series.
  • The release archive is split into 2GB volumes (`.7z.001`, `.7z.002`).

📦 MIGX

  • Upgraded to MIGraphX 2.12.0 [`6acc1f9`](https://github.com/ROCm/AMDMIGraphX/commit/6acc1f957bab2d2b23b3adffccd29f7e10178986).

📦 General

  • Upgraded to CUDA 12.8.0 and HIP 6.2.4.

📦 vsmlrt.py

  • Added support for [ArtCNN](https://github.com/Artoriuz/ArtCNN) v1.2.0 models.
  • Add `tiling_optimization_level` and `l2_limit_for_tiling` options to the `TRT` backend for memory-bandwidth-limited models ([docs](https://docs.nvidia.com/deeplearning/tensorrt/10.8.0/inference-library/advanced.html#tiling-optimization)).
  • Full Changelog: https://github.com/AmusementClub/vs-mlrt/compare/v15.7...v15.8

v15.7: latest TensorRT libraries, ONNX Runtime and MIGraphX interface improvements
github-actions[bot] · December 3, 2024

📦 TRT

  • Upgraded to TensorRT [10.7.0](https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-1070/release-notes/index.html#rel-10-7-0).

📦 ORT_DML

  • Fixed blank output for the first returned frame reported by @Mr-Z-2697 in https://github.com/AmusementClub/vs-mlrt/issues/117

📦 MIGX

  • Allow `num_streams` > 1 by @abihf in https://github.com/AmusementClub/vs-mlrt/pull/113

📦 ORT_COREML

  • Add support for ML program by @yuygfgg in https://github.com/AmusementClub/vs-mlrt/pull/116

📦 General

  • Upgraded to CUDA 12.6.3.

📦 vsmlrt.py

  • Added support for RIFE v4.26 heavy model.
  • Full Changelog: https://github.com/AmusementClub/vs-mlrt/compare/v15.6...v15.7

v15.6: latest TensorRT and OpenVINO libraries
github-actions[bot] · November 1, 2024

📦 TRT

  • Upgraded to TensorRT [10.6.0](https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-1060/release-notes/index.html#rel-10-6-0).

📦 OV

  • Upgraded to OpenVINO [2024.5.0](https://github.com/openvinotoolkit/openvino/tree/5833781ddbc476d77cf5593f1f8b34758988b9a8), which adds support for Xe2 GPU and NPU 4 on Lunar Lake.

📦 MIGX

  • Fix missing precision check.

📦 General

  • Upgraded to CUDA 12.6.2.

📦 vsmlrt.py

  • Added support for RIFE v4.25 lite and heavy models.
  • Full Changelog: https://github.com/AmusementClub/vs-mlrt/compare/v15.5...v15.6

v15.5: latest TensorRT library, CoreML backend
github-actions[bot] · October 1, 2024

📦 TRT

  • Upgraded to TensorRT [10.5.0](https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-1050/release-notes/index.html#rel-10-5-0).
  • Volta GPUs (TITAN V, V100) are no longer supported.

📦 ORT

  • Fix macOS CoreML support for vsort by @yuygfgg in https://github.com/AmusementClub/vs-mlrt/pull/106.
  • This pull request also added the `ORT_COREML` backend to vsmlrt.py.

📦 General

  • Upgraded to CUDA 12.6.1.

📦 vsmlrt.py

  • Added support for RIFE v4.25 and v4.26 models.
  • Added automatic batch inference support via `batch_size` option in `inference()` and `flexible_inference()`, which may improve device utilization for inference on small inputs using some small models.
  • On the one hand, batching improves utilization by creating more work for each kernel invocation and by reducing the quantization inefficiency of kernel tiles in bulk parallelism. It also reduces the average kernel launch and synchronization overhead per unit of work.
  • On the other hand, batching causes cache misses and inserts bubbles into the pipeline, which may degrade performance.
  • This feature requires flexible output support starting with vs-mlrt v15 and is inspired by https://github.com/styler00dollar/VSGAN-tensorrt-docker/commit/ac47012b9313fe76f37914601dafea82df0e94e6.
  • Note that not all onnx models are supported.
  • Future RIFE v2 models will be fixed to support batch inference.
  • benchmark:
  • + 17 more
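
The "quantization inefficiency of kernel tiles" mentioned above can be illustrated with a toy model. It assumes that a kernel splits its work into fixed-size tiles, that work from all batch elements can share tiles, and that a partially filled last tile costs as much as a full one. The tile size and work counts are made-up numbers, not measurements of any vs-mlrt model, and `tile_efficiency` is a hypothetical helper.

```python
import math


def tile_efficiency(work_items: int, tile_size: int) -> float:
    """Fraction of the launched tile capacity that does useful work."""
    if work_items <= 0 or tile_size <= 0:
        raise ValueError("work_items and tile_size must be positive")
    tiles = math.ceil(work_items / tile_size)
    return work_items / (tiles * tile_size)


if __name__ == "__main__":
    per_frame_work = 100  # hypothetical work items produced by one small input
    tile_size = 64        # hypothetical tile capacity
    for batch_size in (1, 2, 4, 8):
        eff = tile_efficiency(per_frame_work * batch_size, tile_size)
        print(f"batch_size={batch_size}: tile efficiency {eff:.3f}")
```

In this model larger batches tend to waste less of the final tile. It says nothing about the cache and pipeline costs noted above, which can outweigh the gain.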

v15.4: latest TensorRT library
github-actions[bot] · September 7, 2024

📦 TRT

  • Upgraded to TensorRT [10.4.0](https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-1040/release-notes/index.html#rel-10-4-0).

📦 General

  • Upgraded to CUDA 12.6.0.

📦 vsmlrt.py

  • Added support for Ani4K-v2 model by @srk24 in https://github.com/AmusementClub/vs-mlrt/pull/105
  • Added support for RIFE v4.23 and v4.24 models.
  • Add `max_tactics` option to the `TRT` backend, which can reduce engine build time by limiting the number of tactics to time.
  • By default, TensorRT will determine the number of tactics based on its own heuristic.

📦 Batch Inference (Preview)

  • This feature requires flexible output support starting with vs-mlrt v15 and is inspired by https://github.com/styler00dollar/VSGAN-tensorrt-docker/commit/ac47012b9313fe76f37914601dafea82df0e94e6.
  • Note that not all onnx models are supported.
  • Preliminary benchmark:
  • NVIDIA GeForce RTX 4090
  • driver 560.94
  • Windows Server 2019
  • python 3.12.6, vapoursynth-classic R57.A10
  • input: 720x480 RGBS
  • + 12 more

v15.3: MIGraphX on Windows
github-actions[bot] · August 21, 2024

📦 MIGX

  • Add __experimental__ [MIGraphX](https://github.com/ROCm/AMDMIGraphX) support on Windows. MIGraphX is AMD's graph optimization engine to accelerate machine learning model inference.
  • [Supported GPUs](https://rocm.docs.amd.com/projects/install-on-windows/en/docs-6.1.2/reference/system-requirements.html#windows-supported-gpus):
  • gfx1030: Radeon RX 6950 XT, Radeon RX 6900 XT, Radeon RX 6800 XT, Radeon RX 6800, ...
  • gfx1100: Radeon RX 7900 XTX, Radeon RX 7900 XT, ...
  • gfx1101: Radeon RX 7700 XT, ...
  • gfx1102: Radeon RX 7600
  • Relevant archives include:
  • [`vsmlrt-windows-x64-migraphx.<version>.7z`](https://github.com/AmusementClub/vs-mlrt/releases/download/v15.3/vsmlrt-windows-x64-migraphx.v15.3.7z): the all-in-one archive, contains `vsmlrt.py` Python wrapper, some built-in ONNX models, `vsmigx`/`vsov`/`vsort`/`vsncnn` plugins and runtime.
  • + 5 more

📦 Known limitation

  • The `MIGX` backend in the vsmlrt.py wrapper does not support device selection and will always use the default device (`device_id=0`).

📦 vsmlrt.py

  • Added support for RIFE v4.22 (lite) models.
  • Full Changelog: https://github.com/AmusementClub/vs-mlrt/compare/v15.2...v15.3

v15.2: latest TensorRT library
github-actions[bot] · August 7, 2024

📦 TRT

  • Upgraded to TensorRT [10.3.0](https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-1030/release-notes/index.html#rel-10-3-0).
  • Fixed performance regression of RIFE and SAFA models starting with vs-mlrt v14.test4. This version may still be slightly slower than vs-mlrt [v14.test3](https://github.com/AmusementClub/vs-mlrt/releases/tag/v14.test3) under some conditions, however.

📦 General

  • Upgraded to CUDA 12.5.1.

📦 vsmlrt.py

  • Added support for RIFE v4.19 ~ v4.21 models.
  • Added support for ArtCNN R8F64 (chroma) models.
  • Deprecated ArtCNN C4F32 models at the developer's request; compatibility at the vsmlrt.py level remains guaranteed.
  • Full Changelog: https://github.com/AmusementClub/vs-mlrt/compare/v15.1...v15.2

v15.1: latest TensorRT library
github-actions[bot] · July 4, 2024

📦 TRT

  • Upgraded to TensorRT [10.2.0](https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-1020/release-notes/index.html#rel-10-2-0).
  • Add TensorRT release package (`vsmlrt-windows-x64-tensorrt`). https://github.com/AmusementClub/vs-mlrt/issues/102
  • This package is a strict subset of the CUDA release package, with cuDNN, cuBLAS libraries and support for `ORT_CUDA` backend removed.
  • It supports `TRT`, `OV_*`, `ORT_CPU`, `ORT_DML` and `NCNN_VK` backends.

📦 known issue

  • According to the [documentation](https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-1020/release-notes/index.html#rel-10-2-0), `There is an up to 4x performance regression for networks containing "GridSample" ops compared to TensorRT 9.2.` This affects RIFE and SAFA models.
  • vs-mlrt [v14.test3](https://github.com/AmusementClub/vs-mlrt/releases/tag/v14.test3) is the latest release that is not affected. This will be fixed in the next release by TensorRT 10.3.0.

📦 General

  • Upgraded to CUDA 12.5.0.

📦 vsmlrt.py

  • Added support for [RIFE v4.17 lite and v4.18](https://github.com/hzwer/Practical-RIFE/tree/e992041754c9a81e0566a0adce7589f79e12f0e8?tab=readme-ov-file#model-list) models.
  • Full Changelog: https://github.com/AmusementClub/vs-mlrt/compare/v15...v15.1

v15: latest TensorRT library
github-actions[bot] · June 15, 2024

📦 plugins

  • Added parameter `flexible_output_prop` for flexible output:
  • Traditionally, all plugins could only support onnx models with one or three output channels, due to VapourSynth's limitation.
  • By using the new flexible output feature, plugins can support onnx models with an arbitrary number of output planes.

```python3
from typing import TypedDict

import vapoursynth as vs


class Output(TypedDict):
    clip: vs.VideoNode
    num_planes: int
```

  • + 10 more

📦 vsmlrt.py

  • Added support for [RIFE v4.17](https://github.com/hzwer/Practical-RIFE/tree/f3e48ceb02e4c21bc8868b03994b98f3402ffb3d?tab=readme-ov-file#model-list) models.
  • Added support for [ArtCNN](https://github.com/Artoriuz/ArtCNN) models optimised for anime content. The chroma variants are not supported on previous versions of vs-mlrt, because they require the flexible output feature.
  • Added function `flexible_inference` for flexible output. The above sample is simplified as

```python3
output_planes = flexible_inference(src, network_path)  # type: list[vs.VideoNode]
```

📦 TRT

  • Upgraded to TensorRT [10.1.0](https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-1010/release-notes/index.html#rel-10-1-0).

📦 known issue

  • According to the [documentation](https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-1010/release-notes/index.html#rel-10-1-0), `There is an up to 4x performance regression for networks containing "GridSample" ops compared to TensorRT 9.2.` This affects RIFE and SAFA models.
  • vs-mlrt [v14.test3](https://github.com/AmusementClub/vs-mlrt/releases/tag/v14.test3) is the latest release that is not affected.

📦 Community contributions

  • Fix `multiple flexible_output_prop keyword argument` error by @LightArrowsEXE in https://github.com/AmusementClub/vs-mlrt/pull/97
  • Fix missing spaces in exceptions by @LightArrowsEXE in https://github.com/AmusementClub/vs-mlrt/pull/98
  • Full Changelog: https://github.com/AmusementClub/vs-mlrt/compare/v14...v15

v14: latest libraries
github-actions[bot] · April 25, 2024

📦 General

  • [External models](https://github.com/AmusementClub/vs-mlrt/releases/tag/external-models) are no longer packaged.

📦 vsmlrt.py

  • Plugin invocation order in the `get_plugin_path()` function is sorted to reduce memory consumption.
  • Added support for [RIFE v4.7 ~ v4.16](https://github.com/hzwer/Practical-RIFE/tree/6f9dc10493b9f15391b68a5002f8e5201159634e?tab=readme-ov-file#model-list) (lite, ensemble) models.
  • Added support for [SCUNet](https://github.com/cszn/SCUNet/tree/52e440a80a655b01e0b41e9dd9bfe599bc11625e) models for image denoising.

📦 plugin and runtime libraries

  • Upgraded to TensorRT [10.0.1](https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-1001/release-notes/index.html#rel-10-0-1).
  • Maxwell and Pascal GPUs are no longer supported. Other backends still support these GPUs.
  • Reduce GPU memory usage for dynamically shaped engines when the actual tile size is smaller than the maximum tile size set during engine building.
  • Reduced engine build time.
  • Added [long path](https://learn.microsoft.com/en-us/windows/win32/fileio/maximum-file-path-limitation?tabs=registry#enable-long-paths-in-windows-10-version-1607-and-later) support for engines on Windows.
  • cuDNN is no longer a strict runtime dependency.

📦 vsmlrt.py

  • The cuDNN tactic is no longer enabled by default.
  • TF32 acceleration is disabled by default.
  • The maximum workspace defaults to `None`, which stands for the total memory size of the GPU.
  • Add parameters `builder_optimization_level`, `max_aux_streams`, `bf16` (https://github.com/AmusementClub/vs-mlrt/issues/64), `custom_env`, `custom_args`, `short_path` and `engine_folder` (https://github.com/AmusementClub/vs-mlrt/issues/90):
  • `builder_optimization_level`: "adjust how long TensorRT should spend searching for tactics with potentially better performance" [link](https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-1001/developer-guide/index.html#opt-builder-optimization-level)
  • `max_aux_streams`: Within-inference multi-streaming, "if enabled, TensorRT will run some layers on the auxiliary streams in parallel to the layers running on the main stream, ..., may increase the memory consumption, ..." [link](https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-1001/developer-guide/index.html#within-inference-multi-streaming)
  • `bf16`: "TensorRT supports the bfloat16 (brain float) floating point format on NVIDIA Ampere and later architectures ... Note that not all layers support bfloat16." [link](https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-1001/developer-guide/index.html#bf16)
  • `custom_env`, `custom_args`: custom environment variable and arguments for trtexec engine build.
  • + 3 more

📦 known issues

  • According to the [documentation](https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-1001/release-notes/index.html#rel-10-0-1), `There is an up to 4x performance regression for networks containing "GridSample" ops compared to TensorRT 9.2.` This affects RIFE and SAFA models.
  • trtexec may report errors like:
  • `[E] Error[9]: Skipping tactic 0xded5318b4a444b84 due to exception Cask convolution execution`
  • `[E] Error[2]: [virtualMemoryBuffer.cpp::nvinfer1::StdVirtualMemoryBufferImpl::resizePhysical::140] Error Code 2: OutOfMemory (no further information)`
  • This issue has been submitted to NVIDIA.

📦 ORT

  • Upgraded to ONNX Runtime [v1.18.0](https://github.com/microsoft/onnxruntime/tree/e5947f57293045c60a7f44c5557bbc022be2f9e7).

📦 interface

  • The `ORT_*` backends now support fp16 I/O. The semantics of the `fp16` flag in these backends is as follows:
  • Enabling `fp16` will use a built-in quantization that converts a fp32 onnx to a fp16 onnx. If the input video is of half-precision floating-point format, the generated fp16 onnx will use fp16 input. The output format can be controlled by the `output_format` option (`0 = fp32, 1 = fp16`).
  • Disabling `fp16` will not use the built-in quantization. However, if the onnx file itself uses fp16 for computation, the actual computation will be done in fp16. In this case, the input video format should match the input format of the onnx, and the output format is inferred from the onnx.
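
The two cases above can be read as a small decision table. The sketch below restates them in plain Python; it is not the vsmlrt.py implementation, and `Fp16Plan` and `resolve_fp16` are hypothetical names. Two points are assumptions rather than statements from the notes: a non-half clip keeps an fp32 input when `fp16` is enabled, and a mismatched input format is reported as an error when `fp16` is disabled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fp16Plan:
    convert_model: bool  # whether the built-in fp32 -> fp16 conversion runs
    input_dtype: str     # "fp16" or "fp32"
    output_dtype: str    # "fp16" or "fp32"


def resolve_fp16(
    fp16: bool,
    clip_dtype: str,
    output_format: int,
    onnx_input_dtype: str,
    onnx_output_dtype: str,
) -> Fp16Plan:
    """Restate the documented `fp16` semantics of the `ORT_*` backends."""
    if fp16:
        # Built-in conversion: input follows the clip, output follows output_format.
        # Assumption: a clip that is not half precision keeps an fp32 input.
        input_dtype = "fp16" if clip_dtype == "fp16" else "fp32"
        output_dtype = "fp16" if output_format == 1 else "fp32"
        return Fp16Plan(True, input_dtype, output_dtype)
    # No conversion: the onnx file dictates both ends.
    # Assumption: a mismatch is treated as an error here.
    if clip_dtype != onnx_input_dtype:
        raise ValueError(
            f"clip is {clip_dtype} but the onnx expects {onnx_input_dtype} input"
        )
    return Fp16Plan(False, onnx_input_dtype, onnx_output_dtype)


if __name__ == "__main__":
    print(resolve_fp16(True, "fp16", 1, "fp32", "fp32"))
    print(resolve_fp16(False, "fp16", 0, "fp16", "fp16"))
```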

📦 CUDA

  • Reduced execution overhead.
  • Added support for TF32 acceleration. This is disabled by default.
  • Added experimental `prefer_nhwc` flag to reduce the number of layout transformations when using tensor cores. This is disabled by default.

📦 OV

  • Upgraded to OpenVINO [2024.2.0](https://github.com/openvinotoolkit/openvino/tree/4655dd6ce3fa7947d5f70c3552f31fdda50582d0).
  • Added experimental `OV_NPU` backend for Intel NPUs.

📦 MIGX

  • Added support for [MIGraphX](https://github.com/ROCm/AMDMIGraphX) backend for AMD GPUs. Currently this backend is Linux only.

📦 Community contributions

  • `scripts/vsmlrt.py`: update esrgan janai models by @hooke007 in https://github.com/AmusementClub/vs-mlrt/pull/53
  • `scripts/vsmlrt.py`: add more esrgan janai models by @hooke007 in https://github.com/AmusementClub/vs-mlrt/pull/82
  • `vsmigx`: allow fp16 input & output by @abihf in https://github.com/AmusementClub/vs-mlrt/pull/86
  • `scripts/vsmlrt.py`: fix fp16 precision issues of RIFE v2 representations by @charlessuh in https://github.com/AmusementClub/vs-mlrt/issues/66#issuecomment-1791986979

📦 Benchmark

  • NVIDIA GeForce RTX 3090, 10496 shaders @ 1695 MHz, driver 552.22, Windows Server 2022, Python 3.11.9, vapoursynth-classic R57.A8
  • 1920x1080 RGBS, TRT backend, CUDA graphs enabled, fp16
  • Measurements: FPS / Device Memory (MB)

| model | 1 stream | 2 streams | 3 streams |
|:---:|:---:|:---:|:---:|
| dpir color | 10.99 / 1715.172 | 11.62 / 3048.540 | 11.64 / 4381.912 |
| waifu2x upconv_7_{anime_style_art_rgb, photo} | 22.38 / 2016.352 | 32.66 / 3734.880 | 32.54 / 5453.404 |
| waifu2x cunet / cugan | 12.41 / 4359.284 | 15.53 / 8363.392 | 15.47 / 12367.504 |

  • + 5 more
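
One way to read the FPS / device memory pairs is to compare what a second stream buys against what it costs. The sketch below does that arithmetic for the three rows listed above; the numbers are copied from the table, while `second_stream_tradeoff` is a hypothetical helper and not part of vs-mlrt.

```python
# (FPS, device memory in MB) for 1 stream and 2 streams, copied from the table above.
RESULTS = {
    "dpir color": ((10.99, 1715.172), (11.62, 3048.540)),
    "waifu2x upconv_7": ((22.38, 2016.352), (32.66, 3734.880)),
    "waifu2x cunet / cugan": ((12.41, 4359.284), (15.53, 8363.392)),
}


def second_stream_tradeoff(one, two):
    """Return (FPS speedup, extra device memory in MB) of 2 streams over 1."""
    (fps1, mem1), (fps2, mem2) = one, two
    return fps2 / fps1, mem2 - mem1


if __name__ == "__main__":
    for model, (one, two) in RESULTS.items():
        speedup, extra_mb = second_stream_tradeoff(one, two)
        print(f"{model}: {speedup:.2f}x FPS for {extra_mb:.0f} MB more")
```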

v14.test4: latest TensorRT and ONNX Runtime libraries (Pre-release)
github-actions[bot] · March 27, 2024

📋 Changes

  • __The `TRT` backend no longer supports Maxwell and Pascal GPUs__. Other backends still support these GPUs. Same as those releases, the current release requires driver version >= 525.
  • Added support for [SwinIR](https://github.com/JingyunLiang/SwinIR/tree/6545850fbf8df298df73d81f3e8cba638787c8bd) models for image restoration, which are only supported by the `TRT` backend and the `ORT_CPU` backend from vs-mlrt v14.test4 or later. SwinIR-M and SwinIR-L models exhibit precision issues with the fp16 implementation; this is under investigation.
  • Added support for [SCUNet](https://github.com/cszn/SCUNet/tree/52e440a80a655b01e0b41e9dd9bfe599bc11625e) models for image denoising, which are only supported by the `TRT` backend and the `ORT_CPU` backend from vs-mlrt v14.test4 or later.
  • Added `engine_folder` argument to the `TRT` backend in vsmlrt.py to specify custom directory for engines.
  • Starting with this pre-release, for dynamically shaped engines, the trt runtime allocates gpu memory based on the actual tile size, whereas in previous releases, the runtime would have to allocate gpu memory based on the maximum tile size set at engine compile time. This feature requires TensorRT 10 or later.
  • The `ORT_*` backends now support fp16 I/O. The semantics of the `fp16` flag is as follows:
  • Enabling `fp16` will use a built-in quantization that converts a fp32 onnx to a fp16 onnx. If the input video is of half-precision floating-point format, the generated fp16 onnx will use fp16 input. The output format can be controlled by the `output_format` option (`0 = fp32, 1 = fp16`).
  • Disabling `fp16` will not use the built-in quantization. However, if the onnx file itself uses fp16 for computation, the actual computation will be done in fp16. In this case, the input video format should match the input format of the onnx, and the output format is inferred from the onnx.
  • + 5 more
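
The `fp16` rules for the `ORT_*` backends listed above can be read as a small decision table. The following is a minimal sketch of that logic in plain Python, not code taken from vsmlrt.py: `Precision`, `resolve_precision`, and the string labels are hypothetical names, and the behaviour for an fp32 clip with `fp16` enabled is an assumption inferred from the wording of the notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Precision:
    compute: str       # precision used for the actual computation
    input_dtype: str   # precision expected for the input clip
    output_dtype: str  # precision of the output clip


def resolve_precision(
    fp16: bool,
    clip_dtype: str,
    output_format: int = 0,
    onnx: Precision = Precision("fp32", "fp32", "fp32"),
) -> Precision:
    """Hypothetical helper mirroring the fp16 rules in the release notes."""
    if fp16:
        # Built-in conversion: the fp32 onnx becomes an fp16 onnx.
        # Input follows the clip (assumption: an fp32 clip keeps fp32 input);
        # output follows output_format (0 = fp32, 1 = fp16).
        output_dtype = {0: "fp32", 1: "fp16"}[output_format]
        return Precision("fp16", clip_dtype, output_dtype)
    # No conversion: the onnx file decides everything, and the clip must match.
    if clip_dtype != onnx.input_dtype:
        raise ValueError(f"clip is {clip_dtype}, model expects {onnx.input_dtype}")
    return onnx


print(resolve_precision(fp16=True, clip_dtype="fp16", output_format=1))
print(resolve_precision(fp16=False, clip_dtype="fp16",
                        onnx=Precision("fp16", "fp16", "fp16")))
```

The sketch raises an error on a format mismatch when `fp16` is disabled, which is one possible reading of "the input video format should match"; the real backend may behave differently.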

📦 benchmark 1

  • [previous benchmark](https://github.com/AmusementClub/vs-mlrt/releases/tag/v14.test3)
  • RTX 4090
  • processor clock @ 2520 MHz
  • Intel Icelake server @ 2100 MHz
  • Driver 551.86
  • Windows 10 21H2 (19044.1415)
  • TensorRT 10.0.0
  • VapourSynth-Classic R57.A8, vapoursynth-plugin v0.96g3
  • + 2 more

📦 general

| model | 1 stream | 2 streams | 3 streams |
|:---------------------------------------------:|:-----------------:|:----------------:|:-----------------:|
| dpir gray | 22.05 / 1818.796 | 25.30 / 3111.114 | 25.33 / 4403.488 |
| dpir color | 18.30 / 1851.632 | 25.13 / 3176.808 | 25.17 / 4501.984 |
| | | | |
| waifu2x upconv_7_{anime_style_art_rgb, photo} | 20.45 / 2148.716 | 41.22 / 3867.240 | 61.21 / 5585.764 |
| waifu2x upresnet10 | 17.91 / 1716.588 | 34.53 / 2941.540 | 42.33 / 4166.492 |
| waifu2x cunet / cugan | 13.89 / 4391.292 | 25.74 / 8346.248 | 25.96 / 12301.202 |

  • + 11 more

📦 rife

  • v2, fp16 i/o

| version | 1 stream | 2 streams | 3 streams | 4 streams | 5 streams |
|:-----------------------------------------------:|:--------------:|:--------------:|:---------------:|:---------------:|:---------------:|
| v4.4-v4.5 | 136.92/778.432 | 273.80/1149.204 | 414.80/1522.028 | 553.70/1892.796 | 574.31/2263.568 |
| v4.6 | 136.01/800.960 | 275.26/1192.212 | 411.01/1585.516 | 544.30/1979.764 | 550.01/2368.020 |
| v4.7-v4.9 | 98.20/1302.724 | 195.78/2187.548 | 210.12/3074.420 | 210.45/3957.196 | 210.66/4844.068 |
| v4.10-v4.15 | 84.41/1595.592 | 160.93/2773.280 | 161.96/3953.020 | 162.04/5132.760 | 162.07/6310.448 |
| {v4.12, v4.13, v4.15, v4.16}_lite | 93.39/1333.444 | 187.32/2255.132 | 197.71/3178.872 | 198.01/4098.508 | 197.95/5022.248 |

  • + 1 more

📦 benchmark 2

  • [previous benchmark](https://github.com/AmusementClub/vs-mlrt/releases/tag/v13.2)
  • NVIDIA GeForce RTX 3090, 10496 shaders @ 1695 MHz, driver 552.22, Windows Server 2022, Python 3.11.9, vapoursynth-classic R57.A8
  • Measurements: (1080p, fp16) FPS / Device Memory (MB)

| model | ORT_CUDA NCHW | ORT_CUDA NHWC | ORT_DML |
|:---------------------------------:|:----------------:|:---------------:|:--------------:|
| dpir color | 4.54 / 2573.3 | 5.98 / 2470.9 | 8.45 / 2364.5 |
| dpir color (2 streams) | 4.66 / 4854.9 | 6.30 / 4680.8 | 9.48 / 4630.9 |
| waifu2x upconv7 | 10.98 / 5432.5 | 3.18 / 3017.8 | 12.48 / 4493.0 |

  • + 9 more

📦 benchmark 3

  • NVIDIA GeForce RTX 2080 Ti, 4352 shaders @ 1700 MHz, driver 552.22, Windows 10 LTSC 21H2 (19044.1415), Python 3.11.9, vapoursynth-classic R57.A8
  • Measurements: (1080p, fp16) FPS / Device Memory (MB)

| model | TRT | ORT_CUDA | ORT_DML | ORT_CUDA NHWC |
|:------------------------------------------:|:-------------:|:--------------:|:------------:|:-------------:|
| dpir color (1 stream) | 7.08 / 1899 | 3.10 / 2602 | 4.99 / 2341 | 4.26 / 2411 |
| dpir color (2 streams) | 8.06 / 3376 | 3.30 / 5016 | 5.85 / 4619 | 4.74 / 4650 |
| | | | | |
| waifu2x upconv7 (1 stream) | 11.47 / 2014 | 7.01 / 4949 | 7.45 / 4501 | 1.59 / 2923 |

  • + 20 more
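
Each cell in the benchmark 3 table packs two measurements, FPS and device memory in MB, separated by a slash. As a rough illustration of how to read a row, the sketch below splits one row from that table and compares each backend against `TRT`. The helper names `parse_cell` and `parse_row` are hypothetical and exist only for this example.

```python
def parse_cell(cell: str) -> tuple:
    """Split an 'FPS / memory' cell into two floats."""
    fps, mem = (float(part) for part in cell.split("/"))
    return fps, mem


def parse_row(row: str, backends: list) -> dict:
    """Map each backend name to its (fps, memory_mb) pair for one table row."""
    cells = [c.strip() for c in row.strip().strip("|").split("|")]
    # cells[0] is the model name; the remaining cells follow the header order.
    return {name: parse_cell(cell) for name, cell in zip(backends, cells[1:])}


BACKENDS = ["TRT", "ORT_CUDA", "ORT_DML", "ORT_CUDA NHWC"]
ROW = "| dpir color (1 stream) | 7.08 / 1899 | 3.10 / 2602 | 4.99 / 2341 | 4.26 / 2411 |"

row = parse_row(ROW, BACKENDS)
trt_fps, trt_mem = row["TRT"]
for name, (fps, mem) in row.items():
    # Ratio above 1 means TRT is faster than the backend on this row.
    print(f"{name:>14}: {fps:5.2f} fps, {mem:6.0f} MB, TRT/backend = {trt_fps / fps:.2f}x")
```
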
v14.test3: latest TensorRT, MIGraphX backend · v14.test3 · Pre-release
github-actions[bot] · 2y ago · December 3, 2023
GitHub

📋 Changes

  • As with the earlier v14 pre-releases, this release requires Pascal GPUs or later (10 series+) and driver version >= 525. Support for Kepler 2.0 and Maxwell GPUs is dropped.
  • TensorRT 9.2.0 is officially documented as `for Large Language Models (LLMs) on NVIDIA A100, A10G, L4, L40, L40S, H100 GPUs, and NVIDIA GH200 Grace Hopper™ Superchip only`. The Windows build is downloaded from [here](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/9.2.0/tensorrt-9.2.0.5.windows10.x86_64.cuda-12.2.llm.beta.zip), and can be used on other GPU models.
  • Users should use the same version of TensorRT as provided (9.2.0) because runtime version checking is [disabled](https://github.com/AmusementClub/vs-mlrt/commit/09a26ecec48c1b015bc1537f968db19473ea69b8) in this release.
  • Added support for [AnimeJaNai V3](https://github.com/the-database/mpv-upscale-2x_animejanai/releases/tag/3.0.0) models, contributed by @hooke007 in https://github.com/AmusementClub/vs-mlrt/pull/82.
  • Added support for [RIFE v4.13 ~ v4.16](https://github.com/hzwer/Practical-RIFE/tree/6f9dc10493b9f15391b68a5002f8e5201159634e?tab=readme-ov-file#model-list) (lite, ensemble) models, which are also available for previous vs-mlrt releases (simply download the new model file [here](https://github.com/AmusementClub/vs-mlrt/releases/tag/external-models) and update `vsmlrt.py`).
  • The v4.13 ~ v4.15 models should have the same execution speed as the v4.10 - v4.12 models.
  • The v4.13 lite model, the v4.15 lite model and the v4.16 lite model should all have the same execution speed as the v4.12 lite model, while the v4.14 lite model may run slower.
  • Added support for fractional video frame interpolation in RIFE.
  • + 9 more
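
Fractional interpolation means the output frame rate is a non-integer multiple of the source rate, so each output frame falls at some fractional position between two source frames. The sketch below shows one common way to compute that position with exact rational arithmetic. It is an illustration under that assumption, not the logic used by vsmlrt.py, and `locate` is a hypothetical helper name.

```python
from fractions import Fraction


def locate(k: int, multi: Fraction) -> tuple:
    """Return (left source frame index, timestep in [0, 1)) for output frame k."""
    position = Fraction(k) / multi                       # position on the source timeline
    left = position.numerator // position.denominator   # floor of the position
    return left, position - left


# Example: a 2.5x rate increase, e.g. 24 fps to 60 fps.
multi = Fraction(5, 2)
for k in range(6):
    left, t = locate(k, multi)
    print(f"output {k}: between source {left} and {left + 1}, timestep {t}")
```

Using `Fraction` instead of floats keeps the timesteps exact, so output frames that coincide with a source frame get a timestep of exactly 0.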

📦 benchmark

  • RTX 4090
  • processor clock @ 2520 MHz
  • Intel Icelake server @ 2100 MHz
  • Driver 551.86
  • Windows 10 21H2 (19044.1415)
  • TensorRT 9.2.0
  • VapourSynth-Classic R57.A8, vapoursynth-plugin v0.96g3
  • 1920x1080 rgbs, CUDA graphs enabled, fp16
  • + 1 more

📦 general

| model | 1 stream | 2 streams | 3 streams |
|:-----------------------------------------------:|:--------------:|:--------------:|:---------------:|
| dpir gray | 21.93/1757.352 | 25.48/3049.696 | 25.31/4342.044 |
| dpir color | 18.24/1790.184 | 25.11/3115.360 | 25.22/4440.540 |
| | | | |
| waifu2x upconv_7_{anime_style_art_rgb, photo} | 19.58/2148.716 | 39.87/3867.240 | 59.94/5585.768 |
| waifu2x upresnet10 | 17.40/1655.144 | 34.22/2880.096 | 42.78/4105.048 |
| waifu2x cunet / cugan | 13.64/4391.292 | 25.09/8346.248 | 25.19/12301.208 |

  • + 3 more

📦 rife

  • v2, fp16 i/o

| version | 1 stream | 2 streams | 3 streams | 4 streams | 5 streams |
|:-----------------------------------------------:|:--------------:|:--------------:|:---------------:|:---------------:|:---------------:|
| v4.4-v4.5 | 150.20/622.784 | 301.05/835.860 | 448.90/1053.024 | 615.84/1268.152 | 787.57/1481.224 |
| v4.6 | 147.63/624.832 | 294.53/837.904 | 452.26/1055.072 | 603.63/1270.200 | 764.31/1485.320 |
| v4.7-v4.9 | 132.06/747.712 | 268.63/1075.476 | 403.54/1405.284 | 494.98/1737.152 | 496.41/2064.908 |
| v4.10-v4.15 | 119.09/862.400 | 238.68/1304.852 | 346.98/1749.352 | 349.48/2195.904 | 349.80/2638.356 |
| {v4.12, v4.13, v4.15, v4.16}_lite | 123.72/782.528 | 250.81/1151.252 | 377.27/1522.020 | 403.14/1894.844 | 403.79/2263.568 |

  • + 4 more
v14.test2: latest TensorRT library · v14.test2 · Pre-release
github-actions[bot] · 2y ago · October 23, 2023
GitHub

📋 Changes

  • As with the `v14.test` release, this release requires Pascal GPUs or later (10 series+) and driver version >= 525. Support for Kepler 2.0 and Maxwell GPUs is dropped.
  • TensorRT 9.1.0 is officially documented as `for Large Language Models (LLMs) on NVIDIA A100, A10G, L4, L40, L40S, H100 GPUs, and NVIDIA GH200 Grace Hopper™ Superchip only` on Linux. The Windows build is downloaded from [here](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/secure/9.1.0/tars/tensorrt-9.1.0.4.windows10.x86_64.cuda-12.2.llm.beta.zip), and can be used on other GPU models.
  • ~On Windows, some users have reported crashes when using it in mpv~ ([#65](https://github.com/AmusementClub/vs-mlrt/discussions/65)). This problem occurs on an earlier version of this release, which is now fixed.
  • Add parameters `bf16` (https://github.com/AmusementClub/vs-mlrt/issues/64), `custom_env` and `custom_args` to the `TRT` backend.
  • fp16 execution of `Waifu2xModel.swin_unet_art` is more accurate, faster and uses less GPU memory than bf16 execution ([benchmark](https://github.com/AmusementClub/vs-mlrt/wiki/NVIDIA-GeForce-RTX-4090#waifu2xswin_unet_art))
  • Device memory usage of model `Waifu2xModel.swin_unet_art` is reduced compared to TensorRT 9.0.1 on an A10G with 1080p input (2.66 fps with 7.0GB VRAM usage) when using the default auxiliary stream heuristic.
  • TensorRT 9.0.1 uses 7 auxiliary streams, compared to 3 streams for TensorRT 9.1.0, which results in significantly higher device memory usage with no performance gain.
  • Setting `max_aux_streams=3` lowers device memory usage of TensorRT 9.0.1 to ~8.9GB, and `max_aux_streams=0` corresponds to ~7.3GB usage.
  • + 17 more
v14.test: latest TensorRT library · v14.test · Pre-release
github-actions[bot] · 3y ago · March 13, 2023
GitHub

📋 Changes

  • It requires Pascal GPUs or later (10 series+) and driver version >= 525. Support for Kepler 2.0 and Maxwell GPUs is dropped.
  • Add parameters `builder_optimization_level` and `max_aux_streams` to the `TRT` backend.
  • `builder_optimization_level`: "adjust how long TensorRT should spend searching for tactics with potentially better performance" [link](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#opt-builder-optimization-level)
  • `max_aux_streams`: Within-inference multi-streaming, "if enabled, TensorRT will run some layers on the auxiliary streams in parallel to the layers running on the main stream, ..., may increase the memory consumption, ..." [link](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#within-inference-multi-streaming)
  • __It is advised to lower `max_aux_streams` to 0 on heavy models like `Waifu2xModel.swin_unet_art` to reduce memory usage.__ Check the benchmark data at the bottom.
  • Following TensorRT 8.6.1, `cudnn` tactic source of the `TRT` backend is disabled by default. `tf32` is also disabled by default in vsmlrt.py.
  • Add parameter `short_path` to the `TRT` backend, which shortens engine path and is enabled on Windows by default.
  • Model `Waifu2xModel.swin_unet_art` does not seem to work with `builder_optimization_level=5` from the `TRT` backend before TRT 9.0. Use `builder_optimization_level=4` or lower instead.
  • + 5 more
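
The advice in these notes for `Waifu2xModel.swin_unet_art` (lower `max_aux_streams` to 0, and stay at `builder_optimization_level=4` or lower before TRT 9.0) can be summarised as a small rule. The sketch below encodes it in plain Python. `tune_trt_options` is a hypothetical helper, not part of vsmlrt.py; the dictionary keys simply reuse the `TRT` backend parameter names mentioned above, and `None` is used here as this sketch's own convention for "leave the choice to TensorRT".

```python
from typing import Optional


def tune_trt_options(
    model: str,
    trt_version: tuple,
    builder_optimization_level: int,
    max_aux_streams: Optional[int] = None,
) -> dict:
    """Hypothetical helper that applies the advice from the notes above."""
    options = {
        "builder_optimization_level": builder_optimization_level,
        "max_aux_streams": max_aux_streams,
    }
    if model == "swin_unet_art":
        # Heavy model: disable auxiliary streams unless the caller chose a value.
        if max_aux_streams is None:
            options["max_aux_streams"] = 0
        # Level 5 is reported not to work with this model before TRT 9.0.
        if trt_version < (9, 0):
            options["builder_optimization_level"] = min(builder_optimization_level, 4)
    return options


print(tune_trt_options("swin_unet_art", (8, 6), builder_optimization_level=5))
print(tune_trt_options("swin_unet_art", (9, 1), builder_optimization_level=5))
```

The resulting values would then be passed to the `TRT` backend by the caller; that step is left out because it depends on the installed vsmlrt.py version.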