ashvardanian/ParallelReductionsBenchmark

Thrust, CUB, TBB, AVX2, AVX-512, CUDA, OpenCL, OpenMP, Metal, and Rust - all it takes to sum a lot of numbers fast!

21 Releases
Latest: v0.7.2 (July 22, 2025)

Release v0.7.2 (Latest)
ashvardanian · July 22, 2025

📦 Patch

  • Make: Correct Metal/OpenCL associations (b282335)
  • Make: Freeze Rust toolchain to `nightly` (9de592d)

Release v0.7.1
ashvardanian · May 29, 2025

📦 Patch

  • Improve: `ChillOut` to send/sync pointers (a0986f2)

Release v0.7.0
ashvardanian · May 23, 2025

📦 Minor

  • Add: `oneapi::tbb::task_arena` backend (bb6602b)

📦 Patch

  • Docs: Put build notes in the end (dc6e246)
  • Make: Fetch CLI options (33c6b59)
  • Fix: Guard OpenMP calls (2056636)
  • Fix: `serial` / `unrolled` kernel naming (ff715f6)

Release v0.6.1
ashvardanian · May 22, 2025

📦 Patch

  • Docs: Apple M2 Pro small results (9788a53)
  • Make: Bump Fork Union with CrossBeam (860b5eb)
  • Improve: Use `std::hardware_destructive_interference_size` (13e6cf9)

Release v0.6.0
ashvardanian · May 20, 2025

📦 Minor

  • Add: Fork Union Rust benchmarks (9ec7f0c)

📦 Patch

  • Docs: Rust results vs Rayon, Tokio, Smol.rs (9cd6268)
  • Fix: Wait for Smol.rs results (48ab934)
  • Fix: Cast pointers to `usize` (ed113e9)
  • Improve: Reuse partial sums buffer (ca2b2c6)
  • Make: Pass small size to C++ bench (1d17f93)
  • Fix: Rust drafts (8e11aea)
  • Make: Rust draft (b3ddf30)

Release v0.5.3
ashvardanian · May 19, 2025

📦 Patch

  • Make: Reuse Taskflow executor (2daf80c)
  • Make: Skip Taskflow example builds (450093e)
  • Fix: Unrolled loop length (7c1b859)
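
The note above does not say what was wrong with the loop length, so the following is only a minimal sketch of the general pattern an unrolled reduction has to get right: the unrolled body may cover only a multiple of the unroll factor, and the remaining elements need a scalar tail. The function name `sum_unrolled` and the unroll factor of 4 are illustrative assumptions, not the repository's kernel.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not the repository's API: sums `length` floats using
// four independent accumulators, then handles the leftover elements one by one.
inline float sum_unrolled(float const *data, std::size_t length) {
    assert(data != nullptr || length == 0);

    // Largest multiple of 4 that fits into `length`; the unrolled loop must stop here.
    std::size_t const unrolled_length = length - length % 4;

    float a0 = 0.f, a1 = 0.f, a2 = 0.f, a3 = 0.f;
    std::size_t i = 0;
    for (; i < unrolled_length; i += 4) {
        a0 += data[i + 0];
        a1 += data[i + 1];
        a2 += data[i + 2];
        a3 += data[i + 3];
    }

    // Combine the partial accumulators, then add the tail of up to 3 elements.
    float total = (a0 + a1) + (a2 + a3);
    for (; i < length; ++i) total += data[i];
    return total;
}
```

Calling `sum_unrolled(values, count)` with a `count` that is not divisible by 4 exercises both loops. Note that the independent accumulators change the order of floating-point additions, so results can differ slightly from a strictly serial sum on arbitrary inputs.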

Release v0.5.2
ashvardanian · May 19, 2025

📦 Patch

  • Docs: Polish namespaces! (5fa1c2b)

Release v0.5.1
ashvardanian · May 19, 2025

📦 Patch

  • Docs: Shrink logs column widths (7ec74f8)

Release v0.5.0
ashvardanian · May 19, 2025

📦 Minor

  • Add: Taskflow backend (0514627)

📦 Patch

  • Docs: Graviton Taskflow vs OpenMP vs Fork Union (c1862d9)
  • Docs: OpenMP vs Fork Union (8ed7080)

Release v0.4.4
ashvardanian · May 19, 2025

📦 Patch

  • Improve: Over-align partial sums to avoid false-sharing (c2b3284)
  • Improve: Slicing with `scalars_per_core` (1259361)
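
As a rough illustration of the two changes above, the sketch below gives every thread its own cache-line-aligned partial sum and a contiguous slice of the input. It is not the repository's code: `padded_sum_t`, `sum_with_padded_partials`, and the fixed 64-byte line size are assumptions made for the example (the v0.6.1 notes mention `std::hardware_destructive_interference_size` as the constant used later). It needs C++17 so that `std::vector` honors the over-alignment.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Assumption: 64-byte cache lines, typical for x86 and many Arm cores.
constexpr std::size_t cache_line_bytes = 64;

// Over-aligned slot: each partial sum occupies a whole cache line,
// so two threads never write to the same line (no false sharing).
struct alignas(cache_line_bytes) padded_sum_t {
    double value = 0.0;
};

// Hypothetical helper, not the repository's API.
inline double sum_with_padded_partials(std::vector<float> const &data, std::size_t threads_count) {
    assert(threads_count > 0);
    std::vector<padded_sum_t> partials(threads_count);

    // Contiguous slice per thread, rounded up so the whole input is covered.
    std::size_t const scalars_per_thread = (data.size() + threads_count - 1) / threads_count;

    std::vector<std::thread> threads;
    threads.reserve(threads_count);
    for (std::size_t t = 0; t < threads_count; ++t) {
        threads.emplace_back([&data, &partials, scalars_per_thread, t] {
            std::size_t const begin = std::min(t * scalars_per_thread, data.size());
            std::size_t const end = std::min(begin + scalars_per_thread, data.size());
            // Accumulate directly into this thread's own padded slot.
            for (std::size_t i = begin; i < end; ++i) partials[t].value += data[i];
        });
    }
    for (std::thread &thread : threads) thread.join();

    // Final serial reduction over the per-thread partial sums.
    double total = 0.0;
    for (padded_sum_t const &partial : partials) total += partial.value;
    return total;
}
```

Without the `alignas`, adjacent `double` partials would share a cache line and every update would invalidate that line on the other cores. The padding trades a few dozen bytes per thread for independent writes.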

Release v0.4.3
ashvardanian · May 19, 2025

📦 Patch

  • Make: VS Code launchers for macOS (1198e49)
  • Fix: Avoid deprecated 32-bit API (71de4ea)
  • Fix: `openmp_gt` start offset (d19038f)
  • Docs: Apple M2 Pro results (a8d8360)
  • Docs: Compiling on macOS (c86a8dd)
  • Make: Skip BLAS on macOS (e73f155)

Release v0.4.2
ashvardanian · May 19, 2025

📦 Patch

  • Fix: OpenMP split size (d83bb02)
  • Fix: Use <Accelerate> on macOS (e4f601b)
  • Make: Avoid CUDA on non-Linux builds (975d013)

Release v0.4.1
ashvardanian · May 3, 2025

📦 Patch

  • Docs: Single-socket Graviton4 results (c923a60)

v0.4: Arm NEON, SVE, and OpenMP-like Pools (v0.4.0)
ashvardanian · May 3, 2025

📦 Minor

  • Add: `fork_union` parallel version (c40b7f3)
  • Add: Generic OpenMP pool (cce8be4)
  • Add: NEON and SVE kernels (32d7d3e)

📦 Patch

  • Docs: Remove namespace nesting (5e73225)
  • Make: Formatting CMake (82119ea)
  • Docs: Ordering header notes (a7891c7)

Release v0.3.5
ashvardanian · February 23, 2025

📦 Patch

  • Docs: Mark programming language used (8680dcc)

v0.3.4: Nvidia H200 benchmarks
ashvardanian · January 31, 2025

📦 Patch

  • Docs: H200 benchmarks (4b877b9)
  • Fix: Missing `blasint` type (fad513b)

Release v0.3.3
ashvardanian · January 24, 2025

📦 Patch

  • Fix: Measuring AVX-512 throughput (a301c39)

Release v0.3.2
ashvardanian · January 19, 2025

📦 Patch

  • Docs: Need CUDA Toolkit (6ca84ed)
  • Improve: Disambiguate buffer length (86a9794)
  • Fix: `dataset_t` move-constructor (434cad7)
  • Fix: `memset` entire dataset (e70d6a8)
  • Fix: Pattern-matching version in CMake (458342d)

Release v0.3.1
ashvardanian · January 19, 2025

📦 Patch

  • Docs: List notable features (c340f4c)

Release v0.3.0
ashvardanian · January 19, 2025

📦 Minor

  • Add: BLAS with zero-stride (c39768e)
  • Add: Latency Hiding in AVX-512 (9a5a61a)

📦 Patch

  • Fix: `fmt::print` thousand separator (6250d0f)
  • Fix: Detecting NUMA (8da88f6)
  • Improve: Uniform benchmark naming (567d2e8)
  • Improve: Scaling CUDA kernels (e57eff0)
  • Make: Find `libnuma` (d0f74e4)
  • Improve: Huge Pages and code style (aeabdce)
  • Improve: `setzero` over `set1` intrinsics (1fbcd94)
  • Fix: Handling SSE tail (2731049)
  • (3 more changes not shown)

Release v0.2.0
ashvardanian · January 18, 2025

📦 Minor

  • Add: Metal draft (db0dc92)

📦 Patch

  • Docs: File headers (e1ac216)
  • Make: Switch to `.cuh` extension for CUDA (37c0581)
  • Make: Drop deprecated assets (f311a61)
  • Fix: Missing `fmt` include for OpenCL (0c0bc0f)
  • Fix: Feature-testing `__cpp_lib_execution` (0c80b36)
  • Make: Switch to `.cuh` extension for CUDA (818851f)