ashvardanian/ParallelReductionsBenchmark
Thrust, CUB, TBB, AVX2, AVX-512, CUDA, OpenCL, OpenMP, Metal, and Rust - all it takes to sum a lot of numbers fast!
21 Releases
Latest: 11mo ago
Release v0.7.2 (Latest)
📦 Patch
- Make: Correct Metal/OpenCL associations (b282335)
- Make: Freeze Rust toolchain to `nightly` (9de592d)
Release v0.7.1
📦 Patch
- Improve: `ChillOut` to send/sync pointers (a0986f2)
Release v0.7.0
📦 Minor
- Add: `oneapi::tbb::task_arena` backend (bb6602b)
📦 Patch
- Docs: Put build notes in the end (dc6e246)
- Make: Fetch CLI options (33c6b59)
- Fix: Guard OpenMP calls (2056636)
- Fix: `serial` / `unrolled` kernel naming (ff715f6)
Release v0.6.1
📦 Patch
- Docs: Apple M2 Pro small results (9788a53)
- Make: Bump Fork Union with CrossBeam (860b5eb)
- Improve: Use `std::hardware_destructive_interference_size` (13e6cf9)
Release v0.6.0
📦 Minor
- Add: Fork Union Rust benchmarks (9ec7f0c)
📦 Patch
- Docs: Rust results vs Rayon, Tokio, Smol.rs (9cd6268)
- Fix: Wait for Smol.rs results (48ab934)
- Fix: Cast pointers to `usize` (ed113e9)
- Improve: Reuse partial sums buffer (ca2b2c6)
- Make: Pass small size to C++ bench (1d17f93)
- Fix: Rust drafts (8e11aea)
- Make: Rust draft (b3ddf30)
Release v0.5.3
📦 Patch
- Make: Reuse Taskflow executor (2daf80c)
- Make: Skip Taskflow example builds (450093e)
- Fix: Unrolled loop length (7c1b859)
Release v0.5.2
📦 Patch
- Docs: Polish namespaces! (5fa1c2b)
Release v0.5.1
📦 Patch
- Docs: Shrink logs column widths (7ec74f8)
Release v0.5.0
📦 Minor
- Add: Taskflow backend (0514627)
📦 Patch
- Docs: Graviton Taskflow vs OpenMP vs Fork Union (c1862d9)
- Docs: OpenMP vs Fork Union (8ed7080)
Release v0.4.4
📦 Patch
- Improve: Over-align partial sums to avoid false-sharing (c2b3284)
- Improve: Slicing with `scalars_per_core` (1259361)
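
The two improvements above can be illustrated with a minimal sketch. It is not the repository's code: `padded_sum_t`, `parallel_sum`, and the 64-byte `cache_line_k` are hypothetical names and values chosen for illustration, and the meaning of `scalars_per_core` (a ceil-divided slice length per thread) is an assumption. The idea is that each thread writes its partial sum into a slot that occupies a whole cache line, so neighboring threads do not invalidate each other's lines. It assumes C++17, which is needed for `std::vector` to honor the over-alignment.

```cpp
#include <algorithm> // std::min
#include <cassert>   // assert
#include <cstddef>   // std::size_t
#include <thread>    // std::thread
#include <vector>    // std::vector

// Assumption: 64 bytes is a typical cache-line size; some CPUs use 128.
constexpr std::size_t cache_line_k = 64;

// Hypothetical type: one partial sum per cache line to avoid false sharing.
struct alignas(cache_line_k) padded_sum_t {
    double value = 0.0;
};

// Hypothetical helper: sums `count` floats using `threads_count` threads.
inline double parallel_sum(float const *data, std::size_t count, std::size_t threads_count) {
    assert(threads_count > 0);
    std::vector<padded_sum_t> partial_sums(threads_count);

    // Ceil division, so the slices together cover every scalar.
    std::size_t const scalars_per_core = (count + threads_count - 1) / threads_count;

    std::vector<std::thread> workers;
    workers.reserve(threads_count);
    for (std::size_t t = 0; t < threads_count; ++t) {
        workers.emplace_back([&, t] {
            // Clamp both ends, so trailing threads may receive an empty slice.
            std::size_t const begin = std::min(t * scalars_per_core, count);
            std::size_t const end = std::min(begin + scalars_per_core, count);
            double local = 0.0; // accumulate locally, write the shared slot once
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            partial_sums[t].value = local;
        });
    }
    for (auto &worker : workers) worker.join();

    // Final serial reduction over the per-thread results.
    double total = 0.0;
    for (auto const &partial : partial_sums) total += partial.value;
    return total;
}
```

A call such as `parallel_sum(values.data(), values.size(), std::thread::hardware_concurrency())` would be a typical use, provided the reported concurrency is non-zero. Where the standard library provides it, `std::hardware_destructive_interference_size` from `<new>` could replace the hard-coded constant.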
Release v0.4.3
📦 Patch
- Make: VS Code launchers for macOS (1198e49)
- Fix: Avoid deprecated 32-bit API (71de4ea)
- Fix: `openmp_gt` start offset (d19038f)
- Docs: Apple M2 Pro results (a8d8360)
- Docs: Compiling on macOS (c86a8dd)
- Make: Skip BLAS on macOS (e73f155)
Release v0.4.2
📦 Patch
- Fix: OpenMP split size (d83bb02)
- Fix: Use <Accelerate> on macOS (e4f601b)
- Make: Avoid CUDA on non-Linux builds (975d013)
Release v0.4.1
📦 Patch
- Docs: Single-socket Graviton4 results (c923a60)
v0.4: Arm NEON, SVE, and OpenMP-like Pools (v0.4.0)
📦 Minor
- Add: `fork_union` parallel version (c40b7f3)
- Add: Generic OpenMP pool (cce8be4)
- Add: NEON and SVE kernels (32d7d3e)
📦 Patch
- Docs: Remove namespace nesting (5e73225)
- Make: Formatting CMake (82119ea)
- Docs: Ordering header notes (a7891c7)
Release v0.3.5
📦 Patch
- Docs: Mark programming language used (8680dcc)
v0.3.4: Nvidia H200 benchmarks
📦 Patch
- Docs: H200 benchmarks (4b877b9)
- Fix: Missing `blasint` type (fad513b)
Release v0.3.3
📦 Patch
- Fix: Measuring AVX-512 throughput (a301c39)
Release v0.3.2
📦 Patch
- Docs: Need CUDA Toolkit (6ca84ed)
- Improve: Disambiguate buffer length (86a9794)
- Fix: `dataset_t` move-constructor (434cad7)
- Fix: `memset` entire dataset (e70d6a8)
- Fix: Pattern-matching version in CMake (458342d)
Release v0.3.1
📦 Patch
- Docs: List notable features (c340f4c)
Release v0.3.0
📦 Minor
- Add: BLAS with zero-stride (c39768e)
- Add: Latency Hiding in AVX-512 (9a5a61a)
📦 Patch
- Fix: `fmt::print` thousand separator (6250d0f)
- Fix: Detecting NUMA (8da88f6)
- Improve: Uniform benchmark naming (567d2e8)
- Improve: Scaling CUDA kernels (e57eff0)
- Make: Find `libnuma` (d0f74e4)
- Improve: Huge Pages and code style (aeabdce)
- Improve: `setzero` over `set1` intrinsics (1fbcd94)
- Fix: Handling SSE tail (2731049)
- + 3 more
Release v0.2.0
📦 Minor
- Add: Metal draft (db0dc92)
📦 Patch
- Docs: File headers (e1ac216)
- Make: Switch to `.cuh` extension for CUDA (37c0581)
- Make: Drop deprecated assets (f311a61)
- Fix: Missing `fmt` include for OpenCL (0c0bc0f)
- Fix: Feature-testing `__cpp_lib_execution` (0c80b36)
- Make: Switch to `.cuh` extension for CUDA (818851f)
