Perf cpp
Lightweight recording and sampling of performance counters for specific code segments directly from your C++ application.
[Quick Start](#quick-start) | [Building](#building) | [Documentation](https://jmuehlig.github.io/perf-cpp) | [System Requirements](#system-requirements)

The project is written primarily in C++, distributed under the GNU Lesser General Public License v3.0, and was first published in 2023. Key topics include: cpp, cpp17, instruction-based-sampling, library, linux.
perf-cpp: Hardware Performance Monitoring for C++
perf-cpp lets you profile specific parts of your code, not the entire program.
Tools like Linux Perf, Intel® VTune™, and AMD uProf profile everything: application startup, configuration parsing, data loading, and all your helper functions.
perf-cpp is different: place start() and stop() around exactly the code you want to measure.
Profile one sorting algorithm.
Measure cache misses in your hash table lookup.
Compare two memory allocators.
Features
Built around Linux's perf subsystem, perf-cpp supports counting and sampling hardware events for specific code blocks:
- Record hardware events like `perf stat`, but only around the code you care about, not the entire binary (documentation)
- Calculate metrics like cycles per instruction or cache miss ratios from the counters (documentation)
- Read counter values without stopping for low-overhead measurements in tight loops (documentation)
- Sample instructions and memory accesses like `perf [mem] record`, but targeted at specific functions (documentation)
- Export and analyze results in your code: write samples to CSV, generate flame graphs, or correlate memory accesses with specific data structures
- Mix built-in and processor-specific events like cycles, cache misses, or vendor PMU features (documentation)
See the examples and full documentation for details.
Quick Start
Record Hardware Event Statistics
```cpp
#include <iostream>
#include <perfcpp/event_counter.hpp>

// Initialize the counter
auto event_counter = perf::EventCounter{};

// Specify hardware events to count
event_counter.add({"seconds", "instructions", "cycles", "cache-misses"});

// Run the workload
event_counter.start();
code_to_profile(); // <-- Statistics recorded during execution
event_counter.stop();

// Print the result to the console
const auto result = event_counter.result();
for (const auto [event_name, value] : result) {
    std::cout << event_name << ": " << value << std::endl;
}
```
Possible output:
seconds: 0.0955897
instructions: 5.92087e+07
cycles: 4.70254e+08
cache-misses: 1.35633e+07
[!NOTE]
See the guides on recording event statistics and event statistics on multiple CPUs/threads.
Check out the hardware events documentation for built-in and processor-specific events.
Record Samples
```cpp
#include <iostream>
#include <perfcpp/sampler.hpp>

// Create the sampler
auto sampler = perf::Sampler{};

// Specify when a sample is recorded: every 50,000th cycle
sampler.trigger("cycles", perf::Period{50000U});

// Specify what data is included in a sample: time, CPU ID, instruction
sampler.values()
    .timestamp(true)
    .cpu_id(true)
    .logical_instruction_pointer(true);

// Run the workload
sampler.start();
code_to_profile(); // <-- Samples recorded during execution
sampler.stop();

const auto samples = sampler.result();

// Export samples to CSV.
samples.to_csv("samples.csv");

// Or access samples programmatically.
for (const auto& record : samples) {
    const auto timestamp = record.metadata().timestamp().value();
    const auto cpu_id = record.metadata().cpu_id().value();
    const auto instruction = record.instruction_execution().logical_instruction_pointer().value();

    std::cout
        << "Time = " << timestamp << " | CPU = " << cpu_id
        << " | Instruction = 0x" << std::hex << instruction << std::dec
        << std::endl;
}
```
Possible output:
Time = 365449130714033 | CPU = 8 | Instruction = 0x5a6e84b2075c
Time = 365449130913157 | CPU = 8 | Instruction = 0x64af7417c75c
Time = 365449131112591 | CPU = 8 | Instruction = 0x5a6e84b2075c
Time = 365449131312005 | CPU = 8 | Instruction = 0x64af7417c75c
[!NOTE]
See the sampling guide for what data you can record.
Also check out the sampling on multiple CPUs/threads guide for parallel sampling.
Building
perf-cpp can be built as a static or shared library.
```bash
git clone https://github.com/jmuehlig/perf-cpp.git
cd perf-cpp
cmake . -B build
cmake --build build
```
[!NOTE]
See the building guide for CMake integration and build options.
Documentation
The full documentation is available at jmuehlig.github.io/perf-cpp.
See also: Examples | Changelog
System Requirements
- Clang / GCC with support for C++17 features.
- CMake version 3.10 or higher.
- Linux Kernel 4.0 or newer (some features require a newer kernel).
- `perf_event_paranoid` setting: Adjust as needed to allow access to performance counters (see the perf paranoid documentation).
- Python3, if you make use of processor-specific hardware event generation.
Contributing
Contributions are welcome. Open an issue or submit a pull request.
To build and run the tests:
```bash
cmake . -B build -DBUILD_TESTS=ON
cmake --build build --target tests
./build/bin/tests
```
[!NOTE]
Most tests require access to hardware performance counters via `perf_event_open`. If your system restricts access (e.g., in containers or VMs), some tests will fail. See the perf paranoid documentation.
For questions or feedback: jan.muehlig@tu-dortmund.de.
Further PMU-related Projects
Other profiling tools:
- PAPI monitors CPU counters, GPUs, I/O, and more.
- Likwid is a set of command-line tools for benchmarking with an extensive wiki.
- PerfEvent is a lightweight wrapper for performance counters.
- Intel's Instrumentation and Tracing Technology lets you control Intel VTune Profiler from your code.
- Want to go lower-level? Use perf_event_open directly.
Resources on Profiling
Papers and articles about hardware performance profiling:
Academic Papers
- Quantitative Evaluation of Intel PEBS Overhead for Online System-Noise Analysis (2017)
- Analyzing memory accesses with modern processors (2020)
- On the Precision of Precise Event Based Sampling (2020)
- CachePerf: A Unified Cache Miss Classifier via Hybrid Hardware Sampling (2022)
- Precise Event Sampling on AMD Versus Intel: Quantitative and Qualitative Comparison (2023)
- Efficient Cross-platform Multiplexing of Hardware Performance Counters via Adaptive Grouping (2024)
- Multi-level Memory-Centric Profiling on ARM Processors with ARM SPE (2024)
- Breaking the Cycle - A Short Overview of Memory-Access Sampling Differences on Modern x86 CPUs (2025)
Blog Posts
- C2C - False Sharing Detection in Linux Perf (2016)
- PMU counters and profiling basics (2018)
- Advanced profiling topics. PEBS and LBR (2018)
- The Linux perf Event Scheduling Algorithm (2019)
- Performance Speed Limits (2019)
- Detect false sharing with Data Address Profiling (2019)
- Data-type profiling for perf (2023)
- Analyze cache behavior with Perf C2C on Arm (2023)