A simple profiler to count Nvidia PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis.
To compile, run `chmod +x make.sh` and `./make.sh path/to/kernel.ptx`.

Generate a `.ptx` file from your application; this works only with an Nvidia GPU. With the OpenCL-Wrapper, you can simply uncomment `#define PTX` in `src/opencl.hpp`, then compile and run. A file `kernel.ptx` is created, containing the PTX assembly code.

Run `bin/PTXprofiler.exe path/to/kernel.ptx`. For FluidX3D, for example, this table is generated:

kernel name |flops (float int bit )|copy |branch|cache (load store)|memory (load cached store)
--------------------------------|---------------------------|------|------|--------------------|---------------------------
initialize | 283 129 61 93| 33| 6| 0 0 0| 135 35 0 100
stream_collide | 363 261 35 67| 23| 2| 0 0 0| 153 77 0 76
update_fields | 160 56 37 67| 21| 2| 0 0 0| 93 77 0 16
voxelize_mesh | 170 91 34 45| 40| 11| 84 48 36| 37 36 0 1
transfer_extract_fi | 460 0 221 239| 122| 63| 0 0 0| 180 80 20 80
transfer__insert_fi | 483 0 247 236| 115| 47| 0 0 0| 180 80 20 80
transfer_extract_rho_u_flags | 47 0 39 8| 23| 1| 0 0 0| 68 34 0 34
transfer__insert_rho_u_flags | 47 0 39 8| 23| 1| 0 0 0| 68 34 0 34
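
For further processing, the rows of such a table can be read into a dictionary. The following is a minimal sketch, not part of PTXprofiler: the function names and the column keys (`cache_load`, `memory_cached`, and so on) are made up for illustration, and the parser assumes the table layout shown above with 13 integer counters per row. In the example table, each total equals the sum of its sub-counters, which the second function checks.

```python
def parse_profile_table(text):
    """Parse PTXprofiler-style table rows into {kernel_name: {column: count}}."""
    # Hypothetical column keys, in the order the counters appear in each row.
    columns = ["flops", "float", "int", "bit", "copy", "branch",
               "cache", "cache_load", "cache_store",
               "memory", "memory_load", "memory_cached", "memory_store"]
    kernels = {}
    for line in text.splitlines():
        if "|" not in line:
            continue
        name, *cells = line.split("|")
        numbers = " ".join(cells).split()
        # Skip the header and the dashed separator: keep only all-integer rows.
        if len(numbers) != len(columns) or not all(n.isdigit() for n in numbers):
            continue
        kernels[name.strip()] = dict(zip(columns, map(int, numbers)))
    return kernels


def totals_match(counts):
    """Check that each total equals the sum of its separately listed counters."""
    return (counts["flops"] == counts["float"] + counts["int"] + counts["bit"]
            and counts["cache"] == counts["cache_load"] + counts["cache_store"]
            and counts["memory"] == counts["memory_load"]
                                    + counts["memory_cached"]
                                    + counts["memory_store"])


# Two rows copied from the FluidX3D example table above.
sample = """\
kernel name |flops (float int bit )|copy |branch|cache (load store)|memory (load cached store)
--------------------------------|---------------------------|------|------|--------------------|---------------------------
stream_collide | 363 261 35 67| 23| 2| 0 0 0| 153 77 0 76
voxelize_mesh | 170 91 34 45| 40| 11| 84 48 36| 37 36 0 1
"""

kernels = parse_profile_table(sample)
for kernel_name, counts in kernels.items():
    print(kernel_name, counts["flops"], counts["memory"], totals_match(counts))
```

Matching on all-integer rows rather than on line position keeps the sketch independent of how many header lines precede the data.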

The table has these columns for each kernel:
- `flops`, in total but also listed separately as `float`, `int` and `bit`.
- `copy`.
- `branch`.
- `cache`, with separate counters for `load` and `store`.
- `memory`, with separate counters for `load`, `cached` (load from VRAM or L2 cache) and `store`.

Use the counted `flops` and `memory` accesses, together with the measured execution time of the kernel, to place it in a roofline model diagram.
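
As a rough illustration of that last step, the sketch below computes the two coordinates of a kernel in a roofline diagram: arithmetic intensity (operations per byte) and attained performance (operations per second). It rests on assumptions that the text above does not confirm: that the counters apply to one thread per kernel launch, and that the `memory` counter is in bytes. The thread count, execution time and hardware peaks are placeholder values, not measurements, and the function names are hypothetical.

```python
def roofline_point(flops_per_thread, bytes_per_thread, threads, seconds):
    """Return (arithmetic intensity in ops/byte, attained performance in ops/s)."""
    intensity = flops_per_thread / bytes_per_thread
    performance = flops_per_thread * threads / seconds
    return intensity, performance


def roofline_limit(intensity, peak_flops, peak_bandwidth):
    """Upper bound of the roofline model at a given arithmetic intensity."""
    return min(peak_flops, intensity * peak_bandwidth)


# Counters of stream_collide from the example table.
flops, memory = 363, 153

# Placeholder values for illustration only (assumptions, not measurements).
threads = 1_000_000       # number of threads in one kernel launch
seconds = 1.0e-3          # measured execution time of that launch
peak_flops = 1.0e13       # peak compute throughput of the GPU, in ops/s
peak_bandwidth = 5.0e11   # peak memory bandwidth of the GPU, in bytes/s

intensity, performance = roofline_point(flops, memory, threads, seconds)
limit = roofline_limit(intensity, peak_flops, peak_bandwidth)

print(f"arithmetic intensity: {intensity:.3f} ops/byte")
print(f"attained performance: {performance:.3e} ops/s")
print(f"roofline limit:       {limit:.3e} ops/s")
print("memory-bound" if intensity * peak_bandwidth < peak_flops else "compute-bound")
```

If the bandwidth term is the smaller of the two in `roofline_limit`, the kernel sits on the sloped, memory-bound part of the roofline; otherwise it sits under the flat, compute-bound part.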