row-major matmul optimization
GPL-3.0 License
2023/08: the aarch64 backend adds cmake and mperf support; try `-DMPERF_ENABLE=ON`!
row-major matmul optimization tutorial
backend | armv7 | aarch64 | aarch64-int8 | cuda | cuda-int4 | vulkan | x86 |
---|---|---|---|---|---|---|---|
support | ✔️ | ✔️ | ✔️ | ✔️ | - | ✔️ | ✅ |
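
As a point of reference for what "row-major matmul" means here, the following is a minimal, unoptimized sketch in C. It is not taken from the repository; the function name `matmul_naive` and the stride parameters `lda`, `ldb`, `ldc` are assumptions chosen for illustration.

```c
/* Naive reference GEMM: C = A * B, with all matrices stored row-major.
 * A is m x k, B is k x n, C is m x n.
 * lda, ldb, ldc are row strides: the number of elements between the
 * starts of two consecutive rows. */
void matmul_naive(int m, int n, int k,
                  const float *a, int lda,
                  const float *b, int ldb,
                  float *c, int ldc)
{
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p) {
                /* Row-major: element (row, col) lives at row * stride + col. */
                acc += a[i * lda + p] * b[p * ldb + j];
            }
            c[i * ldc + j] = acc;
        }
    }
}
```

Note that the inner loop walks `b` with a stride of `ldb`, which is the kind of cache-unfriendly access pattern that optimized kernels usually try to avoid.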
All backends and corresponding tutorials
backend | tutorial |
---|---|
aarch64 | Getting started with GEMM |
aarch64 | GEMM caching |
aarch64-int8 | - |
armv7 | ARMv7 4x4 kernel: a lazy person's optimization exercise |
cuda | The right way to get started with CUDA: how-to-optimize-gemm |
cuda-int4 WIP | int4 alchemy essentials |
vulkan | How to get started with Vulkan in a hurry |
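
Several of the tutorials and implementation names refer to a 4x4 kernel. The sketch below shows the general idea of 4x4 register blocking in plain C: each call computes a 4x4 tile of the output while keeping 16 accumulators live, so every loaded element of A and B is reused four times. It is only an illustration under stated assumptions, not the repository's implementation; `kernel_4x4` and `matmul_4x4` are hypothetical names, and the driver assumes `m` and `n` are multiples of 4.

```c
/* Accumulates one 4x4 tile: C_tile += A_rows * B_cols (row-major).
 * a points at the first of 4 consecutive rows of A,
 * b points at the first of 4 consecutive columns of B,
 * c points at the top-left element of the 4x4 output tile. */
static void kernel_4x4(int k,
                       const float *a, int lda,
                       const float *b, int ldb,
                       float *c, int ldc)
{
    float acc[4][4] = {{0}};  /* 16 accumulators, ideally kept in registers */

    for (int p = 0; p < k; ++p) {
        for (int i = 0; i < 4; ++i) {
            float a_ip = a[i * lda + p];          /* loaded once, used 4 times */
            for (int j = 0; j < 4; ++j)
                acc[i][j] += a_ip * b[p * ldb + j];
        }
    }

    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            c[i * ldc + j] += acc[i][j];
}

/* C = A * B using 4x4 tiles. Assumes m and n are multiples of 4. */
void matmul_4x4(int m, int n, int k,
                const float *a, int lda,
                const float *b, int ldb,
                float *c, int ldc)
{
    /* Clear the output, since the kernel accumulates into it. */
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j)
            c[i * ldc + j] = 0.0f;

    for (int i = 0; i < m; i += 4)
        for (int j = 0; j < n; j += 4)
            kernel_4x4(k, a + i * lda, lda, b + j, ldb, c + i * ldc + j, ldc);
}
```

A production kernel would typically replace the two inner loops with SIMD instructions and handle edge tiles whose size is not a multiple of 4.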
Usage is similar for all backends:

For the first run, point `OLD` and `NEW` in the `makefile` to the same implementation, for example:

    $ cd aarch64
    $ cat makefile
    OLD := MMult_4x4_10
    NEW := MMult_4x4_10
    ..

`make run` compiles and runs the implementation that `NEW` points to, and copies `output_MMult_4x4_10.m` to `output_new.m`:

    $ make run
    $ cat output_new.m

Install the Python dependencies and plot the results:

    $ python3 -m pip install -r ../requirements.txt
    $ python3 plot.py
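
The exact contents of `output_new.m` are not shown here. As general background on how GEMM benchmarks are usually scored, the sketch below assumes throughput is reported in GFLOPS and compared against a hardware peak. The function names are hypothetical and do not come from the repository.

```c
/* Floating-point operations in one m x n x k GEMM:
 * one multiply and one add for each of the m * n * k inner-loop steps.
 * The leading 2.0 makes the whole product a double, avoiding int overflow. */
double gemm_flops(int m, int n, int k)
{
    return 2.0 * m * n * k;
}

/* Throughput in GFLOPS, given the measured wall-clock time in seconds. */
double gemm_gflops(int m, int n, int k, double seconds)
{
    return gemm_flops(m, n, k) / seconds / 1e9;
}

/* Achieved throughput as a percentage of the hardware peak
 * (peak_gflops would come from a tool that measures hardware limits). */
double percent_of_peak(double gflops, double peak_gflops)
{
    return 100.0 * gflops / peak_gflops;
}
```

For instance, a 1000 x 1000 x 1000 multiply that takes 2 seconds performs 2e9 operations, which works out to 1 GFLOPS.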

Each hardware backend has subtle differences. For example, `NEW` may use a different name.

A. Prepare an armv7/aarch64 Linux development environment; a Raspberry Pi, an rk3399, or an AWS arm server all work.

B. By default `ARCH := native`, so you can build and run directly:

    $ cd armv8 && make run

chgemm is an int8 gemm library.
Compared to the code in this tutorial, the differences are:
chgemm has been merged into the ncnn INT8 convolution implementation.
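
For readers new to int8 GEMM, the key difference from the float case is that products of 8-bit inputs must be accumulated in a wider type. The sketch below is a plain reference loop written for illustration only; it is not chgemm's API, and the name `matmul_int8` and its signature are assumptions.

```c
#include <stdint.h>

/* Reference int8 GEMM: C = A * B, row-major.
 * Inputs are signed 8-bit; the output and accumulator are 32-bit.
 * A single product can be as large as (-128) * (-128) = 16384, so the
 * sum of just two products can already exceed the int16 range. */
void matmul_int8(int m, int n, int k,
                 const int8_t *a, int lda,
                 const int8_t *b, int ldb,
                 int32_t *c, int ldc)
{
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            int32_t acc = 0;
            for (int p = 0; p < k; ++p) {
                /* Widen before multiplying so nothing is truncated. */
                acc += (int32_t)a[i * lda + p] * (int32_t)b[p * ldb + j];
            }
            c[i * ldc + j] = acc;
        }
    }
}
```

Optimized int8 kernels often use narrower intermediate accumulators for speed and widen periodically, which is one reason their boundary and overflow handling needs care.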

flame, referenced by the x86 backend, is the original implementation and differs from this repo in several ways:

- The original is an x86 SSE version.
- `MMult_4x4_17.c` as currently written can reach 70% of the armv8.1 CPU peak.
- `sub_kernel` uses only the simplest kind of assembly; practical use needs a simple adjustment.
- `octave` was dropped (setting up its environment again on every embedded device is too troublesome), and `python` is used instead.

The cuda version is faster than NVIDIA cuBLAS.

$ apt install libopenblas-dev
The vulkan build depends on the kompute API wrapper; see the vulkan build documentation for details.
More on how to learn compute shaders
WIP
megpeak: measures peak hardware performance; supports arm/x86/OCL..
perf: available among Linux system tools; used for system-level performance analysis and disassembly
YHs_Sample: an expert's implementation
mperf: optimization tools