Decoding Attention is specially optimized for multi-head attention (MHA) using CUDA cores for the decoding stage of LLM inference
A high-performance inference system for large language models, designed for production environments
A simple profiler that counts NVIDIA PTX assembly instructions in OpenCL/SYCL/CUDA kernels for roofline model analysis
A highly optimised C++ library for mathematical applications and neural networks