(2024/2025) A library and environment for parallel processing in a power-limited CPU+GPU cluster environment
Decoding Attention is specially optimized for multi-head attention (MHA), using CUDA cores for the decoding stage of LLM inference
Best practices & guides on how to write distributed PyTorch training code
Gradient Descent Optimizers and Genetic Algorithms using GPUs, CPUs, and FPGAs via CUDA, OpenCL, and oneAPI
nvImageCodec: a library of GPU- and CPU-accelerated codecs featuring a unified interface
A high-performance inference system for large language models, designed for production environments
A simple yet sufficiently fast (attenuated) Radon transform and backprojection implementation using KernelAbstractions
Glasses detection, classification and segmentation
A simple profiler to count NVIDIA PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis
💣 SMH – a computer vision project for automatic, precision mortar strike calculations in Squad
Simple experimental async GPGPU framework for Rust
A highly optimised C++ library for mathematical applications and neural networks