A PyTorch Library for Accelerating 3D Deep Learning Research
VUDA is a header-only library based on Vulkan that provides a CUDA Runtime API interface for writing GPU-accelerated applications
🎉 Modern CUDA learning notes with PyTorch: fp32/tf32, fp16/bf16, fp8/int8, flash_attn, rope, sgemm, sgemv, warp/block reduce, dot, elementwise, softmax, layernorm, rmsnorm
A brian2 extension to simulate spiking neural networks on GPUs
SDK for GPU accelerated genome assembly and analysis
A collection of GICP-based fast point cloud registration algorithms
A highly optimised C++ library for mathematical applications and neural networks
Canny edge detector implemented in CUDA C/C++
(2024/2025) A library and environment for parallel processing in a power-limited CPU+GPU cluster environment
Decoding Attention is specially optimized for multi-head attention (MHA) using CUDA cores for the decoding stage of LLM inference
Best practices & guides on how to write distributed PyTorch training code
An architecture for continual learning and long-term memory in LLMs
Templated C++/CUDA implementation of Model Predictive Path Integral Control (MPPI)
fat_llama is a Python package for upscaling audio files to FLAC or WAV formats using advanced audio processing techniques
A low-footprint, GPU-accelerated speech-to-text Python package for the JetPack 5 era, bolstered by an optimized graph
A Discord bot running on llama.cpp with the Llama 3 model and image generation