Decoding Attention is specially optimized for multi-head attention (MHA) using CUDA cores for the decoding stage of LLM inference.
BSD-3-Clause License
Object tracking using yolov8, fast-reid, and deepsort
CUDA C++ Core Libraries
A Rust library integrated with ONNXRuntime, providing a collection of Computer Vision and Vision-L...
Low-latency CUDA JPEG decoder by parallelizing Huffman decoding
Implementation of the Apriori and Eclat algorithms, two of the best-known basic algorithms for mi...
Instant-ngp in pytorch+cuda trained with pytorch-lightning (high quality with high speed, with on...
Superfast CUDA implementation of Word2Vec and Latent Dirichlet Allocation (LDA)
🎉 Modern CUDA Learn Notes with PyTorch: fp32/tf32, fp16/bf16, fp8/int8, flash_attn, rope, sgemm, ...
Real-time dense visual SLAM system
Real-time large-scale dense visual SLAM system
A high-performance inference system for large language models, designed for production environments.
A curated list of awesome GPGPU (CUDA/OpenCL/Vulkan) resources
Some CUDA design patterns and a bit of template magic for CUDA
LLaMa 7b with CUDA acceleration implemented in Rust. Minimal GPU memory needed!
An architecture for continual learning and long-term memory in LLMs
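One of the entries above mentions the Apriori algorithm for mining frequent itemsets. As a rough illustration of the idea behind that entry (not the linked project's actual code), here is a minimal sketch in Python: count k-itemsets level by level, and prune any candidate whose (k-1)-subsets are not all frequent.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Minimal Apriori sketch: return all itemsets appearing in at
    least `min_support` transactions, mapped to their support counts.

    transactions: list of sets of items
    min_support: minimum absolute support (transaction count)
    """
    # Level 1: count individual items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        # Candidate generation: k-subsets of items seen in frequent sets.
        items = sorted(set().union(*frequent))
        candidates = {frozenset(c) for c in combinations(items, k)}
        # Apriori pruning: every (k-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        # Count support with one scan over the transactions.
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result
```

For example, `apriori([{'a','b'}, {'a','c'}, {'a','b','c'}, {'b'}], 2)` reports `{'a','b'}` and `{'a','c'}` as frequent pairs but drops `{'b','c'}`, which occurs only once. Eclat reaches the same result with a vertical layout (item-to-transaction-id sets intersected depth-first) instead of repeated horizontal scans.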