Ecosystem of libraries and tools for writing and executing fast GPU code fully in Rust.
APACHE-2.0 License
Top: upcoming MPM engine that runs on CPU and GPU using rust-cuda, Bottom: toy path tracer that can run on CPU, GPU, and GPU (hardware raytracing) using recent experiments with OptiX
Today marks an exciting milestone for the Rust CUDA Project. Over the past couple of months, we have made significant advancements in supporting many of the fundamental CUDA ecosystem libraries. The main changes in this release are to cust, made so that future library support is possible, but we will also highlight some of the WIP experiments we have been conducting.
This release is likely to be the biggest and most breaking change to cust ever. We had to fundamentally rework how many things work to:
Therefore, this release is guaranteed to break your code. However, the changes should not break too much unless you did a lot of lower-level work with device memory constructs.
This release is gigantic, so here are the main things you need to worry about:
- `Context::create_and_push(FLAGS, device)` -> `Context::new(device)`.
- `Module::from_str(PTX)` -> `Module::from_ptx(PTX, &[])`.
The way that contexts are handled in cust has been completely overhauled. It now uses primary context handling instead of the normal driver API context APIs. This is aimed at future-proofing cust for libraries such as cuBLAS and cuFFT, as well as simplifying the context handling APIs overall. This does mean that the API changed a bit:
- `create_and_push` is now `new`, and it only takes a device, not a device and flags.
- `set_flags` is now used for setting context flags.
- `ContextStack`, `UnownedContext`, and other legacy APIs are gone.

The old context handling is fully present in `cust::context::legacy` for anyone who needs it for specific reasons. If you use `quick_init`, you don't need to worry about any breaking changes; the API is the same.
cust_core
`DeviceCopy` has now been split into its own crate, `cust_core`. The crate is `#![no_std]`, which allows you to pull in `cust_core` in GPU crates for deriving `DeviceCopy` without cfg shenanigans.
Removed or replaced
- `DeviceBox::wrap`, use `DeviceBox::from_raw`.
- `DeviceSlice::as_ptr` and `DeviceSlice::as_mut_ptr`. Use `DeviceSlice::as_device_ptr` then `DevicePointer::as_(mut)_ptr`.
- `DeviceSlice::chunks` and consequently `DeviceChunks`.
- `DeviceSlice::chunks_mut` and consequently `DeviceChunksMut`.
- `DeviceSlice::from_slice` and `DeviceSlice::from_slice_mut` because they were unsound.
- `DevicePointer::as_raw_mut` (use `DevicePointer::as_mut_ptr`).
- `DevicePointer::wrap` (use `DevicePointer::from_raw`).
- `DeviceSlice` no longer implements `Index` and `IndexMut`; switching away from `[T]` made this impossible to implement. Use `DeviceSlice::index`, which behaves the same.
- `vek` is no longer re-exported.
- `Module::from_str`, use `Module::from_ptx` and pass `&[]` for options.
- `Module::load_from_string`, use `Module::from_ptx_cstr`.

Added
- `cust::memory::LockedBox`, same as `LockedBuffer` except for single elements.
- `cust::memory::cuda_malloc_async`.
- `cust::memory::cuda_free_async`.
- `impl AsyncCopyDestination<LockedBox<T>> for DeviceBox<T>` for async HtoD/DtoH memcpy.
- `DeviceBox::new_async`.
- `DeviceBox::drop_async`.
- `DeviceBox::zeroed_async`.
- `DeviceBox::uninitialized_async`.
- `DeviceBuffer::uninitialized_async`.
- `DeviceBuffer::drop_async`.
- `DeviceBuffer::zeroed`.
- `DeviceBuffer::zeroed_async`.
- `DeviceBuffer::cast`.
- `DeviceBuffer::try_cast`.
- `DeviceSlice::set_8` and `DeviceSlice::set_8_async`.
- `DeviceSlice::set_16` and `DeviceSlice::set_16_async`.
- `DeviceSlice::set_32` and `DeviceSlice::set_32_async`.
- `DeviceSlice::set_zero` and `DeviceSlice::set_zero_async`.
- `bytemuck` feature, which is enabled by default.
- `impl_mint`.
- `impl_half`.
- `impl_glam`.
- `cust::external::ExternalMemory`.
- `DeviceBuffer::as_slice`.
- `DeviceVariable`, a simple wrapper around `DeviceBox<T>` and `T` which allows easy management of a CPU and GPU version of a type.
- `DeviceMemory`, a trait describing any region of GPU memory that can be described with a pointer + a length.
- `memcpy_htod`, a wrapper around `cuMemcpyHtoD_v2`.
- `mem_get_info` to query the amount of free and total memory.
- `DevicePointer::as_ptr` and `DevicePointer::as_mut_ptr` for `*const T` and `*mut T`.
- `DevicePointer::from_raw` for `CUdeviceptr -> DevicePointer<T>` with a safe function.
- `DevicePointer::cast`.
- `cust_core` for `DeviceCopy`.
- `ModuleJitOption`, `JitFallback`, `JitTarget`, and `OptLevel` for specifying options when loading a module. Note that `ModuleJitOption::MaxRegisters` does not seem to work currently, but NVIDIA is looking into it. A register limit can still be applied ahead of time by compiling the PTX to a cubin with nvcc: `nvcc --cubin foo.ptx -maxrregcount=REGS`
- `Module::from_fatbin`.
- `Module::from_cubin`.
- `Module::from_ptx` and `Module::from_ptx_cstr`.

Changed
- `Stream`, `Module`, `Linker`, `Function`, `Event`, `UnifiedBox`, `ArrayObject`, `LockedBuffer`, `LockedBox`, `DeviceSlice`, `DeviceBuffer`, and `DeviceBox` all now impl `Send` and `Sync`.
- `zeroed` functions on `DeviceBox` and others are no longer unsafe and instead now require `T: Zeroable`. The functions are only available with the `bytemuck` feature.
- `Stream::add_callback` now internally uses `cuLaunchHostFunc`, anticipating the deprecation and removal of `cuStreamAddCallback` per the driver docs. This does, however, mean that the function no longer takes a device status as a parameter and does not execute on context error.
- `Linker::complete` now only returns the built cubin, and not the cubin and a duration.
- Features such as `vek` for implementing `DeviceCopy` are now `impl_cratename`, e.g. `impl_vek`, `impl_half`, etc.
- `DevicePointer::as_raw` now returns a `CUdeviceptr` instead of a `*const T`.
- `num-complex` integration is now behind `impl_num_complex`, not `num-complex`.
- `DeviceBox` now requires `T: DeviceCopy` (previously it didn't but almost all its methods did).
- `DeviceBox::from_raw` now takes a `CUdeviceptr` instead of a `*mut T`.
- `DeviceBox::as_device_ptr` now requires `&self` instead of `&mut self`.
- `DeviceBuffer` now requires `T: DeviceCopy`.
- `DeviceBuffer` is now `repr(C)` and is represented by a `DevicePointer<T>` and a `usize`.
- `DeviceSlice` now requires `T: DeviceCopy`.
- `DeviceSlice` is now represented as a `DevicePointer<T>` and a `usize` (and is `repr(C)`) instead of `[T]`, which was definitely unsound.
- `DeviceSlice::as_ptr` and `DeviceSlice::as_ptr_mut` now both return a `DevicePointer<T>`.
- `DeviceSlice` is now `Clone` and `Copy`.
- `DevicePointer::as_raw` now returns a `CUdeviceptr`, not a `*const T` (use `DevicePointer::as_ptr`).
- In `CudaError`, `InvalidSouce` is now `InvalidSource`, no more invalid sauce 🍅🥣

The libnvvm codegen can now generate line tables while optimizing (previously it could generate debug info but not optimize), which allows you to debug and profile kernels much better in tools like Nsight Compute. You can enable debug info creation using `.debug(DebugInfo::LineTables)` with `cuda_builder`.
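To make the `DeviceSlice` representation change listed above more concrete, here is a minimal host-only sketch of a slice described by a device address plus a length. It is an illustrative model and not cust's actual definition: `ModelDeviceSlice` and `RawDevicePtr` are hypothetical names, and a plain `u64` stands in for `CUdeviceptr`. One plausible reading of why `Index` could not be kept is that `Index::index` must return a reference to an existing value, whereas a sub-slice in this representation is a freshly computed pointer and length, so it has to be returned by value.

```rust
use std::marker::PhantomData;
use std::ops::Range;

/// Hypothetical stand-in for a raw device address (an assumption for this sketch).
pub type RawDevicePtr = u64;

/// Illustrative model only: a device slice as an address plus an element count.
/// The host never dereferences `ptr`, it only does address arithmetic on it.
#[repr(C)]
pub struct ModelDeviceSlice<T> {
    ptr: RawDevicePtr,
    len: usize,
    _marker: PhantomData<*mut T>,
}

// Manual impls so that `Clone`/`Copy` do not require `T: Clone`/`T: Copy`.
impl<T> Clone for ModelDeviceSlice<T> {
    fn clone(&self) -> Self {
        *self
    }
}
impl<T> Copy for ModelDeviceSlice<T> {}

impl<T> ModelDeviceSlice<T> {
    pub fn new(ptr: RawDevicePtr, len: usize) -> Self {
        Self { ptr, len, _marker: PhantomData }
    }

    pub fn len(&self) -> usize {
        self.len
    }

    pub fn is_empty(&self) -> bool {
        self.len == 0
    }

    pub fn as_raw(&self) -> RawDevicePtr {
        self.ptr
    }

    /// Returns a sub-slice by value: a new address and a new length.
    /// Panics if the range is out of bounds, like slice indexing does.
    pub fn index(&self, range: Range<usize>) -> Self {
        assert!(
            range.start <= range.end && range.end <= self.len,
            "range out of bounds"
        );
        let byte_offset = (range.start * std::mem::size_of::<T>()) as u64;
        Self::new(self.ptr + byte_offset, range.end - range.start)
    }
}
```

Because the model is just two plain fields, copying it is cheap and never touches device memory, which fits the note that `DeviceSlice` is now `Clone` and `Copy`.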
Using the generous work of @anderslanglands, we were able to get rust-cuda to target hardware raytracing completely in Rust (both for the host and the device). The toy path tracer example has been ported to use hardware RT as a backend. However, `optix` and `optix_device` are not published on crates.io yet since they are still highly experimental.
using hardware rt to render a simple mesh
Work on supporting cuBLAS through a high-level wrapper library has started. A lot of work needed to be done in cust to interop with cuBLAS, which is a runtime-API-based library. This required some changes to how cust handles contexts to avoid dropping context resources cuBLAS was using. The library is not yet published but will be once it is more complete. cuBLAS is a big piece of neural network training on the GPU, so it is critical to support it.
@frjnn has been generously working on wrapping the cuDNN library. cuDNN is the primary tool used to train neural networks on the GPU, and it is used by PyTorch and TensorFlow. High-level bindings to cuDNN are a major step toward making machine learning in Rust a viable option. This work is still very much in progress, so it is not published yet. It will be published once it is usable and will likely first be used in neuronika for GPU neural network training.
Work on supporting GPU-side atomics in cuda_std has started. Some preliminary work is already published in cuda_std; however, it is still very much in progress and subject to change. Atomics are a difficult issue due to the vast number of options available for GPU atomics, including:
You can read more about it here.
Published by RDambrosio016 almost 3 years ago
This release marks the start of fixing many of the fundamental issues in the codegen, as well as implementing some of the most needed features for writing performant kernels.
This release mostly covers quality of life changes, bug fixes, and some performance improvements.
The required nightly has been updated to 12/4/21. This fixes rust-analyzer not working sometimes.
Dead code elimination (DCE) has been implemented. We switched to an alternative way of linking together dependencies, which drastically reduces the amount of work libnvvm has to do and removes any globals or functions not directly or indirectly used by kernels. This reduced the PTX size of the path tracer example from about 20 kloc to 2.3 kloc.
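The idea of keeping only what kernels use "directly or indirectly" can be sketched as a reachability walk over a call graph. This is a toy illustration of the general technique, not the codegen's actual implementation; the function names and the example symbols (`render_kernel`, `unused_debug_helper`, and so on) are hypothetical.

```rust
use std::collections::{HashMap, HashSet};

/// Returns every symbol reachable from `roots` (the kernels) in `calls`,
/// where `calls` maps a symbol to the symbols it references.
pub fn reachable<'a>(
    calls: &HashMap<&'a str, Vec<&'a str>>,
    roots: &[&'a str],
) -> HashSet<&'a str> {
    let mut live = HashSet::new();
    let mut stack: Vec<&'a str> = roots.to_vec();
    while let Some(sym) = stack.pop() {
        // `insert` returns true only the first time a symbol is seen.
        if live.insert(sym) {
            if let Some(callees) = calls.get(sym) {
                stack.extend(callees.iter().copied());
            }
        }
    }
    live
}

/// Builds a small hypothetical call graph and returns the sorted live set.
pub fn example_live_set() -> Vec<&'static str> {
    let mut calls: HashMap<&'static str, Vec<&'static str>> = HashMap::new();
    calls.insert("render_kernel", vec!["trace_ray", "shade"]);
    calls.insert("trace_ray", vec!["intersect"]);
    calls.insert("shade", vec![]);
    calls.insert("intersect", vec![]);
    // Never referenced from a kernel, so neither of these stays live.
    calls.insert("unused_debug_helper", vec!["format_number"]);
    calls.insert("format_number", vec![]);

    let mut live: Vec<_> = reachable(&calls, &["render_kernel"]).into_iter().collect();
    live.sort();
    live
}
```

Anything outside the returned set is a candidate for removal before the module is handed to the next stage, which is where the reduction in work comes from.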
CUDA address spaces have been mostly implemented. Any user-defined static that does not rely on interior mutability will be placed in the constant address space (`__constant__`); otherwise it will be placed in the generic address space (which is global for globals). This also allowed us to implement basic static shared memory support.
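As a rough illustration of the rule above, the two statics below differ only in whether they rely on interior mutability. This is plain Rust that compiles on the host; the comments state where the notes above say the codegen would place each static, and the names `GAUSS_WEIGHTS`, `HIT_COUNTER`, and `blur3` are hypothetical.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// No interior mutability: per the rule described above, a static like this
/// would be placed in the constant address space.
pub static GAUSS_WEIGHTS: [f32; 3] = [0.25, 0.5, 0.25];

/// An atomic relies on interior mutability, so per the rule described above
/// it would be placed in the generic (global) address space instead.
pub static HIT_COUNTER: AtomicU32 = AtomicU32::new(0);

/// Reads the immutable table and bumps the mutable counter.
pub fn blur3(a: f32, b: f32, c: f32) -> f32 {
    HIT_COUNTER.fetch_add(1, Ordering::Relaxed);
    GAUSS_WEIGHTS[0] * a + GAUSS_WEIGHTS[1] * b + GAUSS_WEIGHTS[2] * c
}
```

The distinction matters because constant memory is read-only from the kernel's point of view, so only data that can never change through a shared reference is a safe fit for it.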
The codegen automatically overrides calls to libm with calls to libdevice. This is to allow existing no_std crates to take advantage of architecture-optimized math intrinsics. This can be disabled from cuda_builder if you need strict determinism. This also reduces PTX size a good amount in math-heavy kernels (3.8kloc to 2.3kloc in our path tracer). It also reduces register usage by a little bit, which can yield performance gains.
- `cuda_std::ptr`.
- `#[externally_visible]` for making sure the codegen does not eliminate a function if not used by a kernel.
- `#[address_space(...)]` for making the codegen put a static in a specific address space, mostly internal and unsafe.
- `cuda_std::shared_array!`
Cust 0.2 was actually released some time ago but these were the changes in 0.2 and 0.2.1:
- `Device::as_raw`.
- `MemoryAdvise` for unified memory advising.
- `MemoryAdvise::prefetch_host` and `MemoryAdvise::prefetch_device` for telling CUDA to explicitly fetch unified memory somewhere.
- `MemoryAdvise::advise_read_mostly`.
- `MemoryAdvise::preferred_location` and `MemoryAdvise::unset_preferred_location`.
- `StreamFlags::NON_BLOCKING` has been temporarily disabled because of soundness concerns.
- `GpuBox::as_device_ptr` and `GpuBuffer::as_device_ptr` now take `&self` instead of `&mut self`.
- `DBuffer` -> `DeviceBuffer`. This is how it was in rustacuda, but it was changed
- `DBox` -> `DeviceBox`.
- `DSlice` -> `DeviceSlice`.
- `GpuBox::as_device_ptr_mut` and `GpuBuffer::as_device_ptr_mut`.
- `vek` default feature.
- `vek` feature now uses `default-features = false`; this also means `Rgb` and `Rgba` no longer implement `DeviceCopy`.