cuda-oxide is an experimental Rust-to-CUDA compiler that compiles idiomatic Rust directly to PTX via a custom rustc codegen backend; no DSLs or nvcc required.
Key Takeaways
#[cuda_module] and #[kernel] macros embed device PTX into the host binary and generate typed launch methods per kernel.
DisjointSlice<T> enforces aliasing safety at the type level, blocking multiple threads from writing the same index.
Async GPU execution is supported: compose DeviceOperation graphs, schedule them across stream pools, and await completion with .await under tokio.
Build and run with cargo oxide run; lower-level cuda_launch! and load_kernel_module APIs remain available for custom workflows.
v0.1.0 is early alpha: expect API breakage, incomplete features, and bugs.
Hacker News Comment Review
Commenters who previously used cudarc see this as a potential near drop-in replacement; there is debate over whether build times improve, given that cuda-oxide bypasses nvcc/CMake entirely.
A key open question is host-device struct sharing without manual byte serialization, which existing Rust/CUDA workflows handle awkwardly.
Concern raised that Rust’s bounds checks could consume extra GPU registers, reducing kernel occupancy and concurrency.
Notable Comments
@foo-bar-baz529: bounds checks may increase register pressure per thread, lowering SIMT concurrency on real kernels.
@arpadav: enumerates four concrete safety wins over CUDA C++: drop semantics replacing cudaFree, typed kernel args vs void* pointer arrays, alias prevention via DisjointSlice, and enforced thread-index constructors.