CUDA-oxide: Nvidia's official Rust to CUDA compiler


TLDR

  • Experimental Rust-to-CUDA compiler that compiles idiomatic Rust directly to PTX via a custom rustc codegen backend; no DSLs or nvcc required.

Key Takeaways

  • #[cuda_module] and #[kernel] macros embed device PTX into the host binary and generate typed launch methods per kernel.
  • DisjointSlice<T> enforces aliasing safety at the type level, blocking multiple threads from writing the same index.
  • Async GPU execution is supported: compose DeviceOperation graphs, schedule them across stream pools, and await completion via tokio.
  • Build and run with cargo oxide run; lower-level cuda_launch! and load_kernel_module APIs remain available for custom workflows.
  • v0.1.0 is early alpha: expect API breakage, incomplete features, and bugs.
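Putting the pieces above together, a kernel definition might look like the following. This is a hedged pseudocode sketch: the `#[cuda_module]` and `#[kernel]` attributes and `DisjointSlice` come from the article, but the kernel body, the `thread::index()` helper, and the generated `launch` method's exact signature are assumptions, not confirmed API.

```
// Hypothetical sketch; names beyond the attributes are illustrative.
#[cuda_module]
mod kernels {
    #[kernel]
    pub fn saxpy(a: f32, x: &DisjointSlice<f32>, y: &mut DisjointSlice<f32>) {
        let i = thread::index(); // enforced thread-index constructor
        y[i] += a * x[i];
    }
}

// Host side: the macro embeds the compiled PTX in the binary and
// generates a typed launch method per kernel, so a call might read:
//   kernels::saxpy.launch(grid, block, (2.0, &x_dev, &mut y_dev))?;
```

The typed launch method is what replaces CUDA C++'s `void*` argument arrays: a wrong argument type becomes a compile error rather than a runtime crash.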
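The aliasing guarantee behind `DisjointSlice<T>` is the same one Rust's standard library already enforces on the CPU with `split_at_mut`: two mutable views of one buffer are allowed only if the type system can prove they never overlap. A minimal host-side analogue (plain std Rust, no GPU involved):

```rust
// Conceptual illustration of the DisjointSlice<T> guarantee using only
// the standard library: split_at_mut hands out two non-overlapping
// &mut slices, so two writers can each fill their own half with no
// possibility of racing on the same index. Holding two overlapping
// &mut views of `data` simply does not compile.
fn main() {
    let mut data = [0u32; 8];
    let (lo, hi) = data.split_at_mut(4);
    for (i, x) in lo.iter_mut().enumerate() {
        *x = i as u32; // writes indices 0..4
    }
    for (i, x) in hi.iter_mut().enumerate() {
        *x = (i + 4) as u32 * 10; // writes indices 4..8
    }
    assert_eq!(data, [0, 1, 2, 3, 40, 50, 60, 70]);
    println!("{:?}", data);
}
```

On the GPU the stakes are higher: thousands of threads write concurrently, so moving this proof to the type level blocks a whole class of data races at compile time.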

Hacker News Comment Review

  • Commenters who use cudarc see this as a potential near drop-in replacement, and debate whether build times improve given that cuda-oxide bypasses nvcc/CMake entirely.
  • A key open question is host-device struct sharing without manual byte serialization, which existing Rust/CUDA workflows handle awkwardly.
  • Concern raised that Rust’s bounds checks could consume extra GPU registers, reducing kernel occupancy and concurrency.
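The bounds-check concern is concrete: every safe `xs[i]` access compiles to a range check, and on a GPU each check costs instructions and registers per thread. A small CPU-side sketch of the trade-off and Rust's usual escape hatch (`get_unchecked` is real std API; the kernel-occupancy framing is the commenter's claim, not something measured here):

```rust
// Safe indexing: each xs[i] carries an implicit bounds check that
// panics on an out-of-range index.
fn sum_checked(xs: &[f32], idx: &[usize]) -> f32 {
    idx.iter().map(|&i| xs[i]).sum()
}

// Unchecked indexing: the caller promises every index is in range,
// so no check is emitted. This is the escape hatch kernel authors
// would reach for if bounds checks hurt register pressure/occupancy.
fn sum_unchecked(xs: &[f32], idx: &[usize]) -> f32 {
    idx.iter().map(|&i| unsafe { *xs.get_unchecked(i) }).sum()
}

fn main() {
    let xs = [1.0f32, 2.0, 3.0, 4.0];
    let idx = [0usize, 2, 3];
    assert_eq!(sum_checked(&xs, &idx), 8.0);
    assert_eq!(sum_unchecked(&xs, &idx), 8.0);
    println!("both sums agree: {}", sum_checked(&xs, &idx));
}
```

In optimized builds the compiler can often hoist or eliminate checks (e.g. when iterating rather than indexing), so whether the overhead matters on real kernels is exactly the open question the comment raises.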

Notable Comments

  • @foo-bar-baz529: bounds checks may increase register pressure per thread, lowering SIMT concurrency on real kernels.
  • @arpadav: enumerates four concrete safety wins over CUDA C++: drop semantics replacing cudaFree, typed kernel args vs void* pointer arrays, alias prevention via DisjointSlice, and enforced thread-index constructors.
