Technical
May 5, 2025
TL;DR
We’ve developed a proof-of-concept GPU library using NVIDIA’s NVRTC (Runtime Compilation for CUDA) to dynamically generate CUDA code from Nim. This approach leverages Nim’s metaprogramming capabilities to create a mini-compiler that transforms Nim’s Abstract Syntax Tree (AST) into a GPU-specific AST, which is then converted into CUDA code. The goal is to evaluate the feasibility of this approach compared to existing solutions like ICICLE, a CUDA-accelerated library.
Initial results are promising: our custom NVRTC implementation is 5.3x faster than ICICLE for a Merkle tree construction using the Poseidon2 permutation.
We plan to open source this. Stay tuned.
By runtime-compiling CUDA code, we gain flexibility, maintainability, and the ability to target multiple platforms (e.g., Metal, OpenCL, Vulkan, ROCm) with minimal changes.
This project explores the feasibility of using NVRTC to build a GPU prover for Valida, comparing it to ICICLE, a well-established CUDA-accelerated library. We’ll walk through the implementation, challenges, and performance results.
Nim to CUDA: A Mini-Compiler
The core of this project is a Nim backend that transforms Nim code into CUDA code at runtime. This is achieved through a mini-compiler that:
1. Takes Nim’s AST as input.
2. Converts it into a GPU-specific AST.
3. Generates CUDA code from the GPU-specific AST.
The implementation is part of the Constantine library. The key file for this transformation is [nim_ast_to_cuda_ast.nim]
For now we compile through Constantine’s 32-bit backend because CUDA kernels ultimately want 32-bit limbs and Nim’s compile-time VM makes on-the-fly 64 → 32-bit slicing a bit clunky. It’s more a convenience choice than a hard blocker: we could just embed the pre-sliced 32-bit constants or implement the conversion with a bit of extra effort. Full 64-bit-limb support is still on the roadmap, but given Constantine already ships an easy 32-bit switch, it hasn’t been a priority.
ICICLE is a CUDA-accelerated library designed for cryptographic primitives. After reading about AIR-ICICLE, a reimplementation of Plonky3 features on top of ICICLE, we decided to evaluate our approach against it. Unfortunately, AIR-ICICLE’s examples don’t currently show significant speedups when run on CUDA devices. This motivated us to implement our own version using NVRTC and compare performance and code complexity.
ICICLE’s Poseidon2 Example: One of ICICLE’s standout examples is their Poseidon2-based Merkle tree construction. We attempted to scale up their example to handle 2²⁹ elements but encountered issues. After reporting the problem, the ICICLE team quickly fixed it.
To ensure a fair comparison, we ported ICICLE’s exact Poseidon2 permutation and made the following adjustments:
1. Switched the field from M31 to BabyBear for both our and ICICLE codes.
2. Aligned leaf inputs: ICICLE’s demo feeds the same randomly sampled value to every leaf, so we mirrored that behavior by hard-coding 123456789 for all leaves in both implementations. This sidesteps RNG-seeding quirks and lets us compare the resulting Merkle roots directly.
3. Adjusted the input size to 2²⁹ elements.
Our NVRTC-based GPU library demonstrates significant performance improvements over ICICLE, achieving a 5.3x speedup for the Poseidon2 Merkle tree construction. This proof of concept highlights the potential of runtime-compiled CUDA code for cryptographic applications.
- Implementing the end-to-end prover.
- Supporting 64-bit limbs.
- Extending the backend to target other platforms (e.g., Metal, OpenCL).
- Further optimizing the code generation process.
If you’re interested in contributing or learning more, check out the Constantine repository and the associated PR.
Technical
May 5, 2025