4.2. CUDA Kernel API

4.2.1. Kernel declaration

The @cuda.jit decorator is used to create a CUDA kernel:

numba.cuda.jit(func_or_sig=None, argtypes=None, device=False, inline=False, bind=True, link=[], debug=False, **kws)

JIT compile a python function conforming to the CUDA Python specification. If a signature is supplied, then a function is returned that takes a function to compile. If

Parameters:
  • func_or_sig (function or numba.typing.Signature) –

    A function to JIT compile, or a signature of a function to compile. If a function is supplied, then an AutoJitCUDAKernel is returned. If a signature is supplied, then a function which takes a function to compile and returns an AutoJitCUDAKernel is returned.

    Note

    A kernel cannot have any return value.

  • device (bool) – Indicates whether this is a device function.
  • bind (bool) – Force binding to CUDA context immediately
  • link (list) – A list of files containing PTX source to link with the function
  • debug – If True, check for exceptions thrown when executing the kernel. Since this degrades performance, this should only be used for debugging purposes.
  • fastmath – If true, enables flush-to-zero and fused-multiply-add, disables precise division and square root. This parameter has no effect on device function, whose fastmath setting depends on the kernel function from which they are called.
class numba.cuda.compiler.AutoJitCUDAKernel(func, bind, targetoptions)

CUDA Kernel object. When called, the kernel object will specialize itself for the given arguments (if no suitable specialized version already exists) and launch on the device associated with the current context.

Kernel objects are not to be constructed by the user, but instead are created using the numba.cuda.jit() decorator.

inspect_asm(signature=None)

Return the generated assembly code for all signatures encountered thus far, or the LLVM IR for a specific signature if given.

inspect_llvm(signature=None)

Return the LLVM IR for all signatures encountered thus far, or the LLVM IR for a specific signature if given.

inspect_types(file=None)

Produce a dump of the Python source of this function annotated with the corresponding Numba IR and type information. The dump is written to file, or sys.stdout if file is None.

specialize(*args)

Compile and bind to the current context a version of this kernel specialized for the given args.

Individual specialized kernels are instances of numba.cuda.compiler.CUDAKernel:

class numba.cuda.compiler.CUDAKernel(llvm_module, name, pretty_name, argtypes, call_helper, link=(), debug=False, fastmath=False, type_annotation=None)

CUDA Kernel specialized for a given set of argument types. When called, this object will validate that the argument types match those for which it is specialized, and then launch the kernel on the device.

bind()

Force binding to current CUDA context

device

Get current active context

inspect_asm()

Returns the PTX code for this kernel.

inspect_llvm()

Returns the LLVM IR for this kernel.

inspect_types(file=None)

Produce a dump of the Python source of this function annotated with the corresponding Numba IR and type information. The dump is written to file, or sys.stdout if file is None.

ptx

PTX code for this kernel.

4.2.2. Intrinsic Attributes and Functions

The remainder of the attributes and functions in this section may only be called from within a CUDA Kernel.

4.2.2.1. Thread Indexing

numba.cuda.threadIdx(cls)

The thread indices in the current thread block, accessed through the attributes x, y, and z. Each index is an integer spanning the range from 0 inclusive to the corresponding value of the attribute in numba.cuda.blockDim exclusive.

numba.cuda.blockIdx(cls)

The block indices in the grid of thread blocks, accessed through the attributes x, y, and z. Each index is an integer spanning the range from 0 inclusive to the corresponding value of the attribute in numba.cuda.gridDim exclusive.

numba.cuda.blockDim(cls)

The shape of a block of threads, as declared when instantiating the kernel. This value is the same for all threads in a given kernel, even if they belong to different blocks (i.e. each block is “full”).

numba.cuda.gridDim(cls)

The shape of the grid of blocks, accressed through the attributes x, y, and z.

numba.cuda.grid(ndim)

Return the absolute position of the current thread in the entire grid of blocks. ndim should correspond to the number of dimensions declared when instantiating the kernel. If ndim is 1, a single integer is returned. If ndim is 2 or 3, a tuple of the given number of integers is returned.

Computation of the first integer is as follows:

cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x

and is similar for the other two indices, but using the y and z attributes.

numba.cuda.gridsize(ndim)

Return the absolute size (or shape) in threads of the entire grid of blocks. ndim should correspond to the number of dimensions declared when instantiating the kernel.

Computation of the first integer is as follows:

cuda.blockDim.x * cuda.gridDim.x

and is similar for the other two indices, but using the y and z attributes.

4.2.2.2. Memory Management

class numba.cuda.shared

Shared memory namespace.

array = <macro shared.array -> <function shared_array at 0x104dd51b8>>

Allocate a shared array of the given shape and type. shape is either an integer or a tuple of integers representing the array’s dimensions. type is a Numba type of the elements needing to be stored in the array.

The returned array-like object can be read and written to like any normal device array (e.g. through indexing).

class numba.cuda.local

Local memory namespace.

array = <macro local.array -> <function local_array at 0x104dd5230>>

Allocate a local array of the given shape and type. The array is private to the current thread, and resides in global memory. An array-like object is returned which can be read and written to like any standard array (e.g. through indexing).

class numba.cuda.const

Constant memory namespace.

array_like = <macro const.array_like -> <function const_array_like at 0x104dd52a8>>

Create a const array from ary. The resulting const array will have the same shape, type, and values as ary.

4.2.2.3. Synchronization and Atomic Operations

class numba.cuda.atomic

Namespace for atomic operations

class add

add(ary, idx, val)

Perform atomic ary[idx] += val. Supported on int32, float32, and float64 operands only.

class atomic.max

max(ary, idx, val)

Perform atomic ary[idx] = max(ary[idx], val). NaN is treated as a missing value, so max(NaN, n) == max(n, NaN) == n. Note that this differs from Python and Numpy behaviour, where max(a, b) is always a when either a or b is a NaN.

Supported on float64 operands only.

numba.cuda.syncthreads(cls)

Synchronize all threads in the same thread block. This function implements the same pattern as barriers in traditional multi-threaded programming: this function waits until all threads in the block call it, at which point it returns control to all its callers.