4.2. CUDA Kernel API¶
4.2.1. Kernel declaration¶
The @cuda.jit
decorator is used to create a CUDA kernel:
-
numba.cuda.
jit
(func_or_sig=None, argtypes=None, device=False, inline=False, bind=True, link=[], debug=False, **kws)¶ JIT compile a python function conforming to the CUDA Python specification. If a signature is supplied, then a function is returned that takes a function to compile. If
Parameters: - func_or_sig (function or numba.typing.Signature) –
A function to JIT compile, or a signature of a function to compile. If a function is supplied, then an
AutoJitCUDAKernel
is returned. If a signature is supplied, then a function which takes a function to compile and returns anAutoJitCUDAKernel
is returned.Note
A kernel cannot have any return value.
- device (bool) – Indicates whether this is a device function.
- bind (bool) – Force binding to CUDA context immediately
- link (list) – A list of files containing PTX source to link with the function
- debug – If True, check for exceptions thrown when executing the kernel. Since this degrades performance, this should only be used for debugging purposes.
- fastmath – If true, enables flush-to-zero and fused-multiply-add, disables precise division and square root. This parameter has no effect on device function, whose fastmath setting depends on the kernel function from which they are called.
- func_or_sig (function or numba.typing.Signature) –
-
class
numba.cuda.compiler.
AutoJitCUDAKernel
(func, bind, targetoptions)¶ CUDA Kernel object. When called, the kernel object will specialize itself for the given arguments (if no suitable specialized version already exists) and launch on the device associated with the current context.
Kernel objects are not to be constructed by the user, but instead are created using the
numba.cuda.jit()
decorator.-
inspect_asm
(signature=None)¶ Return the generated assembly code for all signatures encountered thus far, or the LLVM IR for a specific signature if given.
-
inspect_llvm
(signature=None)¶ Return the LLVM IR for all signatures encountered thus far, or the LLVM IR for a specific signature if given.
-
inspect_types
(file=None)¶ Produce a dump of the Python source of this function annotated with the corresponding Numba IR and type information. The dump is written to file, or sys.stdout if file is None.
-
specialize
(*args)¶ Compile and bind to the current context a version of this kernel specialized for the given args.
-
Individual specialized kernels are instances of
numba.cuda.compiler.CUDAKernel
:
-
class
numba.cuda.compiler.
CUDAKernel
(llvm_module, name, pretty_name, argtypes, call_helper, link=(), debug=False, fastmath=False, type_annotation=None)¶ CUDA Kernel specialized for a given set of argument types. When called, this object will validate that the argument types match those for which it is specialized, and then launch the kernel on the device.
-
bind
()¶ Force binding to current CUDA context
-
device
¶ Get current active context
-
inspect_asm
()¶ Returns the PTX code for this kernel.
-
inspect_llvm
()¶ Returns the LLVM IR for this kernel.
-
inspect_types
(file=None)¶ Produce a dump of the Python source of this function annotated with the corresponding Numba IR and type information. The dump is written to file, or sys.stdout if file is None.
-
ptx
¶ PTX code for this kernel.
-
4.2.2. Intrinsic Attributes and Functions¶
The remainder of the attributes and functions in this section may only be called from within a CUDA Kernel.
4.2.2.1. Thread Indexing¶
-
numba.cuda.
threadIdx
(cls)¶ The thread indices in the current thread block, accessed through the attributes
x
,y
, andz
. Each index is an integer spanning the range from 0 inclusive to the corresponding value of the attribute innumba.cuda.blockDim
exclusive.
-
numba.cuda.
blockIdx
(cls)¶ The block indices in the grid of thread blocks, accessed through the attributes
x
,y
, andz
. Each index is an integer spanning the range from 0 inclusive to the corresponding value of the attribute innumba.cuda.gridDim
exclusive.
-
numba.cuda.
blockDim
(cls)¶ The shape of a block of threads, as declared when instantiating the kernel. This value is the same for all threads in a given kernel, even if they belong to different blocks (i.e. each block is “full”).
-
numba.cuda.
gridDim
(cls)¶ The shape of the grid of blocks, accressed through the attributes
x
,y
, andz
.
-
numba.cuda.
grid
(ndim)¶ Return the absolute position of the current thread in the entire grid of blocks. ndim should correspond to the number of dimensions declared when instantiating the kernel. If ndim is 1, a single integer is returned. If ndim is 2 or 3, a tuple of the given number of integers is returned.
Computation of the first integer is as follows:
cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
and is similar for the other two indices, but using the
y
andz
attributes.
-
numba.cuda.
gridsize
(ndim)¶ Return the absolute size (or shape) in threads of the entire grid of blocks. ndim should correspond to the number of dimensions declared when instantiating the kernel.
Computation of the first integer is as follows:
cuda.blockDim.x * cuda.gridDim.x
and is similar for the other two indices, but using the
y
andz
attributes.
4.2.2.2. Memory Management¶
Shared memory namespace.
Allocate a shared array of the given shape and type. shape is either an integer or a tuple of integers representing the array’s dimensions. type is a Numba type of the elements needing to be stored in the array.
The returned array-like object can be read and written to like any normal device array (e.g. through indexing).
-
class
numba.cuda.
local
¶ Local memory namespace.
-
array
= <macro local.array -> <function local_array at 0x1079627b8>>¶ Allocate a local array of the given shape and type. The array is private to the current thread, and resides in global memory. An array-like object is returned which can be read and written to like any standard array (e.g. through indexing).
-
4.2.2.3. Synchronization and Atomic Operations¶
-
class
numba.cuda.
atomic
¶ Namespace for atomic operations
-
class
add
(ary, idx, val)¶ Perform atomic ary[idx] += val. Supported on int32, float32, and float64 operands only.
-
class
atomic.
max
(ary, idx, val)¶ Perform atomic ary[idx] = max(ary[idx], val). NaN is treated as a missing value, so max(NaN, n) == max(n, NaN) == n. Note that this differs from Python and Numpy behaviour, where max(a, b) is always a when either a or b is a NaN.
Supported on float64 operands only.
-
class
-
numba.cuda.
syncthreads
(cls)¶ Synchronize all threads in the same thread block. This function implements the same pattern as barriers in traditional multi-threaded programming: this function waits until all threads in the block call it, at which point it returns control to all its callers.