The @cuda.jit decorator is used to create a CUDA kernel:

numba.cuda.jit(func_or_sig=None, argtypes=None, device=False, inline=False, bind=True, link=[], debug=None, **kws)
JIT compile a python function conforming to the CUDA Python specification. If a signature is supplied, then a function is returned that takes a function to compile.
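For example, a minimal kernel sketch (the function name, array, and launch configuration below are illustrative, not part of the API):

    from numba import cuda
    import numpy as np

    @cuda.jit
    def increment_by_one(an_array):
        pos = cuda.grid(1)           # absolute index of this thread
        if pos < an_array.size:      # guard: the grid may be larger than the array
            an_array[pos] += 1

    arr = np.arange(10, dtype=np.float32)
    increment_by_one[2, 8](arr)      # launch 2 blocks of 8 threads each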
numba.cuda.compiler.AutoJitCUDAKernel(func, bind, targetoptions)
CUDA Kernel object. When called, the kernel object will specialize itself for the given arguments (if no suitable specialized version already exists) and compute capability, and launch on the device associated with the current context.
Kernel objects are not to be constructed by the user, but instead are created using the numba.cuda.jit() decorator.
extensions
A list of objects that must have a prepare_args function. When a specialized kernel is called, each argument will be passed through to the prepare_args (from the last object in this list to the first). The arguments to prepare_args are:

- ty: the numba type of the argument
- val: the argument value itself

The prepare_args function must return a tuple (ty, val), which will be passed in turn to the next right-most extension. After all the extensions have been called, the resulting (ty, val) will be passed into Numba's default argument marshalling logic.
inspect_asm(self, signature=None, compute_capability=None)
Return the generated assembly code for all signatures encountered thus far, or the assembly code for a specific signature and compute_capability if given.
inspect_llvm(self, signature=None, compute_capability=None)
Return the LLVM IR for all signatures encountered thus far, or the LLVM IR for a specific signature and compute_capability if given.

inspect_types(self, file=None)
Produce a dump of the Python source of this function annotated with the corresponding Numba IR and type information. The dump is written to file, or sys.stdout if file is None.

specialize(self, *args)
Compile and bind to the current context a version of this kernel specialized for the given args.
Individual specialized kernels are instances of numba.cuda.compiler.CUDAKernel:

numba.cuda.compiler.CUDAKernel(llvm_module, name, pretty_name, argtypes, call_helper, link=(), debug=False, fastmath=False, type_annotation=None, extensions=[], max_registers=None)
CUDA Kernel specialized for a given set of argument types. When called, this object will validate that the argument types match those for which it is specialized, and then launch the kernel on the device.
bind(self)
Force binding to the current CUDA context.

device
Get the current active context.
inspect_asm(self)
Returns the PTX code for this kernel.

inspect_llvm(self)
Returns the LLVM IR for this kernel.

inspect_types(self, file=None)
Produce a dump of the Python source of this function annotated with the corresponding Numba IR and type information. The dump is written to file, or sys.stdout if file is None.

ptx
PTX code for this kernel.
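As a sketch of how the inspection hooks might be used (assuming the increment_by_one kernel above has been launched at least once, so a specialization exists):

    # Dump the annotated Python source, Numba IR and types to sys.stdout
    increment_by_one.inspect_types()

    # With no arguments, inspect_asm()/inspect_llvm() cover every signature
    # encountered so far; print whatever was generated.
    print(increment_by_one.inspect_asm())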
The remainder of the attributes and functions in this section may only be called from within a CUDA Kernel.
numba.cuda.threadIdx
The thread indices in the current thread block, accessed through the attributes x, y, and z. Each index is an integer spanning the range from 0 inclusive to the corresponding value of the attribute in numba.cuda.blockDim exclusive.
numba.cuda.blockIdx
The block indices in the grid of thread blocks, accessed through the attributes x, y, and z. Each index is an integer spanning the range from 0 inclusive to the corresponding value of the attribute in numba.cuda.gridDim exclusive.
numba.cuda.blockDim
The shape of a block of threads, as declared when instantiating the kernel. This value is the same for all threads in a given kernel, even if they belong to different blocks (i.e. each block is "full").
numba.cuda.gridDim
The shape of the grid of blocks, accessed through the attributes x, y, and z.
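These attributes combine to give each thread a unique global index; a sketch of the common 1D pattern (kernel and argument names are illustrative):

    from numba import cuda

    @cuda.jit
    def copy_kernel(dst, src):
        # global index = index within the block, offset by the block's position
        i = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
        if i < dst.size:
            dst[i] = src[i]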
numba.cuda.laneid
The thread index in the current warp, as an integer spanning the range from 0 inclusive to the numba.cuda.warpsize exclusive.
numba.cuda.warpsize
The size in threads of a warp on the GPU. Currently this is always 32.
numba.cuda.grid(ndim)
Return the absolute position of the current thread in the entire grid of blocks. ndim should correspond to the number of dimensions declared when instantiating the kernel. If ndim is 1, a single integer is returned. If ndim is 2 or 3, a tuple of the given number of integers is returned.

Computation of the first integer is as follows:

    cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x

and is similar for the other two indices, but using the y and z attributes.
numba.cuda.gridsize(ndim)
Return the absolute size (or shape) in threads of the entire grid of blocks. ndim should correspond to the number of dimensions declared when instantiating the kernel.

Computation of the first integer is as follows:

    cuda.blockDim.x * cuda.gridDim.x

and is similar for the other two indices, but using the y and z attributes.
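Together, grid() and gridsize() support the common grid-stride loop pattern, sketched here for a hypothetical elementwise kernel:

    from numba import cuda

    @cuda.jit
    def scale(out, x, factor):
        start = cuda.grid(1)        # this thread's starting index
        stride = cuda.gridsize(1)   # total number of threads in the grid
        # Grid-stride loop: correct even when x.size exceeds the grid size
        for i in range(start, x.size, stride):
            out[i] = x[i] * factor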
numba.cuda.shared.array(shape, dtype)
Creates an array in the shared memory space of the CUDA kernel with the given shape and dtype.

Returns an array with its content uninitialized.

Note
All threads in the same thread block see the same array.
numba.cuda.local.array(shape, dtype)
Creates an array in the local memory space of the CUDA kernel with the given shape and dtype.

Returns an array with its content uninitialized.

Note
Each thread sees a unique array.
numba.cuda.const.array_like(ary)
Copies the ary into constant memory space on the CUDA kernel at compile time.

Returns an array like the ary argument.

Note
All threads and blocks see the same array.
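As an illustration of the shared memory space, here is a sketch of a per-block sum reduction; it assumes the kernel is launched with exactly 128 threads per block (shared array shapes must be compile-time constants):

    from numba import cuda, float32

    @cuda.jit
    def block_sum(partials, x):
        buf = cuda.shared.array(128, dtype=float32)  # one buffer per block
        tid = cuda.threadIdx.x
        i = cuda.grid(1)
        buf[tid] = x[i] if i < x.size else 0.0
        cuda.syncthreads()
        # Tree reduction within the block
        s = 64
        while s > 0:
            if tid < s:
                buf[tid] += buf[tid + s]
            cuda.syncthreads()
            s //= 2
        if tid == 0:
            partials[cuda.blockIdx.x] = buf[0]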
numba.cuda.atomic.add(array, idx, value)
Perform array[idx] += value. Supports int32, int64, float32 and float64 only. The idx argument can be an integer or a tuple of integer indices for indexing into multi-dimensional arrays. The number of elements in idx must match the number of dimensions of array.

Returns the value of array[idx] before storing the new value. Behaves like an atomic load.
numba.cuda.atomic.max(array, idx, value)
Perform array[idx] = max(array[idx], value). Supports int32, int64, float32 and float64 only. The idx argument can be an integer or a tuple of integer indices for indexing into multi-dimensional arrays. The number of elements in idx must match the number of dimensions of array.

Returns the value of array[idx] before storing the new value. Behaves like an atomic load.
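For instance, a sketch of finding a global maximum with atomics (result is assumed to be a one-element array pre-initialized by the caller):

    from numba import cuda

    @cuda.jit
    def global_max(result, values):
        i = cuda.grid(1)
        if i < values.size:
            # atomically fold this thread's element into result[0]
            cuda.atomic.max(result, 0, values[i])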
numba.cuda.syncthreads()
Synchronize all threads in the same thread block. This function implements the same pattern as barriers in traditional multi-threaded programming: this function waits until all threads in the block call it, at which point it returns control to all its callers.

numba.cuda.syncthreads_count(predicate)
An extension to numba.cuda.syncthreads where the return value is a count of the threads where predicate is true.

numba.cuda.syncthreads_and(predicate)
An extension to numba.cuda.syncthreads where 1 is returned if predicate is true for all threads or 0 otherwise.

numba.cuda.syncthreads_or(predicate)
An extension to numba.cuda.syncthreads where 1 is returned if predicate is true for any thread or 0 otherwise.
Warning
All syncthreads functions must be called by every thread in the thread-block. Failing to do so may result in undefined behavior.
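A sketch using syncthreads_count to count, per block, how many elements are positive (names are illustrative):

    from numba import cuda

    @cuda.jit
    def count_positive(counts, x):
        i = cuda.grid(1)
        pred = i < x.size and x[i] > 0.0
        # every thread in the block must reach this call (see the warning above)
        n = cuda.syncthreads_count(pred)
        if cuda.threadIdx.x == 0:
            counts[cuda.blockIdx.x] = n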
Memory fences are used to guarantee that the effects of memory operations are visible to other threads within the same thread-block, the same GPU device, and the same system (across GPUs on global memory). Memory loads and stores are guaranteed not to be moved across the memory fences by optimization passes.

Warning
The memory fences are considered an advanced API and most use cases should use the thread barrier (e.g. syncthreads()).
numba.cuda.threadfence()
A memory fence at device level (within the GPU).

numba.cuda.threadfence_block()
A memory fence at thread block level.

numba.cuda.threadfence_system()
A memory fence at system level (across GPUs).
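A classic producer-side sketch: the fence ensures the data store is visible device-wide before the flag store (both arrays are assumed to live in global memory; a complete protocol would also need care on the consumer side):

    from numba import cuda

    @cuda.jit
    def produce(data, flag):
        data[0] = 42
        cuda.threadfence()  # make the store to data visible across the device...
        flag[0] = 1         # ...before the flag is raised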
All warp level operations require at least CUDA 9. The argument membermask is a 32 bit integer mask with each bit corresponding to a thread in the warp, with 1 meaning the thread is in the subset of threads within the function call. The membermask must be all 1 if the GPU compute capability is below 7.x.
numba.cuda.syncwarp(membermask)
Synchronize a masked subset of the threads in a warp.

numba.cuda.all_sync(membermask, predicate)
If the predicate is true for all threads in the masked warp, then a non-zero value is returned, otherwise 0 is returned.

numba.cuda.any_sync(membermask, predicate)
If the predicate is true for any thread in the masked warp, then a non-zero value is returned, otherwise 0 is returned.

numba.cuda.eq_sync(membermask, predicate)
If the boolean predicate is the same for all threads in the masked warp, then a non-zero value is returned, otherwise 0 is returned.

numba.cuda.ballot_sync(membermask, predicate)
Returns a mask of all threads in the warp whose predicate is true, and are within the given mask.
numba.cuda.shfl_sync(membermask, value, src_lane)
Shuffles value across the masked warp and returns the value from src_lane. If this is outside the warp, then the given value is returned.

numba.cuda.shfl_up_sync(membermask, value, delta)
Shuffles value across the masked warp and returns the value from laneid - delta. If this is outside the warp, then the given value is returned.

numba.cuda.shfl_down_sync(membermask, value, delta)
Shuffles value across the masked warp and returns the value from laneid + delta. If this is outside the warp, then the given value is returned.

numba.cuda.shfl_xor_sync(membermask, value, lane_mask)
Shuffles value across the masked warp and returns the value from laneid ^ lane_mask.
numba.cuda.match_any_sync(membermask, value, lane_mask)
Returns a mask of threads that have the same value as the given value from within the masked warp.

numba.cuda.match_all_sync(membermask, value, lane_mask)
Returns a tuple of (mask, pred), where mask is a mask of threads that have the same value as the given value from within the masked warp, if they all have the same value; otherwise it is 0. And pred is a boolean of whether or not all threads in the masked warp have the same value.
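As an illustration, a warp-level sum reduction sketch using shfl_down_sync; it assumes x.size is a multiple of the warp size, so every lane holds a value:

    from numba import cuda

    @cuda.jit
    def warp_sums(out, x):
        i = cuda.grid(1)
        val = x[i]
        # Tree reduction across the warp; 0xffffffff selects all 32 lanes
        offset = 16
        while offset > 0:
            val += cuda.shfl_down_sync(0xffffffff, val, offset)
            offset //= 2
        if cuda.laneid == 0:
            out[i // 32] = val  # lane 0 holds the sum for its warp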
A subset of the CUDA Math API’s integer intrinsics are available. For further documentation, including semantics, please refer to the CUDA Toolkit documentation.
numba.cuda.popc()
Returns the number of set bits in the given value.

numba.cuda.brev()
Reverses the bit pattern of an integer value; for example, 0b10110110 becomes 0b01101101.

numba.cuda.clz()
Counts the number of leading zeros in a value.

numba.cuda.ffs()
Find the position of the least significant bit set to 1 in an integer.
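For example, a sketch of an elementwise population count over an integer array:

    from numba import cuda

    @cuda.jit
    def popcount(out, x):
        i = cuda.grid(1)
        if i < x.size:
            out[i] = cuda.popc(x[i])  # number of set bits in x[i]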
A subset of the CUDA Math API’s floating point intrinsics are available. For further documentation, including semantics, please refer to the single and double precision parts of the CUDA Toolkit documentation.
numba.cuda.fma()
Perform the fused multiply-add operation. Named after the fma and fmaf in the C API, but maps to the fma.rn.f32 and fma.rn.f64 (round-to-nearest-even) PTX instructions.
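A sketch of an elementwise fused multiply-add:

    from numba import cuda

    @cuda.jit
    def fma_elementwise(out, a, b, c):
        i = cuda.grid(1)
        if i < out.size:
            # one instruction, one rounding step: a[i] * b[i] + c[i]
            out[i] = cuda.fma(a[i], b[i], c[i])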
A subset of CUDA's control flow instructions are directly available as intrinsics. Avoiding branches is a key way to improve CUDA performance, and using these intrinsics means you don't have to rely on the nvcc optimizer identifying and removing branches. For further documentation, including semantics, please refer to the relevant CUDA Toolkit documentation.
numba.cuda.selp()
Select between two expressions, depending on the value of the first argument. Similar to LLVM's select instruction.
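For instance, a branch-free clamp sketch (upper is a hypothetical scalar argument):

    from numba import cuda

    @cuda.jit
    def clamp_upper(out, x, upper):
        i = cuda.grid(1)
        if i < x.size:
            # selects upper when the condition holds, else x[i], without branching
            out[i] = cuda.selp(x[i] > upper, upper, x[i])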