The CUDA JIT is a low-level entry point to the CUDA features in Numba. It translates Python functions into PTX code, which executes on the CUDA hardware. The jit decorator is applied to Python functions written in our Python dialect for CUDA. Numba interacts with the CUDA Driver API to load the PTX onto the CUDA device and execute it.
Most of the public API for the CUDA features is exposed in the numba.cuda module:
from numba import cuda
CUDA kernels and device functions are compiled by decorating a Python function with the jit or autojit decorators.
JIT compile a Python function conforming to the CUDA-Python specification.
To define a CUDA kernel that takes two int 1D-arrays:
@cuda.jit('void(int32[:], int32[:])')
def foo(aryA, aryB):
    ...
Note
A kernel cannot have any return value.
To launch the cuda kernel:
griddim = 1, 2
blockdim = 3, 4
foo[griddim, blockdim](aryA, aryB)
griddim is the number of thread blocks per grid. It can be an int or a tuple of ints, one per grid dimension.
blockdim is the number of threads per block. It can be an int or a tuple of ints, one per block dimension.
The above code is equivalent to the following CUDA-C:
dim3 griddim(1, 2);
dim3 blockdim(3, 4);
foo<<<griddim, blockdim>>>(aryA, aryB);
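Putting the pieces together, here is a minimal end-to-end sketch; the kernel body, array contents, and launch configuration are illustrative choices, not part of the API:
import numpy as np
from numba import cuda

@cuda.jit('void(int32[:], int32[:])')
def copy_kernel(aryA, aryB):
    i = cuda.grid(1)           # global thread index (explained below)
    if i < aryA.shape[0]:      # guard against threads beyond the array end
        aryB[i] = aryA[i]

aryA = np.arange(24, dtype=np.int32)
aryB = np.zeros_like(aryA)

griddim = 3     # 3 blocks of 8 threads cover the 24 elements
blockdim = 8
copy_kernel[griddim, blockdim](aryA, aryB)
The NumPy arrays are transferred to and from the device automatically here; see the data transfer section below.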
To access the compiled PTX code:
print(foo.ptx)
To define a CUDA device function that takes two ints and returns an int:
@cuda.jit('int32(int32, int32)', device=True)
def bar(a, b):
    ...
To force inline the device function:
@cuda.jit('int32(int32, int32)', device=True, inline=True)
def bar_forced_inline(a, b):
    ...
A device function can only be called from inside a kernel or another device function. It cannot be called from the host.
Using bar in a CUDA kernel:
@cuda.jit('void(int32[:], int32[:], int32[:])')
def use_bar(aryA, aryB, aryOut):
    i = cuda.grid(1)  # global position of the thread for a 1D grid.
    aryOut[i] = bar(aryA[i], aryB[i])
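As a sketch, use_bar could be launched from the host like this; the array size and launch configuration are illustrative:
import numpy as np

n = 32
aryA = np.arange(n, dtype=np.int32)
aryB = np.arange(n, dtype=np.int32)
aryOut = np.zeros(n, dtype=np.int32)

# One block of n threads, so each value of cuda.grid(1) maps to exactly one element.
use_bar[1, n](aryA, aryB, aryOut)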
JIT at the call site. A function signature is not needed, as the argument types are captured at call time. Each signature of the kernel is cached for future use.
Note
autojit can only compile CUDA kernels.
Example:
import numpy as np
from numba import cuda

@cuda.autojit
def foo(aryA, aryB):
    ...

griddim = 1
blockdim = 10
aryA = np.arange(10, dtype=np.int32)
aryB = np.arange(10, dtype=np.float32)
foo[griddim, blockdim](aryA, aryB)
In the above code, a version of foo with the signature “void(int32[:], float32[:])” is compiled.
A set of CUDA intrinsics is used to identify the current execution thread. These intrinsics are meaningful inside a CUDA kernel or device function only. A common pattern is to assign the computation of each element in the output array to a thread.
For a 1D grid:
tx = cuda.threadIdx.x
bx = cuda.blockIdx.x
bw = cuda.blockDim.x
i = tx + bx * bw
array[i] = something(i)
For a 2D grid:
tx = cuda.threadIdx.x
ty = cuda.threadIdx.y
bx = cuda.blockIdx.x
by = cuda.blockIdx.y
bw = cuda.blockDim.x
bh = cuda.blockDim.y
x = tx + bx * bw
y = ty + by * bh
array[x, y] = something(x, y)
Since these patterns are so common, there is a shorthand function to produce the same result.
For a 1D grid:
i = cuda.grid(1)
array[i] = something(i)
For a 2D grid:
x, y = cuda.grid(2)
array[x, y] = something(x, y)
Similarly, the total size of the grid in each dimension is frequently used as the stride when looping in a kernel, so there is also a shorthand function for that calculation.
For a 1D grid:
xstride = cuda.gridsize(1) # = cuda.blockDim.x * cuda.gridDim.x
For a 2D grid:
xstride, ystride = cuda.gridsize(2) # = cuda.blockDim.x * cuda.gridDim.x, cuda.blockDim.y * cuda.gridDim.y
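Combining the two shorthands gives the usual grid-stride loop. The kernel below is an illustrative sketch, not part of the API:
@cuda.jit('void(float32[:])')
def scale_by_two(ary):
    start = cuda.grid(1)        # global index of this thread
    stride = cuda.gridsize(1)   # total number of threads in the grid
    for i in range(start, ary.shape[0], stride):
        ary[i] = ary[i] * 2
This pattern lets a fixed-size grid process an array of any length.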
By default, any NumPy array used as an argument to a CUDA kernel is transferred automatically to and from the device. However, to achieve maximum performance and minimize redundant memory transfers, the user should manage the transfers explicitly.
Host->device transfers are asynchronous to the host. Device->host transfers are synchronous to the host. If a non-zero CUDA stream is provided, the transfer becomes asynchronous.
Allocate and transfer a numpy ndarray to the device.
To copy host->device a numpy array:
ary = numpy.arange(10)
d_ary = cuda.to_device(ary)
To enqueue the transfer to a stream:
stream = cuda.stream()
d_ary = cuda.to_device(ary, stream=stream)
The resulting d_ary is a DeviceNDArray.
To copy device->host:
hary = d_ary.copy_to_host()
To copy device->host to an existing array:
ary = numpy.empty(shape=d_ary.shape, dtype=d_ary.dtype)
d_ary.copy_to_host(ary)
To enqueue the transfer to a stream:
hary = d_ary.copy_to_host(stream=stream)
Copy self to ary or create a new numpy ndarray if ary is None.
Always returns the host array.
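For example, when several kernel launches reuse the same data, transferring once and copying back once avoids the per-launch transfers of the automatic behaviour; kernel_one and kernel_two below are placeholder kernels:
d_ary = cuda.to_device(ary)             # single host->device transfer
kernel_one[griddim, blockdim](d_ary)    # kernels operate on the device array in place
kernel_two[griddim, blockdim](d_ary)
result = d_ary.copy_to_host()           # single device->host transfer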
The following are special DeviceNDArray factories:
cuda.device_array(): allocate an empty device ndarray. Similar to numpy.empty().
cuda.pinned_array(): allocate a numpy.ndarray with a buffer that is pinned (pagelocked). Similar to numpy.empty().
cuda.mapped_array(): allocate a mapped ndarray with a buffer that is pinned and mapped onto the device. Similar to numpy.empty().
The lifetime of a device array is bound to the lifetime of the DeviceNDArray instance.
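A brief sketch of how these factories might be called; the keyword arguments follow the numpy.empty() convention, so check the exact signatures for your Numba version:
import numpy as np

d_out = cuda.device_array(shape=(100,), dtype=np.float32)    # device-only array
pinned = cuda.pinned_array(shape=(100,), dtype=np.float32)   # page-locked host array
mapped = cuda.mapped_array(shape=(100,), dtype=np.float32)   # pinned and mapped onto the device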
A CUDA stream is a command queue for the CUDA device. By specifying a stream, the CUDA API calls become asynchronous, meaning that the call may return before the command has been completed. Memory transfers and kernel invocations can take a CUDA stream:
stream = cuda.stream()
devary = cuda.to_device(an_array, stream=stream)
a_cuda_kernel[griddim, blockdim, stream](devary)
devary.copy_to_host(an_array, stream=stream)
# data may not be available in an_array
stream.synchronize()
# data available in an_array
Create a CUDA stream that represents a command queue for the device.
An alternative syntax is available for use as a Python context manager:
stream = cuda.stream()
with stream.auto_synchronize():
    devary = cuda.to_device(an_array, stream=stream)
    a_cuda_kernel[griddim, blockdim, stream](devary)
    devary.copy_to_host(an_array, stream=stream)
# data available in an_array
When the Python with block exits, the stream is automatically synchronized.
We currently support cuda.syncthreads() only. It is the same as __syncthreads() in CUDA-C.
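As a hedged illustration, a typical use of cuda.syncthreads() is to stage data in shared memory (via cuda.shared.array, which is not otherwise covered in this section) and wait for every thread in the block to finish writing before reading; the block size of 32 is an assumption of this sketch:
from numba import cuda, float32

@cuda.jit('void(float32[:], float32[:])')
def reverse_within_block(inp, out):
    # Assumes a launch with blockdim == 32; the shared array shape must be a compile-time constant.
    sm = cuda.shared.array(shape=32, dtype=float32)
    tx = cuda.threadIdx.x
    i = cuda.grid(1)
    sm[tx] = inp[i]
    cuda.syncthreads()   # all threads in the block have written their element
    out[i] = sm[cuda.blockDim.x - 1 - tx]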
Numba supports atomic addition on uint32 and float32 arrays in shared or global memory. These are accessed via the cuda.atomic.add(array, index, value) function:
@cuda.jit('void(uint32[:], uint32[:])')
def fill_bins(indices, bin_counts):
    offset = cuda.grid(1)
    stride = cuda.gridDim.x * cuda.blockDim.x
    for i in range(offset, indices.shape[0], stride):
        cuda.atomic.add(bin_counts, indices[i], 1)
A tuple can be used for the index argument if the array is multidimensional.
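For example, a hedged sketch with a 2D array of counts (names and shapes are illustrative):
@cuda.jit('void(uint32[:], uint32[:], uint32[:,:])')
def fill_bins_2d(rows, cols, counts):
    i = cuda.grid(1)
    if i < rows.shape[0]:
        cuda.atomic.add(counts, (rows[i], cols[i]), 1)   # tuple index into the 2D array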