Even though Numba can automatically transfer NumPy arrays to the device, it can only do so conservatively by always transferring device memory back to the host when a kernel finishes. To avoid the unnecessary transfer for read-only arrays, you can use the following APIs to manually control the transfer:
numba.cuda.device_array(shape, dtype=np.float, strides=None, order='C', stream=0)
Allocate an empty device ndarray. Similar to numpy.empty().
numba.cuda.device_array_like(ary, stream=0)
Call cuda.device_array() with information from the array.
numba.cuda.to_device(obj, stream=0, copy=True, to=None)
Allocate and transfer a numpy ndarray or structured scalar to the device.
To copy a numpy array from host to device:
ary = np.arange(10)
d_ary = cuda.to_device(ary)
To enqueue the transfer to a stream:
stream = cuda.stream()
d_ary = cuda.to_device(ary, stream=stream)
The resulting d_ary is a DeviceNDArray.
To copy device->host:
hary = d_ary.copy_to_host()
To copy device->host to an existing array:
ary = np.empty(shape=d_ary.shape, dtype=d_ary.dtype)
d_ary.copy_to_host(ary)
To enqueue the transfer to a stream:
hary = d_ary.copy_to_host(stream=stream)
In addition to device arrays, Numba can consume any object that implements the CUDA Array Interface. These objects can also be manually converted into a Numba device array by creating a view of the GPU buffer using the following APIs:
numba.cuda.as_cuda_array(obj)
Create a DeviceNDArray from any object that implements the cuda-array-interface.
A view of the underlying GPU buffer is created. No copying of the data is done. The resulting DeviceNDArray will acquire a reference from obj.
numba.cuda.is_cuda_array(obj)
Test if the object has defined the __cuda_array_interface__ attribute.
Does not verify the validity of the interface.
Device array references have the following methods. These methods are to be called in host code, not within CUDA-jitted functions.
numba.cuda.cudadrv.devicearray.DeviceNDArray(shape, strides, dtype, stream=0, writeback=None, gpu_data=None)
An on-GPU array type.
copy_to_host(*args, **kws)
Copy self to ary, or create a new NumPy ndarray if ary is None.
If a CUDA stream is given, then the transfer will be made asynchronously as part of the given stream. Otherwise, the transfer is synchronous: the function returns after the copy is finished.
Always returns the host array.
Example:
import numpy as np
from numba import cuda
arr = np.arange(1000)
d_arr = cuda.to_device(arr)
my_kernel[100, 100](d_arr)
result_array = d_arr.copy_to_host()
is_c_contiguous()
Return true if the array is C-contiguous.
is_f_contiguous()
Return true if the array is Fortran-contiguous.
ravel(order='C', stream=0)
Flatten the array without changing its contents, similar to numpy.ndarray.ravel().
reshape(*newshape, **kws)
Reshape the array without changing its contents, similar to numpy.ndarray.reshape(). Example:
d_arr = d_arr.reshape(20, 50, order='F')
Note: DeviceNDArray defines the CUDA Array Interface.
numba.cuda.pinned(*args, **kws)
A context manager for temporarily pinning a sequence of host ndarrays.
numba.cuda.pinned_array(shape, dtype=np.float, strides=None, order='C')
Allocate a np.ndarray with a buffer that is pinned (page-locked). Similar to np.empty().
numba.cuda.stream()
Create a CUDA stream that represents a command queue for the device.
CUDA streams have the following methods:
numba.cuda.cudadrv.driver.Stream(context, handle, finalizer)
auto_synchronize(*args, **kwds)
A context manager that waits for all commands in this stream to execute and commits any pending memory transfers upon exiting the context.
synchronize()
Wait for all commands in this stream to execute. This will commit any pending memory transfers.
Local memory is an area of memory private to each thread. Using local memory helps allocate some scratchpad area when scalar local variables are not enough. The memory is allocated once for the duration of the kernel, unlike traditional dynamic memory management.
numba.cuda.local.array(shape, type)
Allocate a local array of the given shape and type on the device. shape is either an integer or a tuple of integers representing the array’s dimensions and must be a simple constant expression. type is a Numba type of the elements needing to be stored in the array. The array is private to the current thread. An array-like object is returned which can be read and written to like any standard array (e.g. through indexing).
Constant memory is an area of memory that is read-only, cached, and off-chip. It is accessible by all threads and is host-allocated. One way to create an array in constant memory is:
numba.cuda.const.array_like(arr)
Allocate and make accessible an array in constant memory based on array-like arr.
Numba provides an array-like data type that manages data movement to and from the device automatically. It can be used as a drop-in replacement for numpy.ndarray in most cases, and is supported by Numba’s JIT compiler for both the ‘host’ and ‘cuda’ targets.
numba.SmartArray(obj=None, copy=True, shape=None, dtype=None, order=None, where='host')
An array type that supports host and GPU storage.
__init__(obj=None, copy=True, shape=None, dtype=None, order=None, where='host')
Construct a SmartArray in the memory space defined by ‘where’. Valid invocations:
SmartArray(obj=<array-like object>, copy=<optional-true-or-false>):
to create a SmartArray from an existing array-like object. The ‘copy’ argument specifies whether to adopt or to copy it.
SmartArray(shape=<shape>, dtype=<dtype>, order=<order>):
to create a new SmartArray from scratch, given the typical NumPy array attributes.
The optional ‘where’ argument specifies where to allocate the array initially (default: ‘host’).
get(where='host')
Return the representation of ‘self’ in the given memory space.
mark_changed(where='host')
Mark the given location as changed, broadcast updates if needed.
Thus, SmartArray objects may be passed as function arguments to jit-compiled functions. Whenever a cuda.jit-compiled function is executed, it triggers a data transfer to the GPU (unless the data are already there). But instead of transferring the data back to the host after the function completes, it leaves the data on the device and merely updates the host side if there are any external references to it. Thus, if the next operation is another invocation of a cuda.jit-compiled function, the data does not need to be transferred again, making the compound operation more efficient (and making the use of the GPU advantageous even for smaller data sizes).
Deallocation of all CUDA resources is tracked on a per-context basis. When the last reference to a device memory allocation is dropped, the underlying memory is scheduled to be deallocated. The deallocation does not occur immediately; it is added to a queue of pending deallocations. This design has two benefits:
The deallocation queue is flushed automatically as soon as the following events occur:
Sometimes it is desirable to defer resource deallocation until a code section ends. Most often, users want to avoid any implicit synchronization due to deallocation. This can be done by using the following context manager:
numba.cuda.defer_cleanup(*args, **kwds)
Temporarily disable memory deallocation. Use this to prevent resource deallocation from breaking asynchronous execution.
For example:
with defer_cleanup():
    # all cleanup is deferred in here
    do_speed_critical_code()
# cleanup can occur here
Note: this context manager can be nested.