Even though Numba can automatically transfer NumPy arrays to the device, it can only do so conservatively by always transferring device memory back to the host when a kernel finishes. To avoid the unnecessary transfer for read-only arrays, you can use the following APIs to manually control the transfer:
Copy the Numpy array to device memory. A device array reference is returned which can be passed as argument to a kernel expecting the same kind of array.
If a CUDA stream is given, then the transfer will be made asynchronously as part as the given stream. Otherwise, the transfer is synchronous: the function returns after the copy is finished.
The lifetime of the allocated device memory is managed by Numba. Once it isn’t referred to anymore, it is automatically released.
Device array references have the following methods. These methods are to be called on the host, not on the device.
Copy back contents of the device array to Numpy array on the host. If array is not given, a new array is allocated and returned.
If a CUDA stream is given, then the transfer will be made asynchronously as part as the given stream. Otherwise, the transfer is synchronous: the function returns after the copy is finished.
Example:
import numpy as np
from numba import cuda
arr = np.arange(1000)
d_arr = cuda.to_device(arr)
my_kernel[100, 100](d_arr)
result_array = d_arr.copy_to_host()
Return whether the array is C-contiguous.
Return whether the array is Fortran-contiguous.
Flatten the array without changing its contents, similarly to numpy.ndarray.ravel().
Change the array’s shape without changing its contents, similarly to numpy.ndarray.reshape(). Example:
d_arr = d_arr.reshape(20, 50, order='F')
Local memory is an area of memory private to each thread. Using local memory helps allocate some scratchpad area when scalar local variables are not enough. The memory is allocated once for the duration of the kernel, unlike traditional dynamic memory management.
Allocate a local array of the given shape and type on the device. The array is private to the current thread. An array-like object is returned which can be read and written to like any standard array (e.g. through indexing).