5.3. Memory management

5.3.1. Data transfer

Even though Numba can automatically transfer NumPy arrays to the device, it can only do so conservatively by always transferring device memory back to the host when a kernel finishes. To avoid the unnecessary transfer for read-only arrays, you can use the following APIs to manually control the transfer:

numba.roc.device_array(shape, dtype=np.float, strides=None, order='C'): Allocate an empty device ndarray. Similar to numpy.empty().

numba.roc.device_array_like(ary): Call roc.device_array() with the shape, dtype and strides information taken from the given array.

numba.roc.to_device(obj, context, copy=True, to=None)

Allocate and transfer a numpy ndarray or structured scalar to the device.

To copy a NumPy array host->device:

import numpy
from numba import roc
ary = numpy.arange(10)
d_ary = roc.to_device(ary)

The resulting d_ary is a DeviceNDArray.

To copy device->host:

hary = d_ary.copy_to_host()

To copy device->host to an existing array:

ary = numpy.empty(shape=d_ary.shape, dtype=d_ary.dtype)
d_ary.copy_to_host(ary)

5.3.1.1. Device arrays

Device array references have the following methods. These methods are to be called in host code, not within ROC-jitted functions.

class numba.roc.hsadrv.devicearray.DeviceNDArray(shape, strides, dtype, dgpu_data=None)

An on-dGPU array type

copy_to_host(self, ary=None, stream=None)

Copy self to ary, or create a new NumPy ndarray if ary is None.

The transfer is synchronous: the function returns after the copy is finished.

Always returns the host array.

Example:

import numpy as np
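from numba import roc

arr = np.arange(1000)
d_arr = roc.to_device(arr)  # explicit host->device copy

# my_kernel is assumed to be a ROC-jitted kernel defined elsewhere
my_kernel[100, 100](d_arr)

result_array = d_arr.copy_to_host()  # explicit device->host copy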

is_c_contiguous(self): Return true if the array is C-contiguous.

is_f_contiguous(self): Return true if the array is Fortran-contiguous.

ravel(self, order='C'): Flatten the array without changing its contents, similar to numpy.ndarray.ravel().

reshape(self, *newshape, **kws)

Reshape the array without changing its contents, similarly to numpy.ndarray.reshape(). Example:

d_arr = d_arr.reshape(20, 50, order='F')
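Because ravel() and reshape() are described as behaving like their NumPy counterparts, the effect of the order argument can be previewed on the host with plain NumPy. The following is a host-side analogy only, not ROC device code: the helper name layout_demo is made up for illustration, and it assumes DeviceNDArray follows NumPy's layout semantics as stated above.

```python
import numpy as np


def layout_demo(n_rows=20, n_cols=50):
    """Host-side NumPy analogy (hypothetical helper, not part of numba.roc)."""
    arr = np.arange(n_rows * n_cols)

    # Row-major (C) reshape: consecutive elements fill a row first.
    c_arr = arr.reshape(n_rows, n_cols)
    # Column-major (Fortran) reshape: consecutive elements fill a column first.
    f_arr = arr.reshape(n_rows, n_cols, order='F')

    return {
        "c_is_c_contiguous": bool(c_arr.flags['C_CONTIGUOUS']),
        "f_is_f_contiguous": bool(f_arr.flags['F_CONTIGUOUS']),
        "f_is_c_contiguous": bool(f_arr.flags['C_CONTIGUOUS']),
        # Same flat data, different index mapping:
        "c_elem_0_1": int(c_arr[0, 1]),
        "f_elem_0_1": int(f_arr[0, 1]),
        # Flattening in Fortran order recovers the original flat sequence.
        "f_ravel_F_roundtrips": bool(np.array_equal(f_arr.ravel(order='F'), arr)),
    }


if __name__ == "__main__":
    for key, value in layout_demo().items():
        print(key, value)
```

With order='F', element [0, 1] is the first element of the second column, so it holds the value n_rows rather than 1. The data is unchanged; only the mapping from indices to elements differs.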

5.3.1.2. Data Registration

The CPU and GPU do not share the same main memory. However, it is recommended to register a memory allocation with the HSA runtime as a performance optimisation hint.

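roc.register(*arrays): Register every given array. The function can be used in a with-context for automatic deregistration: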

array_a = numpy.arange(10)
array_b = numpy.arange(10)
with roc.register(array_a, array_b):
    some_hsa_code(array_a, array_b)

roc.deregister(*arrays): Deregister every given array.
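The with-context form pairs each registration with a matching deregistration. The following is a minimal pure-Python sketch of that pairing, not Numba's implementation: every name in it (toy_register, toy_deregister, toy_registered, is_registered) is hypothetical, and the set merely stands in for the runtime's bookkeeping. It also assumes that deregistration should happen even when the body raises, which the text above does not state.

```python
from contextlib import contextmanager

# Hypothetical stand-in for the runtime's bookkeeping; not part of numba.roc.
_registered = set()


def toy_register(*arrays):
    """Record every given object as registered."""
    for a in arrays:
        _registered.add(id(a))


def toy_deregister(*arrays):
    """Forget every given object."""
    for a in arrays:
        _registered.discard(id(a))


def is_registered(a):
    return id(a) in _registered


@contextmanager
def toy_registered(*arrays):
    """Register on entry, deregister on exit (assumed: also on error)."""
    toy_register(*arrays)
    try:
        yield
    finally:
        toy_deregister(*arrays)


if __name__ == "__main__":
    array_a = list(range(10))
    array_b = list(range(10))
    with toy_registered(array_a, array_b):
        print(is_registered(array_a), is_registered(array_b))
    print(is_registered(array_a), is_registered(array_b))
```

The explicit equivalent is a register call followed by the work in a try block and a deregister call in the finally clause. The context manager only removes the risk of forgetting the second call.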

5.3.2. Streams

numba.roc.stream()

ROC streams have the following methods:

class numba.roc.hsadrv.driver.Stream

An asynchronous stream for async API

auto_synchronize(self): A context manager that waits for all commands in this stream to execute and commits any pending memory transfers upon exiting the context.

synchronize(self): Synchronize the stream.

5.3.3. Shared memory and thread synchronization

A limited amount of shared memory can be allocated on the device to speed up access to data, when necessary. That memory will be shared (i.e. both readable and writable) amongst all workitems belonging to a given group and has faster access times than regular device memory. It also allows workitems to cooperate on a given solution. You can think of it as a manually-managed data cache.

The memory is allocated once for the duration of the kernel, unlike traditional dynamic memory management.

numba.roc.shared.array(shape, type)

Allocate a shared array of the given shape and type on the device. This function must be called on the device (i.e. from a kernel or device function). shape is either an integer or a tuple of integers representing the array’s dimensions. type is the Numba type of the elements to be stored in the array.

The returned array-like object can be read and written to like any normal device array (e.g. through indexing).

A common pattern is to have each workitem populate one element in the shared array and then wait for all workitems to finish using numba.roc.barrier().

numba.roc.barrier(scope): The scope argument specifies the level of synchronization. Set scope to roc.CLK_GLOBAL_MEM_FENCE or roc.CLK_LOCAL_MEM_FENCE to synchronize all workitems across a workgroup when accessing global memory or local memory, respectively.
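The populate-then-synchronize pattern can be tried out on the CPU before writing a kernel. The following is a simulation under stated assumptions, not ROC device code: each Python thread plays one workitem of a single workgroup, a plain list stands in for the array returned by roc.shared.array(), and threading.Barrier stands in for roc.barrier(). The function name reverse_in_group and the constant GROUP_SIZE are made up for illustration.

```python
import threading

GROUP_SIZE = 8  # hypothetical workgroup size for the simulation


def reverse_in_group(data):
    """CPU simulation of one workgroup reversing GROUP_SIZE elements."""
    shared = [None] * GROUP_SIZE              # stands in for the shared array
    out = [None] * GROUP_SIZE
    barrier = threading.Barrier(GROUP_SIZE)   # stands in for roc.barrier(...)

    def workitem(tid):
        # 1. Each workitem populates exactly one element of the shared array.
        shared[tid] = data[tid]
        # 2. Wait until every workitem in the group has written its element.
        barrier.wait()
        # 3. Only now is it safe to read an element written by another workitem.
        out[tid] = shared[GROUP_SIZE - 1 - tid]

    threads = [threading.Thread(target=workitem, args=(tid,))
               for tid in range(GROUP_SIZE)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out


if __name__ == "__main__":
    print(reverse_in_group(list(range(GROUP_SIZE))))
```

Without the wait in step 2, a workitem could read a slot that its peer has not written yet. That is the same hazard the barrier guards against in a real kernel.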

numba.roc.wavebarrier(): Creates an execution barrier across a wavefront to force a synchronization point.