HSA provides an execution model similar to OpenCL. Instructions are executed in parallel by a group of hardware threads. This is similar in some ways to the single-instruction-multiple-data (SIMD) model, but with the convenience that fine-grained scheduling is hidden from the programmer, rather than being expressed through SIMD vectors as a data structure. In HSA, the code you write is executed by many threads at once (often hundreds or thousands). Your solution is modelled by defining a thread hierarchy of grid, workgroup and workitem.
Numba’s HSA support exposes facilities to declare and manage this hierarchy of threads.
The HSA execution model is similar to CUDA's. The memory model employed by HSA on ROC GPUs is also similar to that of CUDA. ROC GPUs have dedicated on-GPU memory, so functions such as to_device() and copy_to_host() are required, as in CUDA.
Here’s a quick mapping of CUDA terms to HSA:

- A workitem is equivalent to a CUDA thread.
- A workgroup is equivalent to a CUDA thread block.
- A grid is equivalent to a CUDA grid.
- A wavefront is equivalent to a CUDA warp.

A kernel function is a GPU function that is meant to be called from CPU code. It has two fundamental characteristics:

- kernels cannot explicitly return a value; all result data must be written to an array passed to the function;
- kernels explicitly declare their thread hierarchy when called: i.e. the number of workgroups and the number of workitems per workgroup (note that while a kernel is compiled once, it can be called multiple times with different workgroup or grid sizes).
At first sight, writing an HSA kernel with Numba looks very much like writing a JIT function for the CPU:
@roc.jit
def increment_by_one(an_array):
    """
    Increment all array elements by one.
    """
    # code elided here; read further for different implementations
A kernel is typically launched in the following way:
itempergroup = 32
groupperrange = (an_array.size + (itempergroup - 1)) // itempergroup
increment_by_one[groupperrange, itempergroup](an_array)
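To see what these two launch numbers mean, here is a plain-Python sketch of the arithmetic (no GPU or Numba required; the helper name is made up for illustration):

```python
def launch_geometry(array_size, itempergroup):
    # Ceiling division: enough workgroups to cover every element,
    # even when the array size is not a multiple of the workgroup size.
    groupperrange = (array_size + (itempergroup - 1)) // itempergroup
    total_workitems = groupperrange * itempergroup
    return groupperrange, total_workitems

# 100 elements with 32 workitems per workgroup need 4 workgroups,
# launching 128 workitems in total (the last 28 fall past the array end).
print(launch_geometry(100, 32))  # (4, 128)
```

The rounding up is why the boundary check shown later in this document is needed: the launched grid may be slightly larger than the array.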
We notice two steps here:

- Instantiating the kernel proper, by taking the compiled kernel function (here increment_by_one) and indexing it with a tuple of integers.
- Running the kernel, by passing it the input array (and, if necessary, any separate output arrays).

It might seem curious to have a two-level hierarchy when declaring the number of workitems needed by a kernel. The workgroup size (i.e. the number of workitems per workgroup) is often crucial:

- On the software side, the workgroup size determines how many workitems share a given area of shared memory.
- On the hardware side, the workgroup size must be large enough for full occupation of execution units.
To help deal with multi-dimensional arrays, HSA allows you to specify
multi-dimensional workgroups and grids. In the example above, you could
make itempergroup
and groupperrange
tuples of one, two
or three integers. Compared to 1D declarations of equivalent sizes,
this doesn’t change the efficiency or behaviour of the generated
code, but it can help you write your algorithms in a more natural way.
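As an illustration (the array shape here is hypothetical, not from the original text), a 1024x1536 2D array could be covered with 16x16 workgroups; a quick sketch of the per-dimension arithmetic:

```python
import math

width, height = 1024, 1536   # hypothetical 2D array shape
itempergroup = (16, 16)      # workitems per workgroup, per dimension

# Ceiling division per dimension, mirroring the 1D computation above.
groupperrange = (math.ceil(width / itempergroup[0]),
                 math.ceil(height / itempergroup[1]))
print(groupperrange)  # (64, 96)
```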
When running a kernel, the kernel function’s code is executed by every thread once. It therefore needs to know which thread it is, in order to know which array element(s) it is responsible for (complex algorithms may define more complex responsibilities, but the underlying principle is the same).
One way is for the thread to determine its position in the grid and workgroup and manually compute the corresponding array position:
@roc.jit
def increment_by_one(an_array):
    # workitem id in a 1D workgroup
    tx = roc.get_local_id(0)
    # workgroup id in a 1D grid
    ty = roc.get_group_id(0)
    # workgroup size, i.e. number of workitems per workgroup
    bw = roc.get_local_size(0)
    # Compute flattened index inside the array
    pos = tx + ty * bw
    # The above is equivalent to pos = roc.get_global_id(0)
    if pos < an_array.size:  # Check array boundaries
        an_array[pos] += 1
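The index arithmetic in this kernel can be checked without a GPU. This plain-Python sketch enumerates every workitem of a small launch and reproduces pos = tx + ty * bw, the value get_global_id(0) would return:

```python
itempergroup = 4                       # workgroup size (bw)
groupperrange = 3                      # number of workgroups

positions = []
for ty in range(groupperrange):        # workgroup id
    for tx in range(itempergroup):     # workitem id within the workgroup
        pos = tx + ty * itempergroup   # flattened global index
        positions.append(pos)

# Every global index from 0 to 11 is produced exactly once,
# so each workitem owns a distinct array element.
print(positions)
```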
Note
Unless you are sure the total number of workitems (workgroup size times grid size) exactly matches your array size, you must check boundaries as shown above.
get_local_id(), get_local_size(), get_group_id() and get_global_id() are special functions provided by the HSA backend for the sole purpose of knowing the geometry of the thread hierarchy and the position of the current workitem within that geometry.
numba.roc.get_local_id(dim)
    Takes the index of the dimension being queried.
    Returns the local workitem ID in the current workgroup for the given dimension. For a 1D workgroup, the index is an integer spanning the range from 0 inclusive to numba.roc.get_local_size() exclusive.
numba.roc.get_local_size(dim)
    Takes the index of the dimension being queried.
    Returns the size of the workgroup at the given dimension. The value is declared when instantiating the kernel. This value is the same for all workitems in a given kernel, even if they belong to different workgroups (i.e. each workgroup is “full”).
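Because every workgroup is “full”, the last workgroup may contain workitems whose global ID falls past the end of the array, which is why the boundary check shown earlier is necessary. A plain-Python sketch (no GPU required; the sizes are made up for illustration):

```python
array_size = 10
itempergroup = 4                                # get_local_size(0) everywhere
groupperrange = -(-array_size // itempergroup)  # ceil division -> 3 workgroups

# Global IDs that fall past the end of the array and must skip the write.
out_of_bounds = [ty * itempergroup + tx
                 for ty in range(groupperrange)
                 for tx in range(itempergroup)
                 if ty * itempergroup + tx >= array_size]
print(out_of_bounds)  # [10, 11]
```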
numba.roc.get_group_id(dim)
    Takes the index of the dimension being queried.
    Returns the workgroup ID in the grid of workgroups that launched the kernel.
numba.roc.get_global_id(dim)
    Takes the index of the dimension being queried.
    Returns the global workitem ID for the given dimension. Unlike numba.roc.get_local_id(), this number is unique across all workitems in the grid.