3.12. GPU Reduction
Writing a reduction algorithm for a CUDA GPU can be tricky. Numba provides a @reduce decorator that converts a simple binary operation into a reduction kernel.
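As a rough host-side picture of what "reducing with a binary operation" means on parallel hardware, the sketch below combines values pairwise in rounds. It is plain Python, not the kernel Numba generates, and `tree_reduce` is a hypothetical helper introduced only for illustration. Because the grouping of operands differs from a left-to-right loop, the binary operation is expected to be associative.

```python
from functools import reduce as sequential_reduce
from operator import add, sub


def tree_reduce(binop, values):
    """Combine values pairwise in rounds (illustrative model only)."""
    items = list(values)
    if not items:
        raise ValueError("cannot reduce an empty sequence")
    while len(items) > 1:
        # In one round, each pair could be combined by a separate thread.
        paired = [binop(items[i], items[i + 1])
                  for i in range(0, len(items) - 1, 2)]
        if len(items) % 2:
            paired.append(items[-1])  # odd element carries over unchanged
        items = paired
    return items[0]


data = [1, 2, 3, 4, 5]
tree_sum = tree_reduce(add, data)          # addition is associative
tree_diff = tree_reduce(sub, data)         # subtraction is not
loop_diff = sequential_reduce(sub, data)   # left-to-right grouping
```

Here `tree_sum` matches `sum(data)`, while `tree_diff` and `loop_diff` disagree, which is why operations such as subtraction are poor candidates for a parallel reduction.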
3.12.1. @reduce
Example:
import numpy
from numba import cuda

@cuda.reduce
def sum_reduce(a, b):
    return a + b

A = (numpy.arange(1234, dtype=numpy.float64)) + 1
expect = A.sum()      # numpy sum reduction
got = sum_reduce(A)   # cuda sum reduction
assert expect == got
Lambda functions can also be used here:
sum_reduce = cuda.reduce(lambda a, b: a + b)
3.12.2. class Reduce
The reduce decorator creates an instance of the Reduce class. (Currently, reduce is an alias to Reduce, but this behavior is not guaranteed.)

class numba.cuda.Reduce(functor)
__call__(self, arr, size=None, res=None, init=0, stream=0)
Performs a full reduction.
Parameters:
 arr – A host or device array. If a device array is given, the reduction is performed in place and the values in the array are overwritten. If a host array is given, it is copied to the device automatically.
 size – Optional integer specifying the number of elements in arr to reduce. If this parameter is not specified, the entire array is reduced.
 res – Optional device array into which to write the reduction result. The result is written into the first element of this array. If this parameter is specified, the reduction output is not copied from the device to the host.
 init – Optional initial value for the reduction, the type of which must match arr.dtype.
 stream – Optional CUDA stream in which to perform the reduction. If no stream is specified, the default stream of 0 is used.
Returns: If res is specified, None is returned. Otherwise, the result of the reduction is returned.

__init__(self, functor)
Create a reduction object that reduces values using a given binary function. The binary function is compiled once and cached inside this object. Keeping this object alive will prevent recompilation.
Parameters:
 functor – A function to be compiled as a CUDA device function that will be used as the binary operation for reduction on a CUDA device. Internally, it is compiled using cuda.jit(device=True).
