5.7. ROC Ufuncs and Generalized Ufuncs

This page describes the ROC ufunc-like object.

To support the programming pattern of ROC programs, ROC Vectorize and GUVectorize cannot produce a conventional ufunc. Instead, they return a ufunc-like object. This object is a close analog of a regular NumPy ufunc but is not fully compatible with one. The ROC ufunc adds support for passing device arrays (arrays already resident on the GPU) to reduce traffic over the PCI Express bus. It also accepts a stream keyword for launching in asynchronous mode.

5.7.1. Basic ROC UFunc Example

import math
from numba import vectorize
import numpy as np

@vectorize(['float32(float32, float32, float32)',
            'float64(float64, float64, float64)'],
           target='roc')
def roc_discriminant(a, b, c):
    return math.sqrt(b ** 2 - 4 * a * c)

N = 10000
dtype = np.float32

# prepare the input
A = np.array(np.random.sample(N), dtype=dtype)
B = np.array(np.random.sample(N) + 10, dtype=dtype)
C = np.array(np.random.sample(N), dtype=dtype)

D = roc_discriminant(A, B, C)

print(D)  # print result
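
The ROC target needs supported AMD hardware, so a CPU reference to compare against can be handy. The following is a minimal sketch using only NumPy; `discriminant_reference` is a hypothetical helper name, not part of Numba. It also illustrates why the example adds 10 to B: with a and c drawn from [0, 1) and b from [10, 11), the quantity b ** 2 - 4 * a * c is mathematically greater than 96, so the square root never receives a negative argument.

```python
import numpy as np

def discriminant_reference(a, b, c):
    # Hypothetical CPU reference: same formula as roc_discriminant,
    # evaluated elementwise with NumPy instead of on the GPU.
    return np.sqrt(b ** 2 - 4 * a * c)

rng = np.random.default_rng(0)
N = 10000
dtype = np.float32

# same input ranges as the example above
A = rng.random(N).astype(dtype)
B = (rng.random(N) + 10).astype(dtype)
C = rng.random(N).astype(dtype)

# the value under the square root; it should be positive everywhere
radicand = B ** 2 - 4 * A * C
expected = discriminant_reference(A, B, C)

print(radicand.min() > 0)
print(expected.dtype)
```

On a machine with a working ROC setup, one way to compare would be `np.allclose(D, discriminant_reference(A, B, C))` with a tolerance suited to float32; this has not been checked against real hardware here.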

5.7.2. Calling Device Functions from ROC UFuncs

ROC ufunc kernels can call other ROC device functions:

from numba import vectorize, roc

# define a device function
@roc.jit('float32(float32, float32, float32)', device=True)
def roc_device_fn(x, y, z):
    return x ** y / z

# define a ufunc that calls our device function
@vectorize(['float32(float32, float32, float32)'], target='roc')
def roc_ufunc(x, y, z):
    return roc_device_fn(x, y, z)

5.7.3. Generalized ROC ufuncs

Generalized ufuncs can be executed on the GPU using ROC, in the same way as ROC ufuncs. For example:

from numba import guvectorize

@guvectorize(['void(float32[:,:], float32[:,:], float32[:,:])'],
             '(m,n),(n,p)->(m,p)', target='roc')
def matmulcore(A, B, C):
    ...
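
The body of `matmulcore` is elided above. As a hedged illustration of what a core function matching the layout `(m,n),(n,p)->(m,p)` computes, here is a plain NumPy sketch that runs on the CPU without Numba or the ROC target. The name `matmulcore_reference` and the explicit loops are assumptions made for illustration, not the original body. Note the gufunc convention that the output array is the last argument and is written in place rather than returned.

```python
import numpy as np

def matmulcore_reference(A, B, C):
    # Hypothetical CPU stand-in for a gufunc core.
    # A is (m, n), B is (n, p); the result is written into C, which is (m, p).
    m, n = A.shape
    n, p = B.shape
    for i in range(m):
        for j in range(p):
            acc = 0.0
            for k in range(n):
                acc += A[i, k] * B[k, j]
            C[i, j] = acc

A = np.arange(6, dtype=np.float32).reshape(2, 3)
B = np.arange(12, dtype=np.float32).reshape(3, 4)
C = np.zeros((2, 4), dtype=np.float32)  # preallocated output

matmulcore_reference(A, B, C)

# compare against NumPy's own matrix product
print(np.allclose(C, A @ B))
```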

5.7.4. Async execution: A Chunk at a Time

Partitioning your data into chunks allows computation and memory transfer to be overlapped. This can increase the throughput of your ufunc and enables your ufunc to operate on data that is larger than the memory capacity of your GPU. For example:

import math
from numba import vectorize, roc
import numpy as np

# the ufunc kernel
def discriminant(a, b, c):
    return math.sqrt(b ** 2 - 4 * a * c)

roc_discriminant = vectorize(['float32(float32, float32, float32)'],
                            target='roc')(discriminant)

N = int(1e+8)
dtype = np.float32

# prepare the input
A = np.array(np.random.sample(N), dtype=dtype)
B = np.array(np.random.sample(N) + 10, dtype=dtype)
C = np.array(np.random.sample(N), dtype=dtype)
D = np.zeros(A.shape, dtype=A.dtype)

# create a ROC stream
stream = roc.stream()

chunksize = int(1e+6)  # integer, so that chunkcount is an integer too
chunkcount = N // chunksize

# partition numpy arrays into chunks
# no copying is performed
sA = np.split(A, chunkcount)
sB = np.split(B, chunkcount)
sC = np.split(C, chunkcount)
sD = np.split(D, chunkcount)

device_ptrs = []

# helper function: async operation requires coarse-grained memory regions
def async_array(arr):
    coarse_arr = roc.coarsegrain_array(shape=arr.shape, dtype=arr.dtype)
    coarse_arr[:] = arr
    return coarse_arr

with stream.auto_synchronize():
    # every operation in this context will be launched asynchronously
    # by using the ROC stream

    dchunks = [] # holds the result chunks

    # for each chunk
    for a, b, c, d in zip(sA, sB, sC, sD):
        # create coarse grain arrays
        asyncA = async_array(a)
        asyncB = async_array(b)
        asyncC = async_array(c)
        asyncD = async_array(d)

        # transfer to device
        dA = roc.to_device(asyncA, stream=stream)
        dB = roc.to_device(asyncB, stream=stream)
        dC = roc.to_device(asyncC, stream=stream)
        dD = roc.to_device(asyncD, stream=stream, copy=False) # no copying

        # launch kernel
        roc_discriminant(dA, dB, dC, out=dD, stream=stream)

        # retrieve result
        dD.copy_to_host(asyncD, stream=stream)

        # store device pointers to prevent them from being freed before
        # the kernel is scheduled
        device_ptrs.extend([dA, dB, dC, dD])

        # store result reference
        dchunks.append(asyncD)

# put result chunks into the output array 'D'
for i, result in enumerate(dchunks):
    sD[i][:] = result[:]

# data is ready at this point inside D
print(D)
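
The partitioning step above relies on two properties of `np.split`: the pieces are views of the original array (which is why writing into `sD[i]` fills `D`), and the array length must divide evenly by the number of sections. The sketch below checks both on a small CPU-only array; the sizes are illustrative, not the ones used above. `np.array_split` is shown as one option when the length does not divide evenly, which is an alternative not used in the original example.

```python
import numpy as np

N = 12
chunksize = 4
chunkcount = N // chunksize  # integer number of chunks

D = np.zeros(N, dtype=np.float32)
sD = np.split(D, chunkcount)

# Each chunk is a view: writing into a chunk updates D itself.
for i, chunk in enumerate(sD):
    chunk[:] = i

print(D)
print(all(np.shares_memory(chunk, D) for chunk in sD))

# np.split raises ValueError when the split is uneven;
# np.array_split accepts it and returns chunks of unequal length.
try:
    np.split(np.zeros(10), 3)
    uneven_ok = True
except ValueError:
    uneven_ok = False

sizes = [len(c) for c in np.array_split(np.zeros(10), 3)]
print(uneven_ok, sizes)
```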