ROC Ufuncs and Generalized Ufuncs
==================================

This page describes the ROC ufunc-like object.

To support the programming pattern of ROC programs, ROC Vectorize and
GUVectorize cannot produce a conventional ufunc.  Instead, a ufunc-like
object is returned.  This object is a close analog but not fully compatible
with a regular NumPy ufunc.  The ROC ufunc adds support for passing
intra-device arrays (already on the GPU device) to reduce traffic over the
PCI-express bus.  It also accepts a `stream` keyword for launching in
asynchronous mode.

Basic ROC UFunc Example
-----------------------

::

    import math
    from numba import vectorize
    import numpy as np

    @vectorize(['float32(float32, float32, float32)',
                'float64(float64, float64, float64)'],
               target='roc')
    def roc_discriminant(a, b, c):
        return math.sqrt(b ** 2 - 4 * a * c)

    N = 10000
    dtype = np.float32

    # prepare the input
    A = np.array(np.random.sample(N), dtype=dtype)
    B = np.array(np.random.sample(N) + 10, dtype=dtype)
    C = np.array(np.random.sample(N), dtype=dtype)

    D = roc_discriminant(A, B, C)

    print(D)  # print result

Calling Device Functions from ROC UFuncs
----------------------------------------

All ROC ufunc kernels have the ability to call other ROC device functions::

    from numba import vectorize, roc

    # define a device function
    @roc.jit('float32(float32, float32, float32)', device=True)
    def roc_device_fn(x, y, z):
        return x ** y / z

    # define a ufunc that calls our device function
    @vectorize(['float32(float32, float32, float32)'], target='roc')
    def roc_ufunc(x, y, z):
        return roc_device_fn(x, y, z)

Generalized ROC ufuncs
----------------------

Generalized ufuncs may be executed on the GPU using ROC, analogous to the
ROC ufunc functionality.  This may be accomplished as follows::

    from numba import guvectorize

    @guvectorize(['void(float32[:,:], float32[:,:], float32[:,:])'],
                 '(m,n),(n,p)->(m,p)', target='roc')
    def matmulcore(A, B, C):
        ...

.. seealso::
   Matrix multiplication example.
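The intra-device array support mentioned at the top of this page applies to
both kinds of ufunc.  The following is a minimal sketch, reusing the
`roc_discriminant` ufunc and the inputs `A`, `B`, `C` from the basic example
above; it assumes the ufunc-like object returns a device array when all of
its inputs are device arrays::

    from numba import roc

    # copy the inputs to the device once, up front
    dA = roc.to_device(A)
    dB = roc.to_device(B)
    dC = roc.to_device(C)

    # the computation runs on the GPU; no implicit host transfer of inputs
    dD = roc_discriminant(dA, dB, dC)

    # copy the result back explicitly, only when it is needed on the host
    D = dD.copy_to_host()

Keeping intermediate results on the device this way avoids a round trip over
the PCI-express bus between consecutive ufunc calls.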
Async execution: A Chunk at a Time
----------------------------------

Partitioning your data into chunks allows computation and memory transfer to
be overlapped.  This can increase the throughput of your ufunc and enables
your ufunc to operate on data that is larger than the memory capacity of
your GPU.  For example::

    import math
    from numba import vectorize, roc
    import numpy as np

    # the ufunc kernel
    def discriminant(a, b, c):
        return math.sqrt(b ** 2 - 4 * a * c)

    roc_discriminant = vectorize(['float32(float32, float32, float32)'],
                                 target='roc')(discriminant)

    N = int(1e+8)
    dtype = np.float32

    # prepare the input
    A = np.array(np.random.sample(N), dtype=dtype)
    B = np.array(np.random.sample(N) + 10, dtype=dtype)
    C = np.array(np.random.sample(N), dtype=dtype)
    D = np.zeros(A.shape, dtype=A.dtype)

    # create a ROC stream
    stream = roc.stream()

    chunksize = 1000000           # must be an integer for np.split below
    chunkcount = N // chunksize

    # partition numpy arrays into chunks
    # no copying is performed
    sA = np.split(A, chunkcount)
    sB = np.split(B, chunkcount)
    sC = np.split(C, chunkcount)
    sD = np.split(D, chunkcount)

    device_ptrs = []

    # helper function; async transfers require coarse-grain memory regions
    def async_array(arr):
        coarse_arr = roc.coarsegrain_array(shape=arr.shape, dtype=arr.dtype)
        coarse_arr[:] = arr
        return coarse_arr

    with stream.auto_synchronize():
        # every operation in this context will be launched asynchronously
        # by using the ROC stream

        dchunks = []  # holds the result chunks

        # for each chunk
        for a, b, c, d in zip(sA, sB, sC, sD):
            # create coarse-grain arrays
            asyncA = async_array(a)
            asyncB = async_array(b)
            asyncC = async_array(c)
            asyncD = async_array(d)

            # transfer to device
            dA = roc.to_device(asyncA, stream=stream)
            dB = roc.to_device(asyncB, stream=stream)
            dC = roc.to_device(asyncC, stream=stream)
            dD = roc.to_device(asyncD, stream=stream, copy=False)  # no copying

            # launch kernel
            roc_discriminant(dA, dB, dC, out=dD, stream=stream)

            # retrieve result
            dD.copy_to_host(asyncD, stream=stream)

            # store device pointers to prevent them from being freed before
            # the kernel is scheduled
            device_ptrs.extend([dA, dB, dC, dD])

            # store result reference
            dchunks.append(asyncD)

    # put result chunks into the output array 'D'
    for i, result in enumerate(dchunks):
        sD[i][:] = result[:]

    # data is ready at this point inside D
    print(D)
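As a sanity check, the chunked GPU result can be compared against a plain
NumPy computation on the host.  This is a small sketch, not part of the
original example; the tolerance passed to `np.allclose` is an arbitrary
choice for float32 data::

    # host-side reference computation of the same discriminant
    expected = np.sqrt(B ** 2 - 4 * A * C)

    # the GPU result, assembled from chunks, should match the reference
    assert np.allclose(D, expected, rtol=1e-5)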