NBEP 7: CUDA External Memory Management Plugins

Author:Graham Markall, NVIDIA
Contributors:Thomson Comer, Peter Entschev, Leo Fang, John Kirkham, Keith Kraus
Date:March 2020

Background and goals

The CUDA Array Interface enables sharing of data between different Python libraries that access CUDA devices. However, each library manages its own memory distinctly from the others. For example:

The goal of this NBEP is to describe a plugin interface that enables Numba’s internal memory management to be replaced with an external memory manager by the user. When the plugin interface is in use, Numba no longer directly allocates or frees any memory when creating arrays, but instead requests allocations and frees through the external manager.


Provide an External Memory Manager (EMM) interface in Numba.

  • When the EMM is in use, Numba will make all memory allocation using the EMM. It will never directly call functions such as CuMemAlloc, cuMemFree, etc.
  • When not using an External Memory Manager (EMM), Numba’s present behaviour is unchanged (at the time of writing, the current version is the 0.48 release).

If an EMM is to be used, it will entirely replace Numba’s internal memory management for the duration of program execution. An interface for setting the memory manager will be provided.

Device vs. Host memory

An EMM will always take responsibility for the management of device memory. However, not all CUDA memory management libraries also support managing host memory, so a facility for Numba to continue the management of host memory whilst ceding control of device memory to the EMM will be provided.

Deallocation strategies

Numba’s internal memory management uses a deallocation strategy designed to increase efficiency by deferring deallocations until a significant quantity are pending. It also provides a mechanism for preventing deallocations entirely during critical sections, using the defer_cleanup() context manager.

  • When the EMM is not in use, the deallocation strategy and operation of defer_cleanup remain unchanged.
  • When the EMM is in use, the deallocation strategy is implemented by the EMM, and Numba’s internal deallocation mechanism is not used. For example:
    • A similar strategy to Numba’s could be implemented by the EMM, or
    • Deallocated memory might immediately be returned to a memory pool.
  • The defer_cleanup context manager may behave differently with an EMM - an EMM should be accompanied by documentation of the behaviour of the defer_cleanup context manager when it is in use.
    • For example, a pool allocator could always immediately return memory to a pool even when the context manager is in use, but could choose not to free empty pools until defer_cleanup is not in use.

Management of other objects

In addition to memory, Numba manages the allocation and deallocation of events, streams, and modules (a module is a compiled object, which is generated from @cuda.jit-ted functions). The management of streams, events, and modules should be unchanged by the presence or absence of an EMM.

Asynchronous allocation / deallocation

An asynchronous memory manager might provide the facility for an allocation or free to take a CUDA stream and execute asynchronously. For freeing, this is unlikely to cause issues since it operates at a layer beneath Python, but for allocations this could be problematic if the user tries to then launch a kernel on the default stream from this asynchronous memory allocation.

The interface described in this proposal will not be required to support asynchronous allocation and deallocation, and as such these use cases will not be considered further. However, nothing in this proposal should preclude the straightforward addition of asynchronous operations in future versions of the interface.


In order to minimise complexity and constrain this proposal to a reasonable scope, the following will not be supported:

  • Using different memory manager implementations for different contexts. All contexts will use the same memory manager implementation - either the Numba internal implementation or an external implementation.
  • Changing the memory manager once execution has begun. It is not practical to change the memory manager and retain all allocations. Cleaning up the entire state and then changing to a different memory allocator (rather than starting a new process) appears to be a rather niche use case.
  • Any changes to the __cuda_array_interface__ to further define its semantics, e.g. for acquiring / releasing memory as discussed in Numba Issue #4886 - these are independent, and can be addressed as part of separate proposals.
  • Managed memory / UVM is not supported. At present Numba does not support UVM - see Numba Issue #4362 for discussion of support.

Interface for Plugin developers

New classes and functions will be added to numba.cuda.cudadrv.driver:

  • BaseCUDAMemoryManager and HostOnlyCUDAMemoryManager: base classes for EMM plugin implementations.
  • set_memory_manager: a method for registering an external memory manager with Numba.

These will be exposed through the public API, in the numba.cuda module. Additionally, some classes that are already part of the driver module will be exposed as part of the public API:

  • MemoryPointer: used to encapsulate information about a pointer to device memory.
  • MappedMemory: used to hold information about host memory that is mapped into the device address space (a subclass of MemoryPointer).
  • PinnedMemory: used to hold information about host memory that is pinned (a subclass of mviewbuf.MemAlloc, a class internal to Numba).

As an alternative to calling the set_memory_manager function, an environment variable can be used to set the memory manager. The value of the environment variable should be the name of the module containing the memory manager in its global scope, named _numba_memory_manager:

When this variable is set, Numba will automatically use the memory manager from the specified module. Calls to set_memory_manager will issue a warning, but otherwise be ignored.

Plugin Base Classes

An EMM plugin is implemented by inheriting from the BaseCUDAMemoryManager class, which is defined as:

class BaseCUDAMemoryManager(object, metaclass=ABCMeta):
    def memalloc(self, size):
        Allocate on-device memory in the current context. Arguments:

        - `size`: Size of allocation in bytes

        Returns: a `MemoryPointer` to the allocated memory.

    def memhostalloc(self, size, mapped, portable, wc):
        Allocate pinned host memory. Arguments:

        - `size`: Size of the allocation in bytes
        - `mapped`: Whether the allocated memory should be mapped into the CUDA
                    address space.
        - `portable`: Whether the memory will be considered pinned by all
                      contexts, and not just the calling context.
        - `wc`: Whether to allocate the memory as write-combined.

        Returns a `MappedMemory` or `PinnedMemory` instance that owns the
        allocated memory, depending on whether the region was mapped into
        device memory.

    def mempin(self, owner, pointer, size, mapped):
        Pin a region of host memory that is already allocated. Arguments:

        - `owner`: An object owning the memory - e.g. a `DeviceNDArray`.
        - `pointer`: The pointer to the beginning of the region to pin.
        - `size`: The size of the region to pin.
        - `mapped`: Whether the region should also be mapped into device memory.

        Returns a `MappedMemory` or `PinnedMemory` instance that refers to the
        allocated memory, depending on whether the region was mapped into device

    def initialize(self):
        Perform any initialization required for the EMM plugin to be ready to

    def get_memory_info(self):
        Returns (free, total) memory in bytes in the context

    def get_ipc_handle(self, memory):
        Return an `IpcHandle` from a GPU allocation. Arguments:

        - `memory`: A `MemoryPointer` for which the IPC handle should be created.

    def reset(self):
        Clear up all memory allocated in this context.

    def defer_cleanup(self):
        Returns a context manager that ensures the implementation of deferred
        cleanup whilst it is active.

    def interface_version(self):
        Returns an integer specifying the version of the EMM Plugin interface
        supported by the plugin implementation. Should always return 1 for
        implementations described in this proposal.

All of the methods of an EMM plugin are called from within Numba - they never need to be invoked directly by a Numba user.

The initialize method is called by Numba prior to any memory allocations being requested. This gives the EMM an opportunity to initialize any data structures, etc., that it needs for its normal operations. The method may be called multiple times during the lifetime of the program - subsequent calls should not invalidate or reset the state of the EMM.

The memalloc, memhostalloc, and mempin methods are called when Numba requires an allocation of device or host memory, or pinning of host memory. Device memory should always be allocated in the current context.

get_ipc_handle is called when an IPC handle for an array is required. Note that there is no method for closing an IPC handle - this is because the IpcHandle object constructed by get_ipc_handle contains a close() method as part of its definition in Numba, which closes the handle by calling cuIpcCloseMemHandle. It is expected that this is sufficient for general use cases, so no facility for customising the closing of IPC handles is provided by the EMM Plugin interface.

get_memory_info may be called at any time after initialize.

reset is called as part of resetting a context. Numba does not normally call reset spontaneously, but it may be called at the behest of the user. Calls to reset may even occur before initialize is called, so the plugin should be robust against this occurrence.

defer_cleanup is called when the numba.cuda.defer_cleanup context manager is used from user code.

interface_version is called by Numba when the memory manager is set, to ensure that the version of the interface implemented by the plugin is compatible with the version of Numba in use.

Representing pointers

Device Memory

The MemoryPointer class is used to represent a pointer to memory. Whilst there are various details of its implementation, the only aspect relevant to EMM plugin development is its initialization. The __init__ method has the following interface:

class MemoryPointer:
    def __init__(self, context, pointer, size, owner=None, finalizer=None):
  • context: The context in which the pointer was allocated.
  • pointer: A ctypes pointer (e.g. ctypes.c_uint64) holding the address of the memory.
  • size: The size of the allocation in bytes.
  • owner: The owner is sometimes set by the internals of the class, or used for Numba’s internal memory management, but need not be provided by the writer of an EMM plugin - the default of None should always suffice.
  • finalizer: A method that is called when the last reference to the MemoryPointer object is released. Usually this will make a call to the external memory management library to inform it that the memory is no longer required, and that it could potentially be freed (though the EMM is not required to free it immediately).

Host Memory

Memory mapped into the CUDA address space (which is created when the memhostalloc or mempin methods are called with mapped=True) is managed using the MappedMemory class:

class MappedMemory(AutoFreePointer):
    def __init__(self, context, pointer, size, owner, finalizer=None):
  • context: The context in which the pointer was allocated.
  • pointer: A ctypes pointer (e.g. ctypes.c_void_p) holding the address of the allocated memory.
  • size: The size of the allocated memory in bytes.
  • owner: A Python object that owns the memory, e.g. a DeviceNDArray instance.
  • finalizer: A method that is called when the last reference to the MappedMemory object is released. For example, this method could call cuMemFreeHost on the pointer to deallocate the memory immediately.

Note that the inheritance from AutoFreePointer is an implementation detail and need not concern the developer of an EMM plugin - MemoryPointer is higher in the MRO of MappedMemory.

Memory that is only in the host address space and has been pinned is represented with the PinnedMemory class:

class PinnedMemory(mviewbuf.MemAlloc):
    def __init__(self, context, pointer, size, owner, finalizer=None):
  • context: The context in which the pointer was allocated.
  • pointer: A ctypes pointer (e.g. ctypes.c_void_p) holding the address of the pinned memory.
  • size: The size of the pinned region in bytes.
  • owner: A Python object that owns the memory, e.g. a DeviceNDArray instance.
  • finalizer: A method that is called when the last reference to the PinnedMemory object is released. This method could e.g. call cuMemHostUnregister on the pointer to unpin the memory immediately.

Providing device memory management only

Some external memory managers will support management of on-device memory but not host memory. To make it easy to implement an EMM plugin using one of these managers, Numba will provide a memory manager class with implementations of the memhostalloc and mempin methods. An abridged definition of this class follows:

class HostOnlyCUDAMemoryManager(BaseCUDAMemoryManager):
    # Unimplemented methods:
    # - memalloc
    # - get_memory_info

    def memhostalloc(self, size, mapped, portable, wc):
        # Implemented.

    def mempin(self, owner, pointer, size, mapped):
        # Implemented.

    def initialize(self):
        # Implemented.
        # Must be called by any subclass when its initialize() method is
        # called.

    def reset(self):
        # Implemented.
        # Must be called by any subclass when its reset() method is
        # called.

    def defer_cleanup(self):
        # Implemented.
        # Must be called by any subclass when its defer_cleanup() method is
        # called.

A class can subclass the HostOnlyCUDAMemoryManager and then it only needs to add implementations of methods for on-device memory. Any subclass must observe the following rules:

  • If the subclass implements __init__, then it must also call HostOnlyCUDAMemoryManager.__init__, as this is used to initialize some of its data structures (self.allocations and self.deallocations).
  • The subclass must implement memalloc and get_memory_info.
  • The initialize and reset methods perform initialisation of structures used by the HostOnlyCUDAMemoryManager.
    • If the subclass has nothing to do on initialisation (possibly) or reset (unlikely) then it need not implement these methods.
    • However, if it does implement these methods then it must also call the methods from HostOnlyCUDAMemoryManager in its own implementations.
  • Similarly if defer_cleanup is implemented, it should enter the context provided by HostOnlyCUDAManager.defer_cleanup() prior to yielding (or in the __enter__ method) and release it prior to exiting (or in the __exit__ method).

Import order

The order in which Numba and the library implementing an EMM Plugin should not matter. For example, if rmm were to implement and register an EMM Plugin, then:

from numba import cuda
import rmm


import rmm
from numba import cuda

are equivalent - this is because Numba does not initialize CUDA or allocate any memory until the first call to a CUDA function - neither instantiating and registering an EMM plugin, nor importing numba.cuda causes a call to a CUDA function.

Numba as a Dependency

Adding the implementation of an EMM Plugin to a library naturally makes Numba a dependency of the library where it may not have been previously. In order to make the dependency optional, if this is desired, one might conditionally instantiate and register the EMM Plugin like:

    import numba
    from mylib.numba_utils import MyNumbaMemoryManager
    print("Numba not importable - not registering EMM Plugin")

so that mylib.numba_utils, which contains the implementation of the EMM Plugin, is only imported if Numba is already present. If Numba is not available, then mylib.numba_utils (which necessarily imports numba), will never be imported.

It is recommended that any library with an EMM Plugin includes at least some environments with Numba for testing with the EMM Plugin in use, as well as some environments without Numba, to avoid introducing an accidental Numba dependency.

Example implementation - A RAPIDS Memory Manager (RMM) Plugin

An implementation of an EMM plugin within the Rapids Memory Manager (RMM) is sketched out in this section. This is intended to show an overview of the implementation in order to support the descriptions above and to illustrate how the plugin interface can be used - different choices may be made for a production-ready implementation.

The plugin implementation consists of additions to python/rmm/rmm.py:

# New imports:
from contextlib import context_manager
# RMM already has Numba as a dependency, so these imports need not be guarded
# by a check for the presence of numba.
from numba.cuda import (HostOnlyCUDAMemoryManager, MemoryPointer, IpcHandle,

# New class implementing the EMM Plugin:
class RMMNumbaManager(HostOnlyCUDAMemoryManager):
    def memalloc(self, size):
        # Allocates device memory using RMM functions. The finalizer for the
        # allocated memory calls back to RMM to free the memory.
        addr = librmm.rmm_alloc(bytesize, 0)
        ctx = cuda.current_context()
        ptr = ctypes.c_uint64(int(addr))
        finalizer = _make_finalizer(addr, stream)
        return MemoryPointer(ctx, ptr, size, finalizer=finalizer)

   def get_ipc_handle(self, memory):
        Get an IPC handle for the memory with offset modified by the RMM memory
        # This implementation provides a functional implementation and illustrates
        # what get_ipc_handle needs to do, but it is not a very "clean"
        # implementation, and it relies on borrowing bits of Numba internals to
        # initialise ipchandle.
        # A more polished implementation might make use of additional functions in
        # the RMM C++ layer for initialising IPC handles, and not use any Numba
        # internals.
        ipchandle = (ctypes.c_byte * 64)()  # IPC handle is 64 bytes
        source_info = cuda.current_context().device.get_device_identity()
        ptr = memory.device_ctypes_pointer.value
        offset = librmm.rmm_getallocationoffset(ptr, 0)
        return IpcHandle(memory, ipchandle, memory.size, source_info,

    def get_memory_info(self):
        # Returns a tuple of (free, total) using RMM functionality.
        return get_info() # Function defined in rmm.py

    def initialize(self):
        # Nothing required to initialize RMM here, but this method is added
        # to illustrate that the super() method should also be called.

    def defer_cleanup(self):
        # Does nothing to defer cleanup - a full implementation may choose to
        # implement a different policy.
        with super().defer_cleanup():

    def interface_version(self):
        # As required by the specification
        return 1

# The existing _make_finalizer function is used by RMMNumbaManager:
def _make_finalizer(handle, stream):
    Factory to make the finalizer function.
    We need to bind *handle* and *stream* into the actual finalizer, which
    takes no args.

    def finalizer():
        Invoked when the MemoryPointer is freed
        librmm.rmm_free(handle, stream)

    return finalizer

# Utility function register `RMMNumbaManager` as an EMM:
def use_rmm_for_numba():

_numba_memory_manager = RMMNumbaManager

Example usage

A simple example that configures Numba to use RMM for memory management and creates a device array is as follows:

# example.py
import rmm
import numpy as np

from numba import cuda


a = np.zeros(10)
d_a = cuda.to_device(a)

Running this should result in output similar to the following:

Note that there is some scope for improvement in RMM for detecting the line number at which the allocation / free occurred, but this is outside the scope of the example in this proposal.

Setting the memory manager through the environment

Rather than calling rmm.use_rmm_for_numba() in the example above, the memory manager could also be set to use RMM globally with an environment variable, so the Python interpreter is invoked to run the example as:

Numba internal changes

This section is intended primarily for Numba developers - those with an interest in the external interface for implementing EMM plugins may choose to skip over this section.

Current model / implementation

At present, memory management is implemented in the Context class. It maintains lists of allocations and deallocations:

  • allocations is a numba.core.utils.UniqueDict, created at context creation time.
  • deallocations is an instance of the _PendingDeallocs class, and is created when Context.prepare_for_use() is called.

These are used to track allocations and deallocations of:

  • Device memory
  • Pinned memory
  • Mapped memory
  • Streams
  • Events
  • Modules

The _PendingDeallocs class implements the deferred deallocation strategy - cleanup functions (such as cuMemFree) for the items above are added to its list of pending deallocations by the finalizers of objects representing allocations. These finalizers are run when the objects owning them are garbage-collected by the Python interpreter. When the addition of a new cleanup function to the deallocation list causes the number or size of pending deallocations to exceed a configured ratio, the _PendingDeallocs object runs deallocators for all items it knows about and then clears its internal pending list.

See Deallocation Behavior for more details of this implementation.

Proposed changes

This section outlines the major changes that will be made to support the EMM plugin interface - there will be various small changes to other parts of Numba that will be required in order to adapt to these changes; an exhaustive list of these is not provided.

Context changes

The numba.cuda.cudadrv.driver.Context class will no longer directly allocate and free memory. Instead, the context will hold a reference to a memory manager instance, and its memory allocation methods will call into the memory manager, e.g.:

def memalloc(self, size):
    return self.memory_manager.memalloc(size)

def memhostalloc(self, size, mapped=False, portable=False, wc=False):
    return self.memory_manager.memhostalloc(size, mapped, portable, wc)

def mempin(self, owner, pointer, size, mapped=False):
    if mapped and not self.device.CAN_MAP_HOST_MEMORY:
        raise CudaDriverError("%s cannot map host memory" % self.device)
    return self.memory_manager.mempin(owner, pointer, size, mapped)

def prepare_for_use(self):

def get_memory_info(self):

def get_ipc_handle(self, memory):
    return self.memory_manager.get_ipc_handle(memory)

def reset(self):
    # ... Already-extant reset logic, plus:

The memory_manager member is initialised when the context is created.

The memunpin method (not shown above but currently exists in the Context class) has never been implemented - it presently raises a NotImplementedError. This method arguably un-needed - pinned memory is immediately unpinned by its finalizer, and unpinning before a finalizer runs would invalidate the state of PinnedMemory objects for which references are still held. It is proposed that this is removed when making the other changes to the Context class.

The Context class will still instantiate self.allocations and self.deallocations as before - these will still be used by the context to manage the allocations and deallocations of events, streams, and modules, which are not handled by the EMM plugin.

New components of the driver module

  • BaseCUDAMemoryManager: An abstract class, as defined in the plugin interface above.
  • HostOnlyCUDAMemoryManager: A subclass of BaseCUDAMemoryManager, with the logic from Context.memhostalloc and Context.mempin moved into it. This class will also create its own allocations and deallocations members, similarly to how the Context class creates them. These are used to manage the allocations and deallocations of pinned and mapped host memory.
  • NumbaCUDAMemoryManager: A subclass of HostOnlyCUDAMemoryManager, which also contains an implementation of memalloc based on that presently existing in the Context class. This is the default memory manager, and its use preserves the behaviour of Numba prior to the addition of the EMM plugin interface - that is, all memory allocation and deallocation for Numba arrays is handled within Numba.
    • This class shares the allocations and deallocations members with its parent class HostOnlyCUDAMemoryManager, and it uses these for the management of device memory that it allocates.
  • The set_memory_manager function, which sets a global pointing to the memory manager class. This global initially holds NumbaCUDAMemoryManager (the default).

Staged IPC

Staged IPC should not take ownership of the memory that it allocates. When the default internal memory manager is in use, the memory allocated for the staging array is already owned. When an EMM plugin is in use, it is not legitimate to take ownership of the memory.

This change can be made by applying the following small patch, which has been tested to have no effect on the CUDA test suite:

diff --git a/numba/cuda/cudadrv/driver.py b/numba/cuda/cudadrv/driver.py
index 7832955..f2c1352 100644
--- a/numba/cuda/cudadrv/driver.py
+++ b/numba/cuda/cudadrv/driver.py
@@ -922,7 +922,11 @@ class _StagedIpcImpl(object):
         with cuda.gpus[srcdev.id]:

-        return newmem.own()
+        return newmem


Alongside the addition of appropriate tests for new functionality, there will be some refactoring of existing tests required, but these changes are not substantial. Tests of the deallocation strategy (e.g. TestDeallocation, TestDeferCleanup) will need to be modified to ensure that they are examining the correct set of deallocations. When an EMM plugin is in use, they will need to be skipped.

Prototyping / experimental implementation

Some prototype / experimental implementations have been produced to guide the designs presented in this document. The current implementations can be found in:

Current implementation status

RMM Plugin

For a minimal example, a simple allocation and free using RMM works as expected. For the example code (similar to the RMM example above):

import rmm
import numpy as np

from numba import cuda


a = np.zeros(10)
d_a = cuda.to_device(a)

We see the following output:

This output is similar to the expected output from the example usage presented above (though note that the pointer addresses and timestamps vary compared to the example), and provides some validation of the example use case.

CuPy Plugin

from nbep7.cupy_mempool import use_cupy_mm_for_numba
import numpy as np

from numba import cuda


a = np.zeros(10)
d_a = cuda.to_device(a)

The prototype CuPy plugin has somewhat primitive logging, so we see the output:

Numba CUDA Unit tests

As well as providing correct execution of a simple example, all relevant Numba CUDA unit tests also pass with the prototype branch, for both the internal memory manager and the RMM EMM Plugin.


The unit test suite can be run with the RMM EMM Plugin with:

A summary of the unit test suite output is:

When running with the built-in Numba memory management, the output is:

i.e. the changes for using an external memory manager do not break the built-in Numba memory management. There are an additional 6 skipped tests, from:

  • TestDeallocation: skipped as it specifically tests Numba’s internal deallocation strategy.
  • TestDeferCleanup: skipped as it specifically tests Numba’s implementation of deferred cleanup.
  • TestCudaArrayInterface.test_ownership: skipped as Numba does not own memory when an EMM Plugin is used, but ownership is assumed by this test case.


The test suite can be run with the CuPy plugin using:

This plugin implementation is presently more primitive than the RMM implementation, and results in some errors with the unit test suite:

The 8 errors are due to a lack of implementation of get_ipc_handle in the CuPy EMM Plugin implementation. It is expected that this implementation will be re-visited and completed so that CuPy can be used stably as an allocator for Numba in the future.