NumPy and numba
===============


.. code:: python

    from __future__ import print_function
    import numba
    import numpy as np
    import math
    import llvm
    import ctypes
    print("numba version: %s \nNumPy version: %s\nllvm version: %s" % (numba.__version__,np.__version__, llvm.__version__))

.. parsed-literal::

    numba version: 0.12.0 
    NumPy version: 1.7.1
    llvm version: 0.12.0


*NumPy* provides a compact, typed container for homogenous arrays of
data. This is ideal to store data homogeneous data in *Python* with
little overhead. *NumPy* also provides a set of functions that allows
manipulation of that data, as well as operating over it. There is a rich
ecosystem around *Numpy* that results in fast manipulation of *Numpy
arrays*, as long as this manipulation is done using pre-baked operations
(that are typically vectorized). This operations are usually provided by
extension modules and written in C, using the *Numpy C API*.

*numba* allows generating native code from Python functions just by
adding decorators. This code is wrapped and directly callable from
within Python.

There are many cases where you want to apply code to your *NumPy* data,
and need that code to execute fast. You may get lucky and have the
functions you want already written in the extensive *NumPy* ecosystem.
Otherwise you will end with some code that is not that fast, but that
you can improve execution time by writing code the "NumPy way". If it is
not fast enough, you can write an extension module using the *Numpy C
API*. Writing an extension module will take quite a bit of time, and
forces you to a slow compile-install-test cycle.

Wouldn't it be great if you could just write code in *Python* that
describes your function and execute it at speed similar to that of what
you could achieve with the extension module, all without leaving the
*Python interpreter*? *numba* allows that.

*Numba* is *NumPy* aware. This means:

-  It natively understands *NumPy* arrays, shapes and dtypes. *NumPy*
   arrays are supported as native types.
-  It knows how to index/slice a *NumPy* array without relying on
   *Python*.
-  It provides supports for generating *ufuncs* and *gufuncs* from
   inside the *Python* interpreter.


*Numba* understands *NumPy* arrays
----------------------------------


NumPy arrays are understood by numba. By using the *numba.typeof* we can
see that numba not only knows about the arrays themshelves, but also
about its shape and underlying dtypes:

.. code:: python

    array = np.arange(2000, dtype=np.float_)
    numba.typeof(array)


.. parsed-literal::

    array(float64, 1d, C)


.. code:: python

    numba.typeof(array.reshape((2,10,100)))


.. parsed-literal::

    array(float64, 3d, C)


From the point of view of *numba*, there are three factors that identify
the array type:

-  The base type (dtype)

-  The number of dimensions (len(shape)). Note that for numba the arity
   of each dimension is not considered part of the type, only the
   dimension count.

-  The arrangement of the array. 'C' for C-like, 'F' for FORTRAN-like,
   'A' for generic strided array.

It is easy to illustrate how the arity of an array is not part of the
dtype in *numba* with the following samples:

.. code:: python

    numba.typeof(array.reshape((2,10,100))) == numba.typeof(array.reshape((4,10,50)))


.. parsed-literal::

    True


.. code:: python

    numba.typeof(array.reshape((2,10,100))) == numba.typeof(array.reshape((40,50)))


.. parsed-literal::

    False


Specifying an array type in *numba*
-----------------------------------


In *numba* you can build the type specification by basing it on the base
type for the array.

So if numba.float32 specifies a single precision floating point number:

.. code:: python

    numba.float32


.. parsed-literal::

    float32


numba.float32[:] specifies an single dimensional array of single
precision floating point numbers:

.. code:: python

    numba.float32[:]


.. parsed-literal::

    array(float32, 1d, A)


Adding dimensions is just a matter of tweaking the slice description
passed:

.. code:: python

    numba.float32[:,:]


.. parsed-literal::

    array(float32, 2d, A)


.. code:: python

    numba.float32[:,:,:,:]


.. parsed-literal::

    array(float32, 4d, A)


As you can see, all the specified arrays are "strided". It is possible to specify that a given dimension is consecutive in memory by using ::1 in such dimension. This allows describing C-type arrays and F-type arrays.

row-major arrays (C-type) have the elements in the last dimension packed together:

.. code:: python

    numba.float32[:,::1]


.. parsed-literal::

    array(float32, 2d, C)


.. code:: python

    numba.float32[:,:,::1]


.. parsed-literal::

    array(float32, 3d, C)


column-major arrays (F-type) have elements in the first dimension packed
together:

.. code:: python

    numba.float32[::1,:]


.. parsed-literal::

    array(float32, 2d, F)


.. code:: python

    numba.float32[::1,:,:]


.. parsed-literal::

    array(float32, 3d, F)


The use of any other dimension as consecutive is handled as a strided
array:

.. code:: python

    numba.float32[:,::1,:]


.. parsed-literal::

    array(float32, 3d, A)


Note that the array arrangement does change the type, although numba
will easily coerce a C or FORTRAN array into a strided one:

.. code:: python

    numba.float32[::1,:] == numba.float32[:,::1]


.. parsed-literal::

    False


.. code:: python

    numba.float32[::1,:] == numba.float32[:,:]


.. parsed-literal::

    False


NumPy arrays as arguments
-------------------------


In all cases, NumPy arrays are passed to numba functions by reference.
This means that any change performed on the argument in the function
will modify the contents of the original matrix. This behavior maps the
usual NumPy semantics. So the array values passed as arguments to a
*numba* functions can be considered as input/output arguments.

*Numba* knows how to index and slice a *Numpy* array natively
-------------------------------------------------------------


Indexing and slicing of *NumPy arrays* are handled natively by *numba*.
This means that it is possible to *index* and *slice* a *Numpy array* in
*numba* compiled code without relying on the *Python runtime*. In
practice this means that *numba* code running on *NumPy arrays* will
execute with a level of efficiency close to that of C.

Let's make a simple function that uses indexing. For example a really
naive implementation of a sum:

.. code:: python

    def sum_all(A):
        """Naive sum of elements of an array... assumes one dimensional array of floats"""
        acc = 0.0
        for i in xrange(A.shape[0]):
            acc += A[i]
        return acc
.. code:: python

    sample_array = np.arange(10000.0)
The *pure Python* approach of this naive function is quite underwhelming
speed-wise:

.. code:: python

    %timeit sum_all(sample_array)

.. parsed-literal::

    100 loops, best of 3: 4.87 ms per loop


If we relied on *NumPy* it would be much faster:

.. code:: python

    %timeit np.sum(sample_array)

.. parsed-literal::

    100000 loops, best of 3: 17 µs per loop


But with *numba* the speed of that *naive* code is quite good:

.. code:: python

    sum_all_jit = numba.jit('float64(float64[:])')(sum_all)
.. code:: python

    %timeit sum_all_jit(sample_array)

.. parsed-literal::

    100000 loops, best of 3: 9.82 µs per loop


This is in part possible because of the native support for indexing in
*numba*. The function can be compiled in a nopython context, that makes
it quite fast:

.. code:: python

    sum_all_jit = numba.jit('float64(float64[:])', nopython=True)(sum_all)
*Numba* supports generating *NumPy* *ufuncs* and *gufuncs*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


In *NumPy* there are universal
functions(\ `*ufuncs* <http://docs.scipy.org/doc/numpy/reference/ufuncs.html>`__)
and generalized universal functions
(`*gufuncs* <http://docs.scipy.org/doc/numpy/reference/c-api.generalized-ufuncs.html>`__).

-  *ufuncs* are quite established and allows mapping of scalar
   operations over *NumPy* arrays. The resulting vectorized operation
   follow *Numpy*'s broadcasting rules.

-  *gufuncs* are a generalization of *ufuncs* that allow vectorization
   of *kernels* that work over the inner dimensions of the arrays. In
   this context a *ufunc* would just be a *gufunc* where all the
   operands of its kernels have 0 dimensions (i.e. are scalars).

*ufuncs* and *gufuncs* are typically built using *Numpy's C API*.
*Numba* offers the possibility to create *ufuncs* and *gufuncs* within
the Python interpreter, using Python functions to describe the
*kernels*. To access this functionality *numba* provides the *vectorize*
decorator and the *GUVectorize* class.

The *vectorize* decorator
^^^^^^^^^^^^^^^^^^^^^^^^^


*vectorize* is the decorator to be used to build *ufuncs*. Note that as
of this writing, it is not in the numba namespace, but in
*numba.vectorize*.

.. code:: python

    print(numba.vectorize.__doc__)

.. parsed-literal::

    vectorize(ftylist[, target='cpu', [**kws]])
    
        A decorator to create numpy ufunc object from Numba compiled code.
    
        Args
        -----
        ftylist: iterable
            An iterable of type signatures, which are either
            function type object or a string describing the
            function type.
    
        target: str
                A string for code generation target.  Defaults to 'cpu'.
    
        Returns
        --------
    
        A NumPy universal function
    
        Example
        -------
            @vectorize(['float32(float32, float32)',
                        'float64(float64, float64)'])
            def sum(a, b):
                return a + b
    
        
Its usage is pretty simple, just write the scalar function you want for your _ufunc_. Then just decorate it with _vectorize_, passing as a parameter the signatures you want your code to be generated. The generated _ufunc_ will be handled as any other _NumPy_ _ufunc_. That means that type promotions and broadcasting rules follow those of _NumPy_.

For example, let's write a sample ufunc that performs a lineal
interpolation between A and B. The 'kernel' will look like this:

.. code:: python

    def lerp(A,B,factor):
        """interpolates A and B by factor"""
        return (1-factor)*A + factor*B
.. code:: python

    lerp(0.0, 10.0, 0.3)


.. parsed-literal::

    3.0


Now let's do a *ufunc* for the floating point types. I will be using
vectorize as a function, but remember that you could just add the
decorator in the definition of the kernel itself.

.. code:: python

    lerp_ufunc = numba.vectorize(['float32(float32, float32, float32)', 'float64(float64, float64, float64)'])(lerp)
Now we can run our lerp with all of *NumPy*'s niceties, like
broadcasting of one operand (in this case the factor).

.. code:: python

    A = np.arange(0.0, 100000.0, 2.0)
    B = np.arange(100000.0, 0.0, -2.0)
    F = np.array([0.5] * 50000)
.. code:: python

    lerp_ufunc(A,B,0.5)


.. parsed-literal::

    array([ 50000.,  50000.,  50000., ...,  50000.,  50000.,  50000.])


It is also quite fast:

.. code:: python

    %timeit lerp_ufunc(A, B, 0.5)

.. parsed-literal::

    1000 loops, best of 3: 281 µs per loop


.. code:: python

    %timeit lerp(A, B, 0.5)

.. parsed-literal::

    10000 loops, best of 3: 142 µs per loop


Note that in this case the same original function can be used to
generate the *ufunc* and to execute the equivalent *NumPy* vectorized
version. When executing there will be differences in how the expression
is evaluated.

When using *NumPy* the expression is evaluated one operation at a time,
over the entire vector. *Numba* generated code will evaluate the full
expression in one go, for each element. The *numba* approach approach
avoids having temporal intermmediate arrays built, as well as avoiding
revisiting operands that are being used more than once in a expression.
This is useful with big arrays of data where there will be savings in
process memory usage as well as better cache usage.

.. code:: python

    def sample_poly(x):
        return x - x*x*x/6.0 + x*x*x*x*x/120.0
.. code:: python

    S = np.arange(0, np.pi, np.pi/36000000)
.. code:: python

    sample_poly_ufunc = numba.vectorize(['float32(float32)', 'float64(float64)'])(sample_poly)
.. code:: python

    %timeit sample_poly(S)

.. parsed-literal::

    1 loops, best of 3: 1.98 s per loop


.. code:: python

    %timeit sample_poly_ufunc(S)

.. parsed-literal::

    1 loops, best of 3: 465 ms per loop


It is also worth noting that numba's *vectorize* provides similar
convenience to that of NumPy's *vectorize*, but with performance similar
to an *ufunc*.

For example, let's take the example in `NumPy's vectorize
documentation <http://docs.scipy.org/doc/numpy/reference/generated/numpy.vectorize.html>`__:

.. code:: python

    def myfunc(a, b):
        "Return a-b if a>b, otherwise return a+b"
        if a > b:
            return a - b
        else:
            return a + b
.. code:: python

    myfunc_input = np.arange(100000.0)
.. code:: python

    numpy_vec_myfunc = np.vectorize(myfunc) 
    %timeit numpy_vec_myfunc(myfunc_input, 50000)

.. parsed-literal::

    10 loops, best of 3: 24.2 ms per loop


.. code:: python

    numba_vec_myfunc = numba.vectorize(['float64(float64, float64)'])(myfunc)
    %timeit numba_vec_myfunc(myfunc_input, 50000)

.. parsed-literal::

    1000 loops, best of 3: 480 µs per loop


The guvectorize decorator
^^^^^^^^^^^^^^^^^^^^^^^^^


In the same way the *vectorize* allows building *NumPy*'s *ufuncs* from
inside the Python interpreter just by writing the expression that forms
the kernel; guvectorize allows building *Numpy*'s *gufuncs* without the
need of writing a C extension module.

.. code:: python

    print(numba.guvectorize.__doc__)

.. parsed-literal::

    guvectorize(ftylist, signature, [, target='cpu', [**kws]])
    
        A decorator to create numpy generialized-ufunc object from Numba compiled
        code.
    
        Args
        -----
        ftylist: iterable
            An iterable of type signatures, which are either
            function type object or a string describing the
            function type.
    
    
        signature: str
            A NumPy generialized-ufunc signature.
            e.g. "(m, n), (n, p)->(m, p)"
    
        target: str
                A string for code generation target.  Defaults to "cpu".
    
        Returns
        --------
    
        A NumPy generialized universal-function
    
        Example
        -------
            @guvectorize(['void(int32[:,:], int32[:,:], int32[:,:])',
                          'void(float32[:,:], float32[:,:], float32[:,:])'],
                          '(x, y),(x, y)->(x, y)')
            def add_2d_array(a, b):
                for i in range(c.shape[0]):
                    for j in range(c.shape[1]):
                        c[i, j] = a[i, j] + b[i, j]
    
        
`*Generalized universal
functions* <http://docs.scipy.org/doc/numpy/reference/c-api.generalized-ufuncs.html>`__
require a *dimension signature* for the *kernel* they implement. We call
this the *NumPy generalized-ufunc signature*. Do not confuse this
*dimension signature* with the *type signature* that *numba* requires.

The *dimension signature* describe the dimensions of the operands, as
well as constraints to the values of those dimensions so that the
function can work. For example, a matrix multiply gufunc will have a
dimension signature like '(m,n), (n,p) -> (m,p)'. This means:

-  First operand has two dimensions (m,n).

-  Second operand has two dimensions (n,p).

-  Result has two dimensions (m,p).

The names of the dimensions are symbolic, and dimensions having the same
name must match in *arity* (number of elements). So in our matrix
multiply example the following constraints have to be met:

-  elements in a row of the first operand *must equal* the elements in a
   column of the second operand. Both are 'n'.

As you can see, the arity of the dimensions of the result can be infered
from the source operands:

-  Result will have as many rows as rows has the first operand. Both are
   'm'.

-  Result will have as many columns as columns has the second operand.
   Both are 'p'.

You can find more information about *Numpy generalized-ufunc signature*
in `NumPy's
documentation <http://docs.scipy.org/doc/numpy/reference/c-api.generalized-ufuncs.html#details-of-signature>`__.

When building a *gufunc* you start by writing the kernel function. You
have to bear in mind which is the dimension signature and write the code
to handle a single element. The function will take both, *input
arguments* and *results*, as parameters. The *result* will be the last
argument of the function. There shouldn't be any return value to the
function, as the result should be placed directly in the last argument.
The result of modifying an argument other than the result argument is
undefined.

.. code:: python

    def matmulcore(A, B, C):
        m, n = A.shape
        n, p = B.shape
        for i in range(m):
            for j in range(p):
                C[i, j] = 0
                for k in range(n):
                    C[i, j] += A[i, k] * B[k, j]
Note how the m, n and p are extracted from the input arguments. The
extraction of *n* is done twice to reinforce the notion that both are
the same. That extraction is not really needed, as you could directly
index inside the shape when defining the range.

To build a *generalized-ufunc* from the function is just a matter of
using the *guvectorize* decorator. The interface to *guvectorize* is
akin that of *vectorize*, but also requires the *NumPy
generalized-ufunc* signature.

.. code:: python

    gu_matmul = numba.guvectorize(['float32[:,:], float32[:,:], float32[:,:]', 'float64[:,:], float64[:,:], float64[:,:]' ],
                                  '(m,n),(n,p)->(n,p)')(matmulcore)
The result is a *gufunc*, that can be used as any othe *gufunc* in
*NumPy*. Broadcasting and type promotion rules are those on *NumPy*.

.. code:: python

    matrix_ct = 10000
    gu_test_A = np.arange(matrix_ct * 2 * 4, dtype=np.float32).reshape(matrix_ct, 2, 4)
    gu_test_B = np.arange(matrix_ct * 4 * 5, dtype=np.float32).reshape(matrix_ct, 4, 5)
.. code:: python

    %timeit gu_matmul(gu_test_A, gu_test_B)

.. parsed-literal::

    100 loops, best of 3: 8.22 ms per loop


Some recap on the difference between *vectorize* and *guvectorize*:

1. *vectorize* generates *ufuncs*, *guvectorize* generates
   *generalized-ufuncs*

2. In both, *type signatures* for the arguments and return type are
   given in a list, but in *vectorize* function signatures are used to
   specify them, while on *guvectorize* a list of types is used instead,
   the resulting type being specified last.

3. When using *guvectorize* the *NumPy generalized-ufunc* signature
   needs to be supplied. This signature must be coherent with the type
   signatures.

4. Remember that with *guvectorize* the result is passed as the last
   argument to the *kernel*, while in *vectorize* the result is returned
   by the *kernel*.


Caveats
-------


There are some points to take into account when dealing with *NumPy*
arrays inside *numba* compiled functions:

No range checks when indexing in *numba*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


In *numba* generated code **no range checking is performed when
indexing**. No range checking is performed as to allow generating code
that performs better. So you need to be careful about the code as any
indexing that goes out of range can cause a bad-access or a memory
overwrite, potentially *crashing* the interpreter process.

.. code:: python

    arr = np.arange(16.0).reshape((2,8))
    print(arr)
    print(arr.strides)

.. parsed-literal::

    [[  0.   1.   2.   3.   4.   5.   6.   7.]
     [  8.   9.  10.  11.  12.  13.  14.  15.]]
    (64, 8)


As indexing in Python is 0-based, the following line will cause an
exception error, as arr.shape[1] is 8, and the range for the column
number is (0..7):

.. code:: python

    arr[0, arr.shape[1]] = 42.0

::


    ---------------------------------------------------------------------------
    IndexError                                Traceback (most recent call last)

    <ipython-input-48-06c1e5ef06d0> in <module>()
    ----> 1 arr[0, arr.shape[1]] = 42.0
    

    IndexError: index 8 is out of bounds for axis 1 with size 8


However, as *numba* doesn't have range checks, it will index anyways. As
the index is out of bounds, and the array is in C order, the value will
overflow into the next row. In this case, in the place reserved for
element (1, 0).

.. code:: python

    @numba.jit("void(f8[:,:])")
    def bad_access(array):
        array[0, array.shape[1]] = 42.0
    bad_access(arr)
    print(arr)
In this sample case we where lucky, as the *out-of-bounds* access fell
into the allocated range. Unchecked indexing can potentially cause
illegal accesses and crash the process running the *Python interpreter*.
However, it allows for code generation that produces faster code.