===============
Numba IR Stages
===============

To allow Numba as a general purpose compiler, we provide different entry
points. These entry points also allow for further decoupling of the Numba
architecture. We propose several new Intermediate Representations (IRs),
from high-level to low-level:

    * The Python AST IR (input to a numba frontend)
    * Normalized IR

        - Like a Python AST, but contains normalized structures
          (assignments, comparisons, list comprehensions, etc)
    * Untyped IR in SSA form

        - Expanded control flow (with annotations)
    * Typed IR in SSA form
    * Low-level IR in SSA form
    * Final LLVM IR, the final input for LLVM. This IR is unportable
      since the sizes of types are fixed.

All IRs except the last are portable across machine architectures.
We get the following options for rewrites:

.. digraph:: stages

    "Python AST" -> Normalized
    Normalized -> "Stage 1"
    "Stage 1" -> "Untyped SSA"
    "Untyped SSA" -> "Stage 2"
    "Stage 2" -> "Typed SSA IR"
    "Typed SSA IR" -> "Stage 3"
    "Stage 3" -> "Low-level IR"
    "Low-level IR" -> "Stage 4"
    "Stage 4" -> "Final LLVM"

Each stage specifies a point for a series of IR transformations that together
define the input for the next stage. Each rewrite may target a different IR stage:

    * The input to `Stage 1` is an AST with all high-level syntactic constructs
      left intact. This allows high-level transformations that operate or
      expand most suitable at an abstract syntax level.
    * The input to `Stage 2` is a Function with a sequence of basic blocks
      (expanded control flow), such that all control flow can be handled
      uniformly (and there may still be high-level annotations that signify
      where loops are (so that we don't have to find them again from the
      dominator tree and CFG), where exceptions are raised and caught, etc).
      Def/use and variable merges are explicit.

      Expressions and statements are encoded in sequences of AST expression trees.
      Expressions result in implicit Values which can be referred to in subsequent
      expression trees without re-evaluation (we can replace the
      CloneNode/CloneableNode mechanisms that currently allow subtree sharing).
    * The input to `Stage 3` is the same as to `Stage 2`, except that
      it additionally contains type information (and type promotions).
    * `Stage 4` is a low-level three-address code representation that
      still has polymorphic operations, such as ``c = add(a, b)``,
      where ``a`` and ``b`` can be operands of any scalar type. A final
      pass then lowers these constructs to specialized LLVM instructions.

IRs up to the low-level IR (input to `Stage 4`) should still contain explicit
variable stores, so that passes can rewrite variable assignments to for instance
assignments to struct or extension type instance members. Keeping stores and loads
to and from these variables in the original order is important in the context
of closures (or any function call which can modify a stack variable through a pointer).

.. NOTE:: Maybe we should only do this for cell- and free variables, and fix
          closures before CFA?

We must make sure to support preloads for variable definitions across basic
blocks, e.g.::

    if ...:
        A = ...
    else:
        A = ...

    for i in range(...):
        use(A[0])

In this example we want to preload ``A.data`` (and ``a.strides[0]``). This can
probably work equally well if each expression value is an implicit definition
and we have a way to find Values given a Phi. We then get::

    ValueC = BinOp(ValueA, Add, ValueB)
    Store(Name('c'), ValueC)

Instead of::

    Assign(Name('c'), BinOp(Name('a'), Add, Name('b')))

Intermediate Representations
============================

Each IR consists of two layers, namely a higher-level Abstract Syntax Tree
encoding, specified, verified and generated by our variant of ASDL
schemas. We add the symbol ``@`` to signal that the given type
is to be treated as a "unique" object, i.e. one that compares by identity
as opposed to structural equality. These objects each carry a unique id
and are serialized in a table (and they may participate in circular
references, i.e. in a graph):

Serialization to LLVM IR (or a direct textual serialization) will
consist of something like a generated table, e.g.

.. code-block:: pycon

    >>> mod = schema.parse("foo = @Foo(str name, foo attr)")

    >>> foo1 = mod.Foo("foo1")
    >>> foo2 = mod.Foo("foo2", foo1)
    >>> foo1.attr = foo2

    >>> foo1
    Foo(name="foo1", attr=Foo("foo2", Foo(name="foo1", attr=...)))

    >>> build_llvm_ir(foo1)   # name,  id,    attr
    !0 = metadata !{ metadata !"foo1", i64 0, i64 1 }
    !1 = metadata !{ metadata !"foo2", i64 1, i64 0 }

Attributes may also be hidden using ``\``, which means the attribute
is not considered a child for the purposes of visitors or term
rewrites::

    foo = @Foo(str name, foo \attr)

Use of Schemas
--------------
We can use our schemas to:

    * Validate IR instances
    * Generate Python AST classes with typed properties and fast
      visitor dispatching
    * Generate Higher- or Lower-level LLVM IR
    * Generate conversion code to and from an ATerm representation
    * Generate a flat representation. E.g. a form of Three Address Code
    * Generate an implementation in other languages that can load a
      serialized representation and construct an AST in that langauge
    * Generate type definitions and serialization routines in
      other languages.

        .. NOTE:: This can help other languages target Numba as
                  a backend compiler more easily, since they can
                  build up the IR using in-memory data structures for
                  the IR most suitable to their needs.

    * Generate definitions for use in Attribute Grammars ([#]_)
    * Executable IR (:ref:`executable`)

.. _llvm_ir:

LLVM IR
^^^^^^^

We can generate automatic mapping code to map schema instances to
opaquely typed LLVM IR automatically, which is the abstract syntax
generated post-order. E.g. ``a + b * c`` becomes:

.. code-block:: llvm

    !0 = metadata !{ metadata !"operator", i8* "Mul" }
    !1 = metadata !{ metadata !"operator", i8* "Add" }

    define i8* some_func(i8* %a, i8* %b, i8* %c) {
    entry:
      %0 = call i8* @numba.ir.BinOp(%b, metadata !{0}, %c)
      %1 = call i8* @numba.ir.BinOp(%a, metadata !{1}, %0)
      ret %1
    }

The LLVM IR contains the high-level block structure, i.e. an ``if`` statement
will generate IR along the following lines:

.. code-block:: llvm

    define i8* @func() {
    entry:
        %0 = blockaddress(@func, %bb_test)
        %1 = blockaddress(@func, %bb_true)
        %2 = blockaddress(@func, %bb_false)
        %3 = i8*  @If(i8* %0, i8* %1, i8* %2)

      bb_test:
        ...

      bb_true:
        ...

      bb_false:
        ...
    }

An LLVM IR instance can be mapped back losslessly to an IR instance of a
different representation (e.g. a DAG).

We can use a well-defined abstraction that can map these higher-level
constructs to the lower-level equilvent. This can be used
simultenously by:

    * The control flow graph builder
    * Any IR that wants control flow expanded
    * The code generator

We can use this
construct to expand our IR to IR that corresponds more closely to
the final IR we would generate, where all control flow is expanded
to branches::

    define i8* @func() {
    entry:
        br label %bb_test

      bb_test:
        %test = ...
        br i1 %test, label %bb_true, label %bb_false

      bb_true:
        ...
        br label %bb_false

      bb_false:
        ...
        br label %bb_exit

      bb_exit:
        ...

    }

Passes can do not care about special control structures can then execute
on this IR.

.. _executable:

Executable IR
^^^^^^^^^^^^^

There are two ideas:

    * Implement a library to which the generated abstract
      LLVM IR can link. E.g. implement functions such as
      ``@BinOp(%add, %lhs, %rhs)`` (we can call this function
      ``PyNumber_Add``).

If we define new lowered IRs are a specialized subset of higher-level
IRs, we get execution for free:

   * Generate conversion code to and from a high-level Python AST
     or source code.

     For instance, ``PointerIndex(base_type, node, index)`` becomes
     ``Call(func=Name('base_type'), args=[Subscript(subnode, index)])``.
     This function can then be compiled and interpreted with Python,
     using abstract argument inputs.

Alternatively, if we already know which operations our data corresponds
to, we can generate a simple AST or bytecode evaluator.

Initial Python-like IR
----------------------

The initial, Python-like, IR is a subset of a Python AST, the
syntax exludes:

    * ``FunctionDef`` and ``ClassDef``, which are normalized
      to ``Assign`` of the function and subsequent
      decorator applications and assignments
    * No list, dict, set or generators comprehensions, which are
      normalized to ``For(...)`` etc + method calls to ``list.append``,
      etc.
    * Normalized comparisons

The initial IR is what numba decorators produce given a pure
Python AST, function or class as input.

Sample schema

.. code-block:: ocaml

    module initial {

        mod = NumbaModule(unit* stats)

        unit
          = lambda
          | class

        -- functions --
        lambda
          = Lambda(posinfo pos, funcmeta meta, str name, arguments args,
                   expr body)

        funcmeta
          = FunctionMetaData(
                -- locals={'foo': double}
                str* names,     -- 'foo'
                nbtype* types,  -- double
                bool nopython,
            )

        -- classes --
        class
          = ClassExpr(posinfo pos, bool is_jit, attrtable table, method* methods)

        attrtable
          = AttributeTable(str* attrnames, nbtype* attrtypes)

        method
          = Method(posinfo pos, methodsignature signature, stat* body)

        -- Types --

        type = nbtype
        nbtype
          = char | short | int_ | long_ | longlong
          | uchar | ushort | uint | ulong | ulonglong
          | ...
          | functype
          | methodtype

        methodtype
          = MethodSignature(functype signature,
                            bool is_staticmethod,
                            bool is_classmethod,
                            bool is_jit, -- whether this is a jit or
                                         -- autojit method
                           )
    }

.. NOTE:: Numba would construct this before starting any pipeline stage.

Untyped IR in SSA form
----------------------

Untyped IR in SSA form would be constructed internally by numba during
and after the CFA pass and before type inference. This adds to the
``initial`` schema control flow information such as::

    * SSA
    * Stack variable stack allocation (non-ssa variables)
    * Def-use and use-def chains

The high-level CFG at this stage is gone, and we
have an untyped IR in SSA form (that is, it contains PhiNode AST nodes with uses
from variables of incoming blocks). It also has a reference to all live phis
in preorder according to the dominator tree in the ``FunctionDef``.

Furthermore:

    * ``ast.Name`` is rewritten to ``NameTarget``, ``NameReference`` or ``NameParam``
    * ``If``, ``While`` and ``For`` lose the ``else`` clause
    * In-place assignments are normalized

::

    module untyped {

        function
          = FunctionDef(phi \all_phis, ...)

        phi
          = Phi(use* \incoming)

        def
          = NameTarget(posinfo pos, str id, use* \uses)
          | phi

        use
          = NameReference(posinfo pos, str id, nbtype type, def \def)
          | PhiRef(phi \def)

        lambda
          = Lambda(posinfo pos, funcmeta meta, str name, arguments args,
                   expr body, cfg cfg)

        stmt
          = For(expr target,
                expr iter,
                stmt* body)
          | ...

    }

Typed IR in SSA form
--------------------

The typed IR is similar to the untyped IR, except that every (sub-)expression
is annotated with a type.

Furthermore, the AST is augmented with
``Promotion`` terms, which promote a variable for a merge in a subsequent
CFG block. E.g.::

    # y_0
    if x > 10:
        # block_if
        y = 2           # y_1
    else:
        # block_else
        y = 3.0         # y_2

In the example above, ``block_if`` will contain a ``Promotion`` with a use
of ``y_1``, replacing all uses of ``y_1`` with the promotion value (which
can only ever be a single phi node).

I.e. we rewrite ``y_1 = 2`` to ``[ y_1 = 2 ; %0 = Promote(y_1, float) ]`` and
``PhiNode(NameRef(y_1), NameRef(y_2))`` to ``PhiNode(%0, NameRef(y_2))``.

All types adhere themselves to a schema, e.g.::

    type
      = Array(type dtype, int ndim)
      | Pointer(type base_type, int? size)
      | ...

Since the schema specifies the interfaces of the different nodes, users
can supply their own node implementation (something we can do with the
type system). Hence user-written classes can be automatically
instantiated instead of generated ones. The code generator can still
emit code for serialization.

Low-level Portable IR
=====================

The low-level portable IR is a low-level, platform agnostic, IR that:

    * The IR contains only low-level, native types such as ``int_``,
      ``long_``, pointers, structs, etc. The notion of high-level
      concepts such as arrays or objects is gone.

.. _cfg:

Control Flow
============

We can have a single abstraction that can create basic blocks and
link blocks together. For instance we for the following structure::

    For(expr target, expr iter, stmt* body, stmt* orelse)

We have the following CFG:

.. digraph:: cfg

    entry -> condition -> body -> condition -> orelse -> exit

In this CFG, ``break`` and ``continue`` correspond to the following edges:

.. digraph:: break

    break -> exit
    continue -> condition

We can use this single abstraction to:

   * Create a CFG at any time in any IR stage. For instance we can
     generate LLVM IR automatically with expanded control flow.

     .. NOTE:: This also includes the code generator, which doesn't
               have to handle any block structures.

   * Retain high-level information that allows for simple
     classification and accurate error reporting.

     .. NOTE:: This is important to allow us to easily rewrite entire
               control flow structures, such as outlining of the prange
               construct.

IR Suitability
==============
An important consideration for an IR is how well transformations are
defined over it, and how efficient those transformations are. For instance,
a pass that combines instructions works far better on a simple three-address
representation than an AST. Design considerations ([#]_):

    * Level and machine independence
    * Structure
    * Expressiveness
    * Appropriateness for transformation and code generation


To evaluate some of these metrics we will look at some concretions.

Structure
---------
We can consider expanded or abstract control flow:

    * We want to compute an SSA graph. Clearly we need a control flow
      graph in order to perform this computation.

    * We want to *outline* a prange construct. Consider what this looks
      like using unexpanded and expanded control flow.

    Unexpanded::

        For(iter=prange(...)) ->
            [ MakeClosure(For(iter=prange(adjust_bounds(...))) ; InvokeThreadPool ]

    Expanded:

        * Match a loop
        * Scan preceding statements for ``t = iter(prange(...))``
        * Outline ``[ t ; loop ]``
        * Apply ``adjust_bounds`` to ``iter(prange(...))``
        * Perform range transformation to rewrite using counters


Consider also error reporting facilities. For instance, let's assume
we want to disallow break from parallel loops.

    Unexanded:

        ``Break -> error``

    Expanded:

        * Scan for ``prange`` (similar to above, namely match a loop,
          scan preceding statement for ``iter()``)
        * Find a CFG edge that points outside the loop body region
          (e.g. the exit block of the loop, or a block further outside
          the region)

Clearly, some transformations are easier to perform using expanded control flow, e.g.:

    * Computing SSA
    * Dead-code elimination
    * Control flow simplification
    * Transformations to structured control flow
    * and so forth

Expressiveness
--------------
Consider a high-level type system, that has:

    * Full or partial functions as first-class values

        * This subsumes closurs and all methods (bound, unbound, class, static)
    * Types as first-class values
    * (Extension) Classes as first-class values
    * Containers such as

        * Arrays
        * Typed lists, sets, dicts, channels, and so forth

Program instances using these constructs must be quickly identifyable to aid
easy tranformation. For instance, ``obj.method(10)`` should be quickly transformable
using rules along the following lines:

.. code-block:: ocaml

    Attribute(Value(type=object_), attr)
        -> PyObject_GetAttrString(value, attr)

    Attribute(ExtensionMethod(..., is_jit=True), value, attr)
        -> ExtensionMethodStruct(value, method)

with:

.. code-block:: c

      [
          typedef {
              double (*method1)(double);
              ...
          } vtab_struct;

          vtab_struct *vtab = *(vtab_struct **) (((char *) obj) + vtab_offset)
          void *method = vtab[index]
      ]


A call for object then exands to ``PyObject_Call``, and a method call to a
``NativeCall`` of ``ExtensionMethodStruct.method`` with first argument
``ExtensionMethodStruct.value`` ('self').

A later pass can then combine consecutive instructions and optimize them, i.e.

.. code-block:: ocaml

    [
        method = PyObject_GetAttrString(obj, attr);
        PyObject_Call(method, value, args)
    ]
        -> PyObject_CallMethod(obj, attr, args)

A similar pass for extension methods would then avoid building the
intermediate struct.

.. NOTE:: Note how we could combine the first and second passes to detect method
          calls. Such a rule would be well-expressed on a tree or graph structure.
          The first rule as specified would work well on both a tree or three-address
          code. The latter is specified best on TAC.

The point we're trying to make is that we need to encode many different kinds
of first-class values, which have high-level types. These constucts must be
quickly identifyable and transformable using a high-level type system that
can support constructs of the high-level language.

Using a low-level type system such as LLVM's or C's means high-level types
need low-level equivalents, which means one of two things:

    * You use an abstract type classifier, which needs to be composable
    * You use a lower-level representation which more closely resembles
      the type of the value in its lowered representation (e.g. a struct
      of a function pointer and an object pointer).

LLVM facilitates the latter point, but is in no way caters to
the first. Yet what we want is the former, for the sake of expressiveness.

Reusing LLVM Passes
===================
Although LLVM IR does not cater well to some of the high-level
transformations we want to make, it provides a useful infrastructure to
do certain things. This includes:

    * SSA Graph Computation (as well as reaching definitions, etc)
    * CFG simplification
    * Finding SCCs in various graphs (CFG, SSA, call graph, etc)
    * Build a call graph
    * Aiding lower-level and TAC transformations

Below we will discuss a plan for resuability.

SSA
---
We currently construct our own CFG and compute the SSA graph from the
CFG containing abstract statements that represent definitions and uses
(loads and stores).

As mentioned, the advantage of having our own CFG construction includes:

    * Expressiveness of high-level operations
    * Automatic code generation and translation into IRs with expanded
      control flow

The advantage of having our own reaching definitions (reused from Cython's
control flow, ``Cython/Compiler/FlowControl``) is the great support for
errors and warnings for:

    * Definitely unbound variables (error)
    * Potentionally unbound variables (warning)
    * Unused variables (warning)

as well as other categories. LLVM on the other hand classifies reads
from uninitialized variables as undefined behaviour.

Numba initializes potentially unbound variables to a "bad" value (``nan``,
``0xdeadbeef``, etc). We can use the same mechanism to construct valid
LLVM IR, and compute the SSA graph from our subset program, consisting only
of:

    * expanded control flow
    * variable stores
    * variable loads

This LLVM-constructed SSA graph can be mapped back to our high-lever IR
with relative ease if we simply remembered which LLVM basic block associates
with which basic block in our IR.


.. NOTE:: This operates under the assumption that we have a general
          framework that can map LLVM transformations back to our IR
          representation automatically.

Type Dependence Graph Construction
----------------------------------
From the SSA graph we compute a type graph by inferring all variable
assignments. This graph often has cycles, due to the back-edge in
the CFG for loops. For instance we may have the following code::

    x = 0
    for i in range(10):
        x = f(x)

    y = x

Where ``f`` is an external autojit function (i.e., it's output type depends
on it's dynamic input type).

We get the following type graph:

.. digraph:: typegraph

    x_0 -> int

    x_1 -> x_0
    x_1 -> x_2
    x_2 -> f
    f -> x_1

    y_0 -> x_1

    i_0 -> range
    range -> int

Below we show the correspondence of the SSA variable definitions to their
basic blocks:

.. digraph:: cfg

    "entry: [ x_0, i_0 ]" -> "condition: [ x_1 ]" -> "body: [ x_2 ]"
    "body: [ x_2 ]" -> "condition: [ x_1 ]"
    "condition: [ x_1 ]" -> "exit: [ y_2 ]"

.. entry -> x_0
.. entry -> i_0
.. condition -> x_1
.. body -> x_2
.. exit -> y_2

Our goal is to resolve this type graph in topological order, such that
we know the type for each variable definition (``x_0``, ``x_1``, etc).

In order to do a topological sort, we compute the condensation graph
by finding the strongly connected components and condensing them
into single graph nodes. The resulting graph looks like this:

.. digraph:: typegraph

    x_0 -> int
    SCC0 -> x_0
    y_0 -> SCC0

    i_0 -> range
    range -> int

And ``SCC0`` contains the cycle in the type graph. We now have a
well-defined preorder for which we can process each node in topological
order on the transpose graph, doing the following:

    * If the node represents a concrete type, propagate result along edge
    * If the node represents a function over an argument of the given input types,
      infer the result type of this function
    * For each SCC, process all internal nodes using fixpoint iteration
      given all input types to the SCC. Update internal nodes with their result
      types.

Building a Call Graph
---------------------
This will be useful to use LLVM for in order to:

    * Efficiently infer types of direct or indirect uses of recursion for autojit
      functions or methods
    * Detect such recusion by letting LLVM find the SCCs in the call graph, and
      resolving in an analogous and cooperative manner to how we resolve the type graph

Writing LLVM Passes
-------------------
We have a few constructs that may be better written as LLVM passes over simpler
(lower-level) constructs (with exapnded control flow, three-address code arithmetic
instructions, etc). We showed one such example already, but one can think
of many others.

We can define the penultimate IR in LLVM, such that any passes before code generator
and after lowering of high-level constructions an be performed on this IR. This allows
us to use the full power of LLVM where it is most adequate. Furthermore, we can likely
do away with (most of) our code generator if we define our IR stages well.

References
==========
.. [#] Attribute Grammars in Haskell with UUAG, A. Loh, http://www.andres-loeh.de/AGee.pdf
.. [#] Advanced Compiler Design and Implementation, Steven S. Muchnick