Chapter 1: Basic Compiler¶
AST frontend and LLVM backend¶
This chapter introduces the fundamental components of our compiler: the AST frontend and LLVM backend. We show how to parse Python functions into an intermediate representation (RVSDG-IR) and then compile them to executable code using LLVM.
The chapter covers:
- How to implement a frontend that converts Python AST to RVSDG-IR
- How to implement a backend that generates LLVM IR
- How to JIT compile and execute the generated code
Imports and Setup¶
Import all necessary modules for the basic compiler implementation.
from __future__ import annotations
import builtins
from collections import ChainMap
from llvmlite import binding as llvm
from llvmlite import ir
from sealir import ase, rvsdg
from sealir.llvm_pyapi_backend import (
CodegenCtx,
CodegenState,
JitCallable,
PythonAPI,
_codegen_loop,
)
from sealir.rvsdg import grammar as rg
from typing_extensions import TypedDict
from utils import Pipeline, Report
Frontend Implementation¶
The frontend accepts a Python function object, reads its source code, and parses its Abstract Syntax Tree (AST). It then transforms the AST into a Regionalized Value-State Dependence Graph (RVSDG), a representation that simplifies further intermediate representation (IR) processing. The RVSDG uses a data-flow centric encoding in which control-flow constructs are mapped into regions. These regions function much like ordinary operations, with clearly defined sets of input and output ports. Additionally, state is explicitly encoded as I/O, so that every operation maintains a pure operational appearance.
class FrontendOutput(TypedDict):
rvsdg_expr: object
dbginfo: object
@Pipeline
def pipeline_frontend(fn, pipeline_report=Report.Sink()) -> FrontendOutput:
"""
Frontend code is all encapsulated in sealir.rvsdg.restructure_source
"""
with pipeline_report.nest("Frontend", default_expanded=True) as report:
rvsdg_expr, dbginfo = rvsdg.restructure_source(fn)
report.append("Debug Info on RVSDG", dbginfo.show_sources())
report.append("RVSDG", rvsdg.format_rvsdg(rvsdg_expr))
return {"rvsdg_expr": rvsdg_expr, "dbginfo": dbginfo}
Simple Frontend Example¶
Here's a simple function to illustrate the frontend
if __name__ == "__main__":
def exercise_frontend_simple(x, y):
return x + y
cres = pipeline_frontend(fn=exercise_frontend_simple)
# Results are returned as a SimpleNamespace
print(rvsdg.format_rvsdg(cres.rvsdg_expr))
transformed_exercise_frontend_simple = Func (Args (ArgSpec 'x' (PyNone)) (ArgSpec 'y' (PyNone))) $0 = Region[239] <- !io x y { $1 = PyBinOp + $0[0] $0[1], $0[2] } [314] -> !io=$1[0] !ret=$1[1]
Alternatively use Report to display the results
if __name__ == "__main__":
report = Report("Frontend", default_expanded=True)
cres = pipeline_frontend(
fn=exercise_frontend_simple, pipeline_report=report
)
report.display()
Frontend
The function's RVSDG-IR includes a region:
$0 = Region[236] <- !io x y
{
...
} [302] -> !io=$1[0] !ret=$1[1] x=$0[1] y=$0[2]
The region has three input ports: !io
, x
, and y
.
Its output ports are !io
, !ret
, x
, and y
.
Here, !io
represents the state. Because Python functions are imperative,
each function carries an implicit state that must be updated for any
side-effect.
The !ret
port holds the function's return value.
The x
and y
output ports convey the region's internal value state, but the
function node ignores these values.
The single operation in the function is:
$1 = PyBinOp + $0[0] $0[1], $0[2]
This is a Python binary operation. It uses input operands from $0
, which is
the header of the function's region. In this notation, $0[n]
refers to a
specific port: $0[0]
corresponds to !io
, $0[1]
to x
, and $0[2]
to
y
.
Complex Frontend Example¶
Below is a more intricate example that requires restructuring of the
control flow. This is due to the use of the break
statement to exit a
for-loop.
RVSDG enforces structured control flow. Only three types of control-flow regions are permitted:
- Linear region, as
Region {...}
; - If-Else region, as
If {...} Else {...} EndIf
; - Loop region, as
Loop {...} EndLoop
.
if __name__ == "__main__":
def exercise_frontend_loop_if_break(x, y):
c = 0
for i in range(x):
if i > y:
break
c += i
return c
report = Report("Frontend", default_expanded=True)
cres = pipeline_frontend(
fn=exercise_frontend_loop_if_break, pipeline_report=report
)
report.display()
Frontend
Observations from the RVSDG-IR above:
- The for-loop is restructured into a
Loop
composed ofIf-Else
regions. - The
Loop
region is tail-controlled; its first output port acts as the loop condition. See!_loopcond_0002
. - An extra
If-Else
follows the loop to adjust the value ofi
.
The beauty of the structured control-flow is that everything can be encapsulated as a plain data-flow operation, including any region. Everything is just an operation with some input and output ports. This simplifies the rest of the compiler.
Backend Implementation¶
SealIR includes a lightweight LLVM backend that emits the Python C-API, which executes the RVSDG-IR. The example below demonstrates how to use this backend
def _determine_arity(root: ase.SExpr) -> int:
"""Helper function to get the arity of the function node"""
match root:
case rg.Func(args=rg.Args() as args):
return len(args.arguments)
case _:
raise TypeError(root._head)
def backend(root, ns=builtins.__dict__, codegen_extension=None):
"""
Emit LLVM using Python C-API.
root: the RVSDG expression for the function
ns: is the dictionary of global names. A JIT is assumed. Object pointer for
each key is used.
codegen_extension: Optional. If defined, it is the function to call when an
unknown operation is encountered by the code generator. This argument
is used in the later chapters.
Warning:
- This is for testing only.
- Does NOT do proper memory management.
- Does NOT do proper error handling.
"""
llvm.initialize()
llvm.initialize_native_target()
llvm.initialize_native_asmprinter()
mod = ir.Module()
ll_byte = ir.IntType(8)
ll_pyobject_ptr = ll_byte.as_pointer()
# Make LLVM function
arity = _determine_arity(root)
assert arity >= 1
actual_num_args = arity
fnty = ir.FunctionType(
ll_pyobject_ptr, [ll_pyobject_ptr] * actual_num_args
)
fn = ir.Function(mod, fnty, name="foo")
# init entry block and builder
builder = ir.IRBuilder(fn.append_basic_block())
retval_slot = builder.alloca(ll_pyobject_ptr)
builder.store(ll_pyobject_ptr(None), retval_slot) # init retval to NULL
bb_main = builder.append_basic_block()
builder.branch(bb_main)
builder.position_at_end(bb_main)
ctx = CodegenCtx(
llvm_module=mod,
llvm_func=fn,
builder=builder,
pyapi=PythonAPI(builder),
global_ns=ChainMap(ns, __builtins__),
codegen_extension=codegen_extension,
)
# Emit the function body
ase.traverse(root, _codegen_loop, CodegenState(context=ctx))
return mod
JIT Compilation¶
The following function takes a LLVM module and JIT compile it for execution:
def jit_compile(mod, rvsdg_expr):
"""JIT compile LLVM module into an executable function for this process."""
llvm_ir = str(mod)
# Create JIT
lljit = llvm.create_lljit_compiler()
rt = (
llvm.JITLibraryBuilder()
.add_ir(llvm_ir)
.export_symbol("foo")
.add_current_process()
.link(lljit, "foo")
)
ptr = rt["foo"]
arity = _determine_arity(rvsdg_expr)
return JitCallable.from_pointer(rt, ptr, arity)
Compiler Pipeline¶
The following is a simple compiler pipeline:
class BackendOutput(TypedDict):
llmod: object
@pipeline_frontend.extend
def pipeline_backend(
rvsdg_expr, pipeline_report=Report.Sink()
) -> BackendOutput:
with pipeline_report.nest("Backend", default_expanded=True) as report:
llmod = backend(rvsdg_expr)
report.append("LLVM", llmod)
return {"llmod": llmod}
class JITOutput(TypedDict):
jit_func: object
@pipeline_backend.extend
def compiler_pipeline(llmod, rvsdg_expr) -> JITOutput:
jit_func = jit_compile(llmod, rvsdg_expr)
return {"jit_func": jit_func}
Testing Framework¶
Define a testing framework to verify compiler correctness.
def run_test(fn, jt, args, *, verbose=False, equal=lambda x, y: x == y):
res = jt(*args)
got = fn(*args)
if verbose:
report = Report("Testing report", default_expanded=False)
report.append("Args", args)
report.append("JIT output", res)
report.append("Expected output", got)
report.display()
assert equal(res, got), (res, got)
return res
Complete Example¶
The following puts everything together. Running the frontend to generate RVSDG-IR. Then, emitting LLVM using the backend. Finally, JIT'ing it into executable code and verifying it.
if __name__ == "__main__":
def sum_ints(n):
c = 0
for i in range(n):
c += i
return c
report = Report("Compiler Pipeline", default_expanded=True)
jt = compiler_pipeline(fn=sum_ints, pipeline_report=report).jit_func
report.display()
run_test(sum_ints, jt, (12,), verbose=True)
Compiler Pipeline
Testing report
(12,)
66
66