Compiler 101

Introduction

This page intends to provide a high-level overview of how the OpenDream compiler parses DM and generates bytecode. The goal is to give new contributors a general idea of where things are and the basic order of execution.

Since OpenDream is still heavily in development, this page does not intend to get too deep into the specific C# methods and variables being utilized as they are subject to change.

This page expects the reader to have an existing comprehension of DM at an intermediate level. It is not a DM tutorial and will not explain DM concepts such as #define macros.

For a better and more detailed look at compilers, read the free book that OpenDream was initially based on: Crafting Interpreters

Definitions

compiler : Takes a source code file and turns it into bytecode instructions that will be executed at runtime by the interpreter. (Note: Compilers come in many shapes and flavors; this definition refers specifically to OpenDream as a bytecode compiler).
interpreter : Runs a compiled program by reading the bytecode file and executing its opcode instructions. (Note: Again, this definition is specific to OpenDream and interpreters come in many flavors).
opcode : A bytecode instruction written by the compiler and executed by the interpreter (e.g. DreamProcOpcode.Multiply).
token : A collection of characters and/or punctuation that convey meaningful info (e.g. the text else becomes a single Token of type TokenType.DM_Else).
lexer : Takes raw text and converts it to tokens.
parser : Converts a series of tokens into an abstract syntax tree of expressions and statements.
expression : A piece of code that evaluates to a value (e.g. 5 * 2 is a binary multiplication expression that evaluates to the unary constant integer expression 10).
statement : Typically a var declaration or something that controls flow (e.g. if() or switch()).
abstract syntax tree (AST) : A tree of expressions that constitute the grammar of the language. For example, isnull(double_me) ? "var foo is null" : double_me * 2 is a ternary expression with branches that include the unary constant string expression "var foo is null" and the binary multiplication expression double_me * 2. The multiplication expression (*) has two unary branches: the double_me identifier and the 2 constant integer. All of these branches, linked together, form the abstract syntax tree (AST). Each of these expressions represent a node on the tree at each branching point. For a more in-depth explainer with images, see Crafting Interpreters chapter 5.
constant folding : A basic optimization that evaluates expressions at compile-time when their values are constant. The contrived example var/const/foo = 5; var/bar = foo * 2 can be optimized by replacing it with the equivalent var/const/foo = 5; var/bar = 10.

Step 0: Setup

First, the user-provided compiler options are parsed and any relevant warnings/errors are emitted. The args for .dme or .dm file(s) to be compiled are passed to the preprocessor.

Step 1: Preprocessing

Preprocessing is the first step of compiling DM code. DMPreprocessor handles any #define macros that were provided as compiler options then begins preprocessing files specified in #include statements. One file at a time, DMPreprocessorLexer takes the raw text of the file and roughly tokenizes it just enough for the preprocessor's needs. These tokens are imprecise and some simply store text that are later lexed into more-specific tokens (see Step 2).

DMPreprocessor then enumerates these tokens and handles any directives such as #if. This involves taking the list of tokens on that line of code and passing them to DMPreprocessorParser for parsing and evaluation. DMPreprocessorParser exists separately from the main DMParser due to the preprocessor's different behaviors and simplicity.

Once all included files have had their directives lexed and parsed by DMPreprocessor, the compiler moves on to a thorough lexing of the entire code.

Step 2: Lexing

DMLexer takes a list of tokens from the preprocessor; these types usually have a TokenType.DM_Preproc_ prefix. The lexer essentially iterates over all of the preprocessor tokens with a giant switch() statement. For example, a TokenType.DM_Preproc_Punctuator token containing the text == creates a new TokenType.DM_EqualsEquals token. This entire phase is straightforward and simple.

Step 3: Parsing

DMParser parses the tokens to build the abstract syntax tree (AST) for DM (DMAST) as a series of branching DMASTNode nodes. There are quite a large number of nodes covering every branch of the tree from the DMAST (DMASTFile) down to proc definitions (DMASTProcDefinition) to expressions (e.g. DMASTGreaterThanOrEqual) to constant values (e.g. DMASTConstantInteger).

The top-level DMASTFile node is broken down into branches of inner blocks which can continue branching further. Here is an example:

/proc/number_doubler(var/input)
    if(!isnum(input))
        return null
    else
        input *= 2
        return input

The proc declaration constitutes a single inner block. It contains two blocks of its own: The if() statement's body and the else statement's body. The if() and else blocks are broken down further into nodes for the return statements and multiplication expression.

To recap, the parser uses tokens to build the DMAST which is a tree of nodes from the highest level (a file of code) down to blocks of code down to individual constants, statements, and expressions.

Step 4 (Deprecated): `DMASTFolder`

DMASTFolder is the legacy implementation of constant folding. It operates at the DMAST level. It is in the process of being phased out and no new code should be added to it. Constant folding is now handled primarily by the peephole optimizer (see later Bytecode Optimization step).

Step 5: Code Tree

DMCodeTreeBuilder uses the DMAST to perform several important steps:

Step 5.1: Build an equivalent to BYOND's object tree of types, vars, and procs.
Step 5.2: For each type's DMObject, build its initialization proc. This initializes any non-constant variables on the type when the object is created.
Step 5.3: Compile all procs. Each DMProc creates a DMProcBuilder which processes the proc's arguments and all of the DMASTProcStatements in the proc body. The end result of proc compilation is writing IAnnotatedBytecode which consists of DreamProcOpcode instructions, data, and other info used for bytecode optimization.

At this point, DM code compilation is essentially finished. The remaining steps cover .dmm map handling, bytecode optimization, JSON emission, etc.

Step 6: Map Conversion

The compiler converts .dmm map files to JSON for inclusion in the final bytecode file. This includes its own preprocessing, lexing, parsing, and JSON conversion. The individual steps are mostly comparable to their DM equivalent, just being performed again on maps.

Step 7: Bytecode Optimization

The bytecode optimizer takes in a list of the IAnnotatedBytecode and performs several optimization passes:

Code labels that are not referenced are removed.
Code labels that flow directly into other labels are merged and their references are updated.
Jump instructions that jump to the next immediate position in the bytecode (effectively a no-op) are removed.

The bytecode undergoes further optimization passes in the peephole optimizer. The peephole optimizer looks for certain patterns of instructions in the bytecode and attempts to simplify/merge them to reduce instruction overhead at runtime. For example, a proc that ends with return 5 would have the opcodes PushFloat 5 and Return. These are merged into a single ReturnFloat 5 instruction. These simple optimizations reduce the number of instructions in the bytecode file by several tens of thousands. The peephole optimizer also handles constant folding.

Here are the peephole optimizer passes:

PeepholeOptimization : Most of the peephole optimizations and constant folding.
BytecodeCompactor : Most of the N variants of opcodes. For example, several PushFloat opcodes in a sequence can be merged into one PushNFloats instruction with a list of N floats.
ListCompactor : Merges PushN opcodes with CreateList opcodes to reduce list creation overhead.

All of these optimizations are performed in-place on the list of annotated bytecode; we don't need to convert anything nor construct anything new.

Step 8: JSON Representation

The results of compilation are saved to a single JSON bytecode file. To do this, each compiled DMObject and DMProc need to be converted to a JSON representation; these are DreamTypeJson and ProcDefinitionJson, respectively. The JSON also stores the converted maps, interface files, miscellaneous metadata, and other things that won't be discussed here.

Finally, if there were no compilation errors along the way, the compiler will be successfully finished and we will be left with the compiled JSON file that can be executed by the interpreter.

Summary

In conclusion, compilation can be summarized as the following steps:

Preprocessor lexing to make rough imprecise tokens and parsing to handle macros
DM lexing to produce precise tokens
DM parsing to build the DMAST and create statements and expressions
(Deprecated) DMASTFolder
Code tree construction (BYOND object tree, init procs, proc compilation)
Map conversion
Bytecode optimization
JSON representation

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Compiler 101

Introduction

Definitions

Step 0: Setup

Step 1: Preprocessing

Step 2: Lexing

Step 3: Parsing

Step 4 (Deprecated): `DMASTFolder`

Step 5: Code Tree

Step 6: Map Conversion

Step 7: Bytecode Optimization

Step 8: JSON Representation

Summary

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally

Compiler 101

Introduction

Definitions

Step 0: Setup

Step 1: Preprocessing

Step 2: Lexing

Step 3: Parsing

Step 4 (Deprecated): DMASTFolder

Step 5: Code Tree

Step 6: Map Conversion

Step 7: Bytecode Optimization

Step 8: JSON Representation

Summary

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally

Step 4 (Deprecated): `DMASTFolder`