Join the Compiler Creation Club

So you want to write your own programming language?

Welcome to the world of compilers. Writing your own compiler can be a very rewarding process and in this article we'll be covering three areas at a fairly brusque pace:

  • go over a little bit of language and compiler theory (§1)
  • implement a functional language grammar into a parser (§2)
  • create a command-line compiler (§3)

Be warned, this article is not the be-all and end-all of compiler theory, in fact it leaves a lot of subjects untouched and half-explained. The aim is to get you into to futzing around with parsers as quick as possible and from there on you can expand your knowledge.

§1. Theory

In this section we're going to be going over a hyperspeed crash-course in compiler and language theory. We're going to be covering things in the broadest strokes to keep things digestible.

Syntax & Semantics

Languages are defined by an syntax — basically the rules or form of the language. Your compiler takes the plaintext source of the language and validates it against the predefined rules of the language.

In computer science, the syntax of a computer language is the set of rules that defines the combinations of symbols that are considered to be a correctly structured document or fragment in that language.

Syntax, Wikipedia

If syntax is the form, semantics are the meaning of the symbols.

Semantics describes the processes a computer follows when executing a program in that specific language.

Semantics, Wikipedia

Say you want to make a language which helps you solve arbitrary algebraic statements — let's define some rules for the statement

a = b * 2 + c  
  1. the left hand side must always be a variable
  2. multiplicative parts of the equation must be evaluated before the additive parts

If we had written a mathematically equivalent statement like

a - c = b * 2  

this would have violated our first rule — also known as a syntax error. Syntactically they are different equations but semantically they are the same equation.

Statements & Expressions

Every operation starts out with a Program which is broken down into a series of statements. Statements do not return results and are composed of expressions. Here is an example of a statement:

var foo;  

In the above there is no computation, only the creation of a variable with a value. If we were to modify it to:

bar = var foo;  

this would not compile because var foo is a statement and thus does not return a value to the left hand side of the equation. On the other hand, expressions can be just about anything and return some kind of output:

x = func(3);  

In the above example we have the function func being invoked with a parameter of 3 and returning the result into x. func(3) is an expression because it does something and returns a value to the left.

An expression in a programming language is a combination of explicit values, constants, variables, operators, and functions that are interpreted according to the particular rules of precedence and of association for a particular programming language, which computes and then produces (returns, in a stateful environment) another value.

Expressions, Wikipedia

So you can see, just about anything can be an expression if it returns some kind of value.

Compilers

A compiler at it's very simplest form takes an input language and returns another language as the output. This output can be anything — JavaScript, Bytecode, Machine Code etc, but it essentially just transforms the input into an acceptable output. To break it down a bit further, you can split the compiler into three modules: the lexer, the parser and the generator.

A compiler is likely to perform many or all of the following operations: lexical analysis, preprocessing, parsing, semantic analysis (Syntax-directed translation), code generation, and code optimization.

Compiler, Wikipedia

The purpose of the lexer is fairly simple: take the raw input (source) and transform it into a series of tokens that the parser can understand. The parser transforms this sequence of tokens into a tree-like structure with which you can transform. Finally the generator takes the tree you have built and turns it into the language you want your compiler to emit.

Syntax Trees

A tree can be a great way of semantically representing what your input program is doing by decoupling it from its syntax. For example from the algebraic equation we used before

a = b * 2 + c  

we can parse it into the following representation:

We can see all our constituent elements, also known as terminals or terminal symbols, are formed into a tree-like structure. In the above "assign", "multiply" and "add" are all expressions, the latter two being components of the former. They are also non-terminal symbols: symbols made of other terminal and non-terminal symbols.

What we can also do is traverse the tree and perform operations on it. For example, you could transform the tree into a few statements:

  1. variable b is multiplied by 2
  2. the result of (1) is added to c
  3. the result of (2) is stored into a'

In JavaScript, the equation is already valid syntax so just to complete the example we will use some ia32 assembly code instead:

MOV EAX,[b]        ; EAX = b  
MUL EAX,2          ; EAX = EAX * 2  
MOV EBX,[c]        ; EBX = c  
ADD EBX,EAX        ; EBX = EBX + EAX  
MOV [a],EBX        ; a = EBX  

PEGs

A Parsing Expression Grammar (PEG) is one way we can generate a parser. In the coming section we will be using a PEG to both tokenize (lexer) and parse our source langugage into a syntax tree.

PEG.js is a simple parser generator for JavaScript that produces fast parsers with excellent error reporting. You can use it to process complex data or computer languages and build transformers, interpreters, compilers and other tools easily.

PEG.js Project Page

A PEG is basically a set of rules that match or reject tokens and can emit tree nodes or modify parser state.

§2. Language

In this section will designing a simple language that compiles to JavaScript using Node.js, PEG.js, ast-types and Escodegen through Hyperglot. The language is distinctly going to have a functional, lispy syntax that consists of assignments, functions, lists and literals. For other features like control statements we will rely on native JavaScript functions to do our dirty work.

Setting Up

Before you start, please read the documentation for PEG.js as it will help when describing the grammar of the language. You will also need a release of Hyperglot. Download the latest Hyperglot.nw release and the appropriate node-webkit binary for your operating system and you're good to go. When you create a new project, your input syntax will be:

Program = stmts:Statement* {  
    return ast.program(stmts);
}

Statement =  

We're using the SpiderMonkey Parser API and all programs start with a program node containing a series of statements. We will also need a few terminal symbol definitions which you can insert down the bottom of your parser definition:

 digit = [0-9]
lparen = '(' ws  
rparen = ')' ws  
lbrace = '{' ws  
rbrace = '}' ws  
lsqbct = '[' ws  
rsqbct = ']' ws  
 colon = ':' ws

ws = wschar*  
wschar = [ \r\n\t]  

Primitives

There are a few kinds of primitives we'd like to support:

  • Identifiers (variable names)
  • Booleans
  • Numbers
  • Strings
  • Null

Identifiers

Identifiers will be any combination of uppercase and lowercase characters — adding support for valid JavaScript identifiers will be left as an exercise.

Identifier = chars:[A-Za-z]+ ws {  
    return ast.identifier(chars.join(''));
}

Numbers

Numbers can be one of two forms in our language: Integer types and floating-points (aka Float) types:

Float = a:digit+ '.' b:digit+  ws {  
    var num = parseFloat(a.join('') + '.' + b.join(''));
    return ast.literal(num);
}

Integer = digits:digit+ ws {  
    var num = parseInt(digits.join(''));
    return ast.literal(num);
}

To keep things simple, we will put both these forms under the same category.

Number  
    = Float
    / Integer

Reader Challenge: how would you modify the above grammars to support exponential notation, or integers in another base (e.g. hexadecimal)?

Strings

Strings are pretty simple — they're going to consist of a double quote and any number of any character terminated with another double quote:

String = '"' chars:[^"]* '"' {  
    return ast.literal(chars.join(''));
}

Booleans

Boolean literal values are darn simple and are two of three keywords in our language:

Boolean  
    = Yes
    / No

Yes = "yes" ws {  
    return ast.literal(true);
}

No = "no" ws {  
    return ast.literal(false);
}

Null

And here's the final keyword: nil!

Nil = "nil" ws {  
    return ast.literal(null);
}

Assignment

Assignment can be thought of as both statement and an expression:

x = 10;            // just a statement  
x = (y = 10);      // "y = 10" is an expression  

In our language we will say:

(x 10)
(x (y 10))

This translates to the following PEG rules:

AssignStmt = expr:AssignExpr {  
    return ast.expressionStatement(expr);
}

AssignExpr = lparen l:Identifier r:Expression rparen {  
    return ast.assignmentExpression('=', l, r);
}

But wait a second, what is our Expression rule there?

Expression  
    = AssignExpr
    / Identifier
    / String
    / Number
    / Boolean
    / Nil

Remember from §2 when we said just about anything can be an expression? There you go — we just made about everything an expression. If you update your Statement rule with:

Statement  
    = AssignStmt

you can go to the Input pane of Hyperglot and start writing valid language code like:

(int 10)
(str "hello")
(float 12.34)
(bool yes)
(x (y (z nil)))

You will get a successfully compiled message. You can go to the Output pane of Hyperglot and examine the syntax tree you generated along with the generated JavaScript. Pretty neat huh?

Lists

Lists are pretty simple and are yet another kind of expression:

(list [1 2 3])

which translates pretty directly into:

List = lsqbct items:Expression* rsqbct {  
    return ast.arrayExpression(items);
}

Rememeber to add List to your Expression rule!

Functions

We've been saving the best for last. Functions in our language are actually two separate expressions: functions as a value to be assigned to variables, and invocation expressions whereby we invoke the function into operation.

Definition

We're going to be using the curly braces to define our function scope. Additionally since we don't really want a return keyword or deal with multiple input parameters, we're going to make our functions take in one argument and return one result.

FunctionExpr = lbrace stmts:Statement* rbrace {  
    var arg = ast.identifier("$");
    stmts.push(ast.returnStatement(arg));
    var block = ast.blockStatement(stmts);
    return ast.functionExpression(null, [arg], block);
}

So to break this rule down

  • we're specifying that our function takes a variable called $ in
  • we're appending a statement to the end of the function scope that returns the same variable.
  • wrapping all the statements up in a block

Reader Question: what happens when a function is invoked without a parameter? How would you fix the definition above?

Invocation

To invoke a function, we're going to use the following syntax:

(func: arg)

where func is an identifier or function expression and arg is an optional parameter to the function. If you notice, this kind of looks like both a statement and an expression, and you'd be right!

Callable  
    = FunctionExpr
    / Identifier

CallStmt = expr:CallExpr {  
    return ast.expressionStatment(expr);
}

CallExpr = lparen l:Callable colon arg:Expression? rparen {  
    return (arg == "")
        ? ast.callExpression(a, [])
        : ast.callExpression(a, [arg]);
}

Again, remember to add these to your Expression and Statement rules.

Wait a second …

There's a problem: in our original plans our only valid identifiers were A-z strings. How do we use the $ symbol in our function scope? The answer is just to add another rule and swap out existing uses of Identifier with Symbol:

Symbol  
    = Identifier
    / Cash

Cash = '$' ws {  
    return ast.identifier('$');
}

It's done! You'll see that this input:

(increment {
    ($ (add: 1))
})

yields the following JavaScript:

increment = function ($) {  
    $ = add(1);
    return $;
};

Reader Question: $ can be used in the global scope outside of functions. How would you rectify this?

The Final Syntax

Here it is, the full syntax of our programming language:

Program = stmts:Statement* {  
    return ast.program(stmts);
}

Statement  
    = CallStmt
    / AssignStmt

Expression  
    = FunctionExpr
    / AssignExpr
    / CallExpr
    / ListExpr
    / Symbol
    / String
    / Number
    / Boolean
    / Nil

Callable  
    = FunctionExpr
    / Symbol

AssignStmt = expr:AssignExpr {  
    return ast.expressionStatement(expr);
}

CallStmt = expr:CallExpr {  
    return ast.expressionStatement(expr);
}


AssignExpr = lparen l:Symbol r:Expression rparen {  
    return ast.assignmentExpression('=', l, r);
}

CallExpr = lparen l:Callable colon arg:Expression? rparen {  
    return (arg == "")
        ? ast.callExpression(l, [])
        : ast.callExpression(l, [arg]);
}

FunctionExpr = lbrace stmts:Statement* rbrace {  
    var arg = ast.identifier("$");
    stmts.push(ast.returnStatement(arg));
    var block = ast.blockStatement(stmts);
    return ast.functionExpression(null, [arg], block);
}


ListExpr = lsqbct items:Expression* rsqbct {  
    return ast.arrayExpression(items);
}

Symbol  
    = Identifier
    / Cash

Identifier = chars:[A-Za-z]+ ws {  
    return ast.identifier(chars.join(''));
}

Cash = '$' ws {  
    return ast.identifier('$');
}

String = '"' chars:[^"]* '"' {  
    return ast.literal(chars.join(''));
}

Nil = "nil" ws {  
    return ast.literal(null);
}

Boolean  
    = Yes
    / No

Yes = "yes" ws {  
    return ast.literal(true);
}

No = "no" ws {  
    return ast.literal(false);
}

Number  
    = Float
    / Integer

Float = a:digit+ '.' b:digit+  ws {  
    var num = parseFloat(a.join('') + '.' + b.join(''));
    return ast.literal(num);
}

Integer = digits:digit+ ws {  
    var num = parseInt(digits.join(''));
    return ast.literal(num);
}


 digit = [0-9]
lparen = '(' ws  
rparen = ')' ws  
lbrace = '{' ws  
rbrace = '}' ws  
lsqbct = '[' ws  
rsqbct = ']' ws  
 colon = ':' ws

ws = wschar*  
wschar = [ \r\n\t]  

§3. The Compiler

So in §2 we defined the rules for our language. All that's left to do is hand-craft a compiler that takes our source language, parses it, and transforms the generated tree into JavaScript and then all the other features of compilers like opening saving files, command line options etc.

Wait a sec, Hyperglot can do it for us! If you open the Export pane of the program, you will find instructions for generating a compiler:

Assuming your grammar is free of errors, an archive file will be generated in your user home directory. Once you have installed the requisite libraries using npm install, all you need to do is supply a source file and feed it into the compiler.

Congratulations, you've joined the "I've written my own programming language" club. Here's a few things you can think and do about to continue your journey:

  • implement native javascript functions that perform selection, iteration and other control statements
  • implement foreign function interfaces to JavaScript
  • clean the slate and start writing another python, ruby or lua
  • examine the source of the generated compiler — how would you optimize it?

Thanks for reading!