# Minicompiler: Lexing

I've always wanted to make my own compiler. Compilers are an integral part of my day to day job and I use the fruits of them constantly. A while ago while I was browsing through the TempleOS source code I found MiniCompiler.HC in the `::/Demos/Lectures` folder and I was a bit blown away. It implements a two phase compiler from simple math expressions to AMD64 bytecode (complete with bit-banging it to an array that the code later jumps to) and has a lot to teach about how compilers work. For those of you that don't have a TempleOS VM handy, here is a video of MiniCompiler.HC in action:

You put in a math expression, the compiler builds it and then spits out a bunch of assembly and runs it to return the result. In this series we are going to be creating an implementation of this compiler that targets WebAssembly. This compiler will be written in Rust and will use only the standard library for everything but the final bytecode compilation and execution phase. There is a lot going on here, so I expect this to be at least a three part series. The source code will be in Xe/minicompiler in case you want to read it in detail. Follow along and let's learn some Rust on the way! Mara

Compilers for languages like C are built on top of the fundamentals here, but they are much more complicated.

## Description of the Language

This language uses normal infix math expressions on whole numbers. Here are a few examples:

• `2 + 2`
• `420 * 69`
• `(34 + 23) / 38 - 42`
• `(((34 + 21) / 5) - 12) * 348`

Ideally we should be able to nest the parentheses as deep as we want without any issues.

Looking at these values we can notice a few patterns that will make parsing this a lot easier:

• There seems to be only 4 major parts to this language:
• numbers
• math operators
• open parentheses
• close parentheses
• All of the math operators act identically and take two arguments
• Each program is one line long and ends at the end of the line

Let's turn this description into Rust code:

## Bringing in Rust

Make a new project called `minicompiler` with a command that looks something like this:

``````\$ cargo new minicompiler
``````

This will create a folder called `minicompiler` and a file called `src/main.rs`. Open that file in your editor and copy the following into it:

``````// src/main.rs

/// Mathematical operations that our compiler can do.
#[derive(Debug, Eq, PartialEq)]
enum Op {
Mul,
Div,
Sub,
}

/// All of the possible tokens for the compiler, this limits the compiler
/// to simple math expressions.
#[derive(Debug, Eq, PartialEq)]
enum Token {
EOF,
Number(i32),
Operation(Op),
LeftParen,
RightParen,
}
`````` Mara

In compilers, "tokens" refer to the individual parts of the language you are working with. In this case every token represents every possible part of a program.

And then let's start a function that can turn a program string into a bunch of tokens:

``````// src/main.rs

fn lex(input: &str) -> Vec<Token> {
todo!("implement this");
}
`````` Mara

Wait, what do you do about bad input such as things that are not math expressions? Shouldn't this function be able to fail?

You're right! Let's make a little error type that represents bad input. For creativity's sake let's call it `BadInput`:

``````// src/main.rs

use std::error::Error;
use std::fmt;

/// The error that gets returned on bad input. This only tells the user that it's
/// wrong because debug information is out of scope here. Sorry.
#[derive(Debug, Eq, PartialEq)]

// Errors need to be displayable.
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
}
}

// The default Error implementation will do here.
``````

And then let's adjust the type of `lex()` to compensate for this:

``````// src/main.rs

fn lex(input: &str) -> Result<Vec<Token>, BadInput> {
todo!("implement this");
}
``````

So now that we have the function type we want, let's start implementing `lex()` by setting up the result and a loop over the characters in the input string:

``````// src/main.rs

fn lex(input: &str) -> Result<Vec<Token>, BadInput> {
let mut result: Vec<Token> = Vec::new();

for character in input.chars() {
todo!("implement this");
}

Ok(result)
}
``````

Looking at the examples from earlier we can start writing some boilerplate to turn characters into tokens:

``````// src/main.rs

// ...

for character in input.chars() {
match character {
// Skip whitespace
' ' => continue,

// Ending characters
';' | '\n' => {
result.push(Token::EOF);
break;
}

// Math operations
'*' => result.push(Token::Operation(Op::Mul)),
'/' => result.push(Token::Operation(Op::Div)),
'-' => result.push(Token::Operation(Op::Sub)),

// Parentheses
'(' => result.push(Token::LeftParen),
')' => result.push(Token::RightParen),

// Numbers
'0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' => {
todo!("implement number parsing")
}

// Everything else is bad input
}
}

// ...
`````` Mara

Ugh, you're writing `Token::` and `Op::` a lot. Is there a way to simplify that?

Yes! enum variants can be shortened to their names with a `use` statement like this:

``````// src/main.rs

// ...

use Op::*;
use Token::*;

match character {
// ...

// Math operations
'*' => result.push(Operation(Mul)),
'/' => result.push(Operation(Div)),
'-' => result.push(Operation(Sub)),

// Parentheses
'(' => result.push(LeftParen),
')' => result.push(RightParen),

// ...
}

// ...
``````

Which looks a lot better. Mara

You can use the `use` statement just about anywhere in your program. However to keep things flowing nicer, the `use` statement is right next to where it is needed in these examples.

Now we can get into the fun that is parsing numbers. When he wrote MiniCompiler, Terry Davis used an approach that is something like this (spacing added for readability):

``````case '0'...'9':
i = 0;
do {
i = i * 10 + *src - '0';
src++;
} while ('0' <= *src <= '9');
*num=i;
``````

This sets an intermediate variable `i` to 0 and then consumes characters from the input string as long as they are between `'0'` and `'9'`. As a neat side effect of the numbers being input in base 10, you can conceptualize `40` as `(4 * 10) + 2`. So it multiplies the old digit by 10 and then adds the new digit to the resulting number. Our setup doesn't let us get that fancy as easily, however we can emulate it with a bit of stack manipulation according to these rules:

• If `result` is empty, push this number to result and continue lexing the program
• Pop the last item in `result` and save it as `last`
• If `last` is a number, multiply that number by 10 and add the current number to it
• Otherwise push the node back into `result` and push the current number to `result` as well

Translating these rules to Rust, we get this:

``````// src/main.rs

// ...

// Numbers
'0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' => {
let num: i32 = (character as u8 - '0' as u8) as i32;
if result.len() == 0 {
result.push(Number(num));
continue;
}

let last = result.pop().unwrap();

match last {
Number(i) => {
result.push(Number((i * 10) + num));
}
_ => {
result.push(last);
result.push(Number(num));
}
}
}

// ...
`````` Mara

This is not the most robust number parsing code in the world, however it will suffice for now. Extra credit if you can identify the edge cases!

This should cover the tokens for the language. Let's write some tests to be sure everything is working the way we think it is!

## Testing

Rust has a robust testing framework built into the standard library. We can use it here to make sure we are generating tokens correctly. Let's add the following to the bottom of `main.rs`:

``````#[cfg(test)] // tells the compiler to only build this code when tests are being run
mod tests {
use super::{Op::*, Token::*, *};

// registers the following function as a test function
#[test]
fn basic_lexing() {
assert!(lex("420 + 69").is_ok());
assert!(lex("tacos are tasty").is_err());

assert_eq!(
lex("420 + 69"),
);
assert_eq!(
lex("(30 + 560) / 4"),
Ok(vec![
LeftParen,
Number(30),
Number(560),
RightParen,
Operation(Div),
Number(4)
])
);
}
}
``````

This test can and probably should be expanded on, but when we run `cargo test`:

``````\$ cargo test

Finished test [unoptimized + debuginfo] target(s) in 0.22s

running 1 test
test tests::basic_lexing ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
``````

And hey presto! We verified that all of the parsing is working correctly. Those test cases should be sufficient to cover all of the functionality of the language.

This is it for part 1. We covered a lot today. Next time we are going to run a validation pass on the program, convert the infix expressions to reverse polish notation and then also get started on compiling that to WebAssembly. This has been fun so far and I hope you were able to learn from it.

Special thanks to the following people for reviewing this post:

• Steven Weeks
• sirpros
• Leonora Tindall
• Chetan Conikee
• Pablo
• boopstrap
• ash2x3

This article was posted on M10 29 2020. Facts and circumstances may have changed since publication. Please contact me before jumping to conclusions if something seems wrong or unclear.

Series: rust

Tags: `rust` `templeos` `compiler`

The art for Mara was drawn by Selicre.