As a programmer, compilers have always seemed to me like a million-line black box, out-daunted only by writing an operating system. But hard challenges are the best challenges, so a while ago I set out to try and make one myself.
OK.
If you want to write a compiler, there are three main parts: the lexer, the parser, and the code generator. I've started this project in a variety of languages, including Java and C#, but my successful implementation is currently in JavaScript.
1) Lexing
The process of lexing, or lexical analysis, is actually very straightforward relative to the rest of this process. Consider the following code:
const hello = "Hello, " + "World!";
const sum = 4 + 5;
When lexing a piece of code, you must go through the entire source and convert the string into a collection of tokens. Tokens are simple structures that store information about a small sliver of the source code. For the lexer that I wrote, I use four main token types: Keyword, Word, String, and Symbol. So the code above might look something like this after lexing:
Keyword<"const">
Word<"hello">
Symbol<"=">
String<"Hello, ">
Symbol<"+">
String<"World!">
Symbol<";">
Keyword<"const">
Word<"sum">
Symbol<"=">
Word<"4">
Symbol<"+">
Word<"5">
Symbol<";">
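To make that concrete, here is a minimal lexer sketch in JavaScript. This is my own illustration of the idea, not Mantle's code:

```javascript
// Minimal lexer sketch (illustrative, not Mantle's implementation).
// Walks the source one character at a time and emits { type, value }
// tokens of the four kinds described above.
const KEYWORDS = new Set(["const"]);
const SYMBOLS = new Set(["=", "+", ";"]);

function lex(source) {
  const tokens = [];
  let i = 0;
  while (i < source.length) {
    const c = source[i];
    if (/\s/.test(c)) { i++; continue; }              // skip whitespace
    if (SYMBOLS.has(c)) {                             // single-character symbols
      tokens.push({ type: "Symbol", value: c });
      i++;
    } else if (c === '"') {                           // string literal
      let j = i + 1;
      while (j < source.length && source[j] !== '"') j++;
      tokens.push({ type: "String", value: source.slice(i + 1, j) });
      i = j + 1;                                      // skip the closing quote
    } else if (/[A-Za-z0-9_]/.test(c)) {              // word or keyword
      let j = i;
      while (j < source.length && /[A-Za-z0-9_]/.test(source[j])) j++;
      const word = source.slice(i, j);
      tokens.push({ type: KEYWORDS.has(word) ? "Keyword" : "Word", value: word });
      i = j;
    } else {
      throw new Error(`unexpected character: ${c}`);
    }
  }
  return tokens;
}

const tokens = lex('const sum = 4 + 5;');
```

Note that, matching the token list above, numbers come out as Word tokens here; a fuller lexer would likely give them their own type.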
If you've made it this far, then awesome!
My project, Mantle, makes this super easy to do through an abstract class you can extend, called mantle.lexer.Lexer. You simply define a list of keywords, symbols, and string delimiters, tell it whether or not to allow comments, and pass a function that defines whether a character can be used in a word. After that, creating the list above becomes as easy as calling Lexer.parse(), though moving on, you will almost never call parse() yourself.
More on mantle can be found at https://github.com/Nektro/mantle.js
2) Parsing
This is the hard part.
Parsing requires you to figure out patterns of tokens that can be compressed into a single node. This took a lot of trial and error to get right, and it is the main reason this project took so long.
For instance, for the code we had above, we might define the following rules:
Add <= String + String
Add <= Integer + Integer
AssignmentConst <= const Word = Add
StatementList <= AssignmentConst AssignmentConst
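One simple way to implement rules like these (a sketch of the general idea, not Mantle's actual algorithm) is to repeatedly scan the token list for a rule's right-hand side and collapse each match into a single node, until nothing more can be reduced. The semicolon handling and the rule encoding below are my own additions for illustration:

```javascript
// Pattern-reduction parser sketch. Each rule names a node type and the
// sequence of token/node shapes that collapse into it. "Type:value" pins
// a specific token value (e.g. the "+" symbol); a bare type matches any value.
// Integers lex as Words in the lexer described above.
const rules = [
  { name: "Add", match: ["String", "Symbol:+", "String"] },
  { name: "Add", match: ["Word", "Symbol:+", "Word"] },
  { name: "AssignmentConst",
    match: ["Keyword:const", "Word", "Symbol:=", "Add", "Symbol:;"] },
];

function matches(item, pattern) {
  const [type, value] = pattern.split(":");
  return item.type === type && (value === undefined || item.value === value);
}

function parse(tokens) {
  const nodes = tokens.slice();
  let reduced = true;
  while (reduced) {                 // keep passing over the list until stable
    reduced = false;
    for (const rule of rules) {
      for (let i = 0; i + rule.match.length <= nodes.length; i++) {
        const slice = nodes.slice(i, i + rule.match.length);
        if (slice.every((n, k) => matches(n, rule.match[k]))) {
          // collapse the matched span into one node
          nodes.splice(i, rule.match.length, { type: rule.name, children: slice });
          reduced = true;
        }
      }
    }
  }
  return { type: "StatementList", children: nodes };  // whatever remains
}
```

Running `parse` on the tokens for `const sum = 4 + 5;` first collapses `4 + 5` into an Add node, then the whole statement into an AssignmentConst, leaving a StatementList with one child.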
The more complex the language, the more complex the rules get, which I discovered very soon.
The JSON example for mantle.parser.Parser can be found at https://github.com/Nektro/mantle.js/blob/master/langs/mantle-json.js
3) Code generation
This is the process of going through your final condensed node, also called an Abstract Syntax Tree (AST), and toString()-ing each node until you get your new output.
Note: optimizing output for higher-level languages requires a lot more work than calling toString(), but that is way beyond the scope of this post.
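As a sketch of what that toString()-ing looks like in practice (again my own illustration, using the node shapes from the parsing example above): each node type knows how to print itself, and the output falls out of a recursive walk.

```javascript
// Code generation sketch: recursively turn AST nodes back into source text.
// Node shapes assumed here: { type, children } for composite nodes and
// { type, value } for leaf tokens.
function emit(node) {
  switch (node.type) {
    case "StatementList":       // one statement per line
      return node.children.map(emit).join("\n");
    case "AssignmentConst":     // children: [const, name, =, value, ;]
      return `const ${emit(node.children[1])} = ${emit(node.children[3])};`;
    case "Add":                 // children: [left, +, right]
      return `${emit(node.children[0])} + ${emit(node.children[2])}`;
    default:                    // leaf tokens print their value
      return node.type === "String" ? `"${node.value}"` : node.value;
  }
}
```

For a same-language round trip this just reproduces the input, but the interesting part is that each case could emit a different target language instead.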
4) Corgi - my new HTML Preprocessor
At this point I was ecstatic: I had successfully made a JSON parser. But I wanted to make something a little more complicated, so I moved on to HTML. The thing is, though, HTML isn't very well formed, so I thought I'd make a version that's a little easier for Mantle to parse. And that's how I came up with Corgi.
Corgi syntax is inspired by Pug, but it isn't tab-based, so you could theoretically compress a file onto one line. I love this because forcing the tab structure made using cosmetic HTML tags in Pug really awkward. So Corgi makes HTML great for both structure and style.
An example Corgi document would look like:
doctype html
html(
head(
title("Corgi Example")
meta[charset="UTF-8"]
meta[name="viewport",content="width=device-width,initial-scale=1"]
)
body(
h1("Corgi Example")
p("This is an example HTML document written in "a[href="https://github.com/corgi-lang/corgi"]("Corgi")".")
p("Follow Nektro on Twitter @Nektro")
)
)
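For comparison, compiling that document should produce HTML along these lines (my approximation of the output, assuming a direct one-to-one translation):

```html
<!DOCTYPE html>
<html>
  <head>
    <title>Corgi Example</title>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width,initial-scale=1">
  </head>
  <body>
    <h1>Corgi Example</h1>
    <p>This is an example HTML document written in <a href="https://github.com/corgi-lang/corgi">Corgi</a>.</p>
    <p>Follow Nektro on Twitter @Nektro</p>
  </body>
</html>
```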
Closing
Making compilers is hard, but it has definitely been fun, and I hope this helps demystify them some.
And now I also have an HTML preprocessor I'm going to use in as many projects as it makes sense to.