Yep. That happened. It started out innocently enough. I wanted to make a simple parser for mathematical expressions (actually, closed-form expressions if you go by Wikipedia, but I’m being pedantic). Add, subtract, multiply, divide, exponent, and simple functions. What I ended up with was a complete functional programming language.
However, I won’t be going into the language in this post. That’s for next time. This post – part zero – is the backdrop. It all started as a simple goal…
The Seed that Started It All
There were three simple types of expressions I wanted to be able to parse:
a op b
( expr )
The brackets obviously would determine the precedence, at least on paper. The question was, how would the implementation know how to process these expressions? I decided almost immediately that I would use a stack machine to read postfix notation, mainly because Java bytecode operates the same way. I also decided to write the thing in Java, mainly because of its regex library (and certainly not for its conciseness!).
But, before I could even start coding, the first snag hit me: how in the world was I going to turn infix
(3 + 5) into postfix
(3 5 +)? Especially with complex expressions. How would you know that
(3 * 2 + 5 ** 7) is supposed to be turned into
(3 2 * 5 7 ** +)? As it turns out, there is a relatively simple algorithm one can use to turn infix into postfix called shunting yard.
Now, until this point, I had no idea what shunting yard was, or that it even existed. Parsing was just a magic black box that took in text and output instructions. But, after a few minutes of Wikipedia and Stack Overflow, I thought I had enough to get started.
The way shunting yard works is fairly simple. You have two stacks, one for output, and one for storing tokens. At first, you try to push a token on the storage stack. But, if that token doesn’t have a higher precedence than the token on top of the stack (like if
+ tried to be on top of
*), then tokens are popped off the storage stack and onto the output stack until your current token is of higher precedence. There’s a few complications with parentheses and associativity, but that’s the gist.
Of course, there is one thing I haven’t mentioned yet. If this is a functional language, then there must be functions. I decided that functions are actually prefix operators (!), taking an arbitrary number of arguments. But how does one parse this construct?
f(a, b, c, ...)
This bothered me for a while, until I realized I could hack it all together with one thing: precedence. Enter the comma operator: it does exactly nothing to its arguments and has the lowest precedence, meaning it can be used to separate expressions. Add in parentheses to group the comma’d expressions together, and you can slap a function name at the front, ready-made.
In fact, the same precedence-hacking trick can be used to make infix operators that take more than two inputs as well. For example, the infix reduce function (more on that in a later post!) is called like this:
seq reduce (id, func), with the reduce between its first and second arguments. Actually, it can also be called as
(seq, id) reduce func because the way shunting yard works, the operator can appear anywhere in between the arguments. Take that however you may.
By this point, I had a working expression parser I could call from the command line. Testing the program with arithmetic was fun, but then I had an idea. What would happen if functions could be used as values? And the rest, as the saying goes, is history.
Stay tuned for part one, where that picture at the top will be fully explained.