Programming language implementation is a well-defined sequence of very
specific stages. It may include parsing, followed by a chain of AST rewrites
interleaved with analysis passes over that AST, ending with code
generation, which may itself be just another rewrite pass. It does not matter
whether a language is domain-specific, a small macro slightly extending the
functionality of some host language, or a large standalone language - all follow
the same simple pattern.
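This staged structure can be sketched as a plain function pipeline; the names and stages below are illustrative only, not part of PFront:

```python
# A compiler as a chain of small passes: parse, rewrite, emit.
# Purely illustrative; none of these names come from PFront.
def parse(src):
    # source text -> trivial AST
    return ("num", int(src))

def constant_fold(ast):
    # an AST -> AST rewrite pass (identity here, for brevity)
    return ast

def emit(ast):
    # AST -> target code; code generation is just another rewrite
    _tag, v = ast
    return f"PUSH {v}"

def compile_(src, passes=(constant_fold,)):
    ast = parse(src)
    for p in passes:          # every stage is just another pass over the AST
        ast = p(ast)
    return emit(ast)
```

Real compilers differ only in how long the chain of passes is, not in the shape of the pipeline.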
To simplify this process as much as possible, it makes sense to use
domain-specific languages for each distinct stage. Parsing DSLs are ubiquitous,
but all the subsequent transformations are often written in an ad hoc way,
resulting in unnecessarily complicated code.
Here we're presenting a small experimental language, PFront, demonstrating this
concept of pervasive use of DSLs for constructing all stages of a compiler (or
a language extension, a macro, an eDSL, etc.). It consists of the following eDSLs
embedded into one simple, extensible host language:
A parsing eDSL. Nothing really new here, besides it being very tightly
integrated and supporting an extensible syntax (thanks to PEG) and
parser inheritance. A lot of useful parsing components are already defined and
can be reused.
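The essence of a PEG parser - ordered choice plus memoization (packrat) - fits in a few lines of Python; this sketch is illustrative and has nothing to do with PFront's actual parser machinery:

```python
from functools import lru_cache

# A toy packrat parser for the grammar:
#   expr <- digits "+" expr / digits
# '/' is PEG's ordered choice; lru_cache provides the packrat memoization.
def make_parser(src):
    @lru_cache(maxsize=None)
    def digits(pos):
        end = pos
        while end < len(src) and src[end].isdigit():
            end += 1
        return (src[pos:end], end) if end > pos else None

    @lru_cache(maxsize=None)
    def expr(pos):
        left = digits(pos)
        if left and left[1] < len(src) and src[left[1]] == "+":
            rest = expr(left[1] + 1)
            if rest:                 # first alternative matched
                return (("plus", left[0], rest[0]), rest[1])
        return left                  # fall back to the second alternative
    return expr
```

Because every `(rule, position)` result is memoized, backtracking over the ordered choices stays linear in the input size.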
A domain-specific type system for defining ASTs and inferring visitors over
them. This is the key component, allowing compiler passes to be written in a mostly
declarative way. E.g., if you want to walk over your AST and collect all the
identifiers, taking lexical scope into account, you only have to mention the
identifier nodes and the nodes introducing new scope (which affect the tree
traversal order). There is no need to write a huge recursive function going through
all possible nodes in this AST, and no need to write a generic visitor for this
AST (remember, the order is not necessarily trivial).
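To see what is being saved, here is the identifier-collection example as a hand-rolled Python pass over a tuple-encoded AST. Even in this sketch only 'var' and 'let' get dedicated code; the default traversal at the bottom is exactly the boilerplate PFront infers for you:

```python
# Collect free identifiers from a tuple-encoded AST, respecting lexical scope.
# Only the identifier node ('var') and the scope-introducing node ('let')
# need dedicated handling; everything else falls through to a generic walk.
def collect_free_vars(node, bound=frozenset()):
    tag = node[0]
    if tag == "var":                        # identifier node
        _, name = node
        return set() if name in bound else {name}
    if tag == "let":                        # introduces a new scope for 'body'
        _, name, value, body = node
        return (collect_free_vars(value, bound)
                | collect_free_vars(body, bound | {name}))
    # default case: generic traversal of all child nodes
    out = set()
    for child in node[1:]:
        if isinstance(child, tuple):
            out |= collect_free_vars(child, bound)
    return out
```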
A set of abstract, very generic IRs and predefined passes over them, useful
for extracting the essence of your domain-specific AST, doing some transforms
in an abstract way and then reapplying the result back to your specific
AST. For example, there is a generic liveness IR that can be used for
liveness analysis and register allocation.
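The core of liveness analysis can be sketched for straight-line code: walk the instructions backwards, maintaining live_in = use ∪ (live_out − def). This is only an illustration of the idea; a generic liveness IR like PFront's would also have to handle arbitrary control flow:

```python
# Backward liveness over straight-line code.
# Each instruction is a pair (defs, uses) of variable-name sets.
def liveness(instrs):
    """Return the live-in set for each instruction, in program order."""
    live = set()                      # nothing is live after the last instruction
    live_in = []
    for defs, uses in reversed(instrs):
        live = (live - defs) | uses   # live_in = use ∪ (live_out − def)
        live_in.append(live.copy())
    return list(reversed(live_in))
```

Register allocation then only needs the intervals where each variable appears in these sets.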
There is also a generic SSA IR, with a rich set of transformation and analysis
passes (think of LLVM, but without LL).
An embedded Prolog compiler, which is very useful for some compiler
analysis passes and for implementing type systems seamlessly (especially
Hindley-Milner-based ones).
Some minor bells and whistles that capture recurring patterns in compiler passes.
Built-in literate programming tools.
Be warned: this language is a proof of concept only and is not recommended for any
production use. It's a hack upon a pile of hacks, glued together by hacks and shit. The
good news is that such a language is relatively trivial, and anyone can implement
a similar set of features on top of the host language of their choice.
Constructing a simple language
For example, a parser for a simple language looks like this:
parser calc (pfront) {
   // Define a new parser, inheriting from the pfront syntax
   !!Spaces;   // Ignore the whitespace characters and comments

   calc ⇐ [calc0]:c [Spaces]* ⇒ c;
      // Toplevel entry parses a single calc0 expression,
      // which may be trailed by some spaces.

   binary calc0 ⇐   // Define a binary expression parser (using the Pratt algorithm)
        (200) [calc0] "*" [calc0] ⇒ mult(L,R)
        // Here, (200) is a priority, [calc0] is a recursive reference to the
        // same entry, "*" is a literal, and '⇒ mult(L,R)' is an AST constructor.
        // By default, left associativity is assumed.
      | (200) [calc0] "/" [calc0] ⇒ div(L,R)
      | (100) [calc0] "+" [calc0] ⇒ plus(L,R)
      | (100) [calc0] "-" [calc0] ⇒ minus(L,R)
      | [atom]
      ;

   // All the other nodes are parsed by Packrat and are defined with a PEG syntax
   atom ⇐ { "(" [expr]:e ")" ⇒ e }
        / { let [ident]:nm "=" [calc]:v in [calc]:body ⇒ let(nm,v,body) }
        / { [ident]:nm ⇒ var(nm) }
        / { [double]:v ⇒ const(v) }
        ;

   // This is lexerless parsing, so "tokens" must be defined with the same PEG
   double ⇐ [tkdouble]:v ⇒ {ctoken=const} $fval(v);
   @tkdouble ⇐ ("-"/"+")? [Digit]+ ("." [Digit]+)?;
}
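The binary-expression part of this grammar can be mimicked by a tiny hand-written Pratt parser in Python, with the same priorities (200 for '*' and '/', 100 for '+' and '-') and left associativity. This is a sketch of the algorithm only, not of PFront's generated parser:

```python
import re

# A minimal Pratt parser for the calc0 rule above.
PRI = {"*": 200, "/": 200, "+": 100, "-": 100}
TAG = {"*": "mult", "/": "div", "+": "plus", "-": "minus"}

def tokenize(src):
    # numbers and operators only; atoms beyond numbers are omitted here
    return re.findall(r"\d+(?:\.\d+)?|[-+*/]", src)

def parse_calc0(toks, pos=0, min_pri=0):
    node, pos = ("const", float(toks[pos])), pos + 1
    while pos < len(toks) and toks[pos] in PRI and PRI[toks[pos]] >= min_pri:
        op, pos = toks[pos], pos + 1
        # left associativity: the right operand is parsed at priority + 1,
        # so an operator of the same priority cannot capture it
        right, pos = parse_calc0(toks, pos, PRI[op] + 1)
        node = (TAG[op], node, right)
    return node, pos
```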
The AST generated by this parser is described by a matching PFront type declaration, with one variant per constructor used above: mult, div, plus, minus, let, var and const.
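As an illustration only (this is not PFront's 'ast' declaration syntax), the same shape can be modelled with Python dataclasses; the field names follow the ones used in the visitor below (nm, val, body, a, b, v):

```python
from dataclasses import dataclass
from typing import Union

# The calc AST, one class per constructor appearing in the parser above.
@dataclass
class Const:
    v: float

@dataclass
class Var:
    nm: str

@dataclass
class Let:
    nm: str
    val: "Expr"
    body: "Expr"

@dataclass
class BinOp:            # stands in for mult / div / plus / minus
    op: str
    a: "Expr"
    b: "Expr"

Expr = Union[Const, Var, Let, BinOp]
```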
Now we can write an interpreter for this simple language:
function calc_eval(env, e)
   visit:calc(expr: e) {
      once expr {
         let → calc_eval([nm; calc_eval(env, val)]:env, body)
       | deep → {
            plus  → %f+(a, b)
          | minus → %f-(a, b)
          | mult  → %f*(a, b)
          | div   → %f/(a, b)
          | const → v
          | var   → %lookup-env-car(env, nm)
          | else  → ∅ }}}
This is an example of a visitor with a nontrivial traversal order. 'visit:calc'
infers a visitor function for the 'calc' AST, starting from the 'expr' entry
node. 'once expr' defines a rewrite pattern for a single tree-node level,
without descending into sub-nodes. We need this because 'let' introduces a new
variable scope. All the nodes other than 'let' can be traversed in
depth-first order, so they go under 'deep', and a recursive depth-first
visitor is inferred for them. We only had to add explicit recursion for the 'let'
nodes.
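For comparison, here is a hand-written Python equivalent of this evaluator, with the AST encoded as tuples such as ('plus', a, b). The full recursion has to be spelled out, which is exactly what the inferred visitor avoids:

```python
# Hand-written equivalent of calc_eval: every constructor is dispatched
# explicitly, and the recursion into sub-nodes is written out by hand.
def calc_eval(env, e):
    tag = e[0]
    if tag == "const":
        return e[1]
    if tag == "var":
        return env[e[1]]
    if tag == "let":                  # new scope: extend the environment
        _, nm, val, body = e
        return calc_eval({**env, nm: calc_eval(env, val)}, body)
    ops = {"plus": lambda a, b: a + b, "minus": lambda a, b: a - b,
           "mult": lambda a, b: a * b, "div":   lambda a, b: a / b}
    _, a, b = e
    return ops[tag](calc_eval(env, a), calc_eval(env, b))
```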
Writing a compiler for the same language is even easier (translating it to an
underlying Lisp): a depth-first traversal is sufficient, so we do not need any
explicit recursion here at all:
syntax in expr, start (calc): ' "calc:" [calc]:c '
{
   return 'lisp'(%S<<(calc_eval(∅, c)))
}
And this new syntax can be used straight away, in the same source file, even with syntax highlighting working out of the box:
writeline(calc: let x = 2 in 2*2 + x*3)
Of course, the language we just implemented is extremely trivial, and its
compilation consists of one stage only, so the example above is not very
convincing. In a more realistic scenario there would be a long chain of
transformations, with multiple different (often only slightly different)
ASTs. And this is exactly what PFront is designed for.
You can see a more complex language implementation (a C-like language translated
to LLVM directly or to its own LLVM-like IR) here (with the
literate code here).