DSLs for building compilers

Programming languages implementation is a well defined sequence of very specific stages. It may include parsing, followed by a chain of rewrites of an AST, interleaved with some analysis passes over the said AST, ending with a code generation, which may also be just a rewrite pass. It does not matter if a language is domain specific, a small macro extending somewhat a functionality of some host language, or a large standalone language - all follow the same simple pattern.

In order to simplify this process as much as possible it makes sense to use domain-specific languages for each distinct stage. Parsing DSLs are ubiqutous, but often all the consequent transformations are written in an ad hoc way and result in an unnecessarily complicated code.

Here we're presenting a small experimental language, PFront, demonstrating this concept of a pervasive use of DSLs for constructing all stages of a compiler (or a language extension, a macro, a eDSL, etc.). It consists of the following eDSLs embedded into one simple extensible host language:

Be warned. This language is a proof of concept only and not recommended for any production use. It's a hack upon a pile of hacks, glued by hacks and shit. The good news is that such a language is relatively trivial and anyone can implement a similar set of features on top of their host language of a choice.

Constructing a simple language

For example, a parser for a simple language looks like this:

parser calc (pfront) {  // Define a new parser, inheriting from the pfront syntax
   !!Spaces;  // Ignore the whitespace characters and comments

   // Syntax highlighting rules:
   [lexical:] ⇐ [lexical] ⇒ {ctoken = lexic};
   [keyword:] ⇐ [keyword] ![IdentRest] ⇒ {ctoken = keyword};
   
   calc ⇐ [calc0]:c [Spaces]* ⇒ c; // Toplevel entry parses a single calc0 expression,
                                     // which may be trailed by some spaces.
   binary calc0 ⇐  // Define a binary expression parser (using Pratt algorithm)
         (200) [calc0] "*" [calc0] ⇒ mult(L,R)
         // Here, (200) is a priority, [calc0] is a recurisve reference to the same entry,
         // "*" is a literal, and '=> mult(L,R)' is an AST constructor.
         // By default, left associativity is assumed.
       | (200) [calc0] "/" [calc0] ⇒ div(L,R)
       | (100) [calc0] "+" [calc0] ⇒ plus(L,R)
       | (100) [calc0] "-" [calc0] ⇒ minus(L,R)
       | [atom]
       ;
   // All the other nodes are parsed by Packrat and are defined with a PEG syntax       
   atom ⇐ { "(" [expr]:e ")" ⇒ e }
        /  { let [ident]:nm "=" [calc]:v in [calc]:body
                  ⇒ let(nm, v, body) }
        /  { [ident]:nm ⇒ var(nm) }
        /  { [double]:v ⇒ const(v) }
        ;

   // This is a lexerless parsing, so "tokens" must be defined with the same PEG
   double ⇐ [tkdouble]:v ⇒ {ctoken = const} $fval(v);
   @tkdouble ⇐ ("-"/"+")? [Digit]+ ("." [Digit]+)?;
}

A type declaration for the AST generated by this parser is following:

ast calc {
   expr = plus(expr:a, expr:b)
        | minus(expr:a, expr:b)
        | mult(expr:a, expr:b)
        | div(expr:a, expr:b)
        | let(ident:nm, expr:val, expr:body)
        | var(ident:nm)
        | const(number:v);
}

Now we can write an interpreter for this simple language:

function calc_eval(env, e)
   visit:calc (expr:e) {
      once expr {
           let → calc_eval([nmlet(nm,val,body);calc_eval(env, vallet(nm,val,body))]:env, bodylet(nm,val,body))
         | deep → {
             plus → %f+function f+ a, b: 
Floating point Add
(aplus(a,b),bplus(a,b))
           | minus → %f-function f- a, b: 
Floating point Sub
(aminus(a,b),bminus(a,b))
           | mult → %f*function f* a, b: 
Floating point Mul
(amult(a,b),bmult(a,b))
           | div → %f/function f/ a, b: 
Floating point Div
(adiv(a,b),bdiv(a,b))
           | const → vconst(v)
           | var → %lookup-env-car(env, nmvar(nm))
           | else → }}}

This is an example of a visitor with a nontrivial traversal order. 'visit:calc' infers a visitor function for the 'calc' AST, starting with the 'expr' entry node. 'once expr' defines a rewrite pattern for a single tree node level, without going into sub-nodes. We need this because 'let' is introducing a new variable scope. But, all the other nodes but 'let' can be traversed in a depth-first order, therefore they go under 'deep', and a recursive depth-first visitor is inferred here. We only had to add an explicit recursion for the 'let' nodes.

Writing a compiler for the same language is even easier (translating it to an underlying Lisp) - depth-first traversal is sufficient so we do not need any explicit recursion here at all:

function calc_compile(ex)
   visit:calc(expr: ex) {
     deep expr {
        const → 'f#'(%->sfunction ->s o: 
Calls the [ToString] method of [o].
(vconst(v)))
      | var → nmvar(nm)
      | let → 'alet'(nmlet(nm,val,body), vallet(nm,val,body), bodylet(nm,val,body))
      | plus → 'f+'(aplus(a,b),bplus(a,b))
      | minus → 'f-'(aminus(a,b),bminus(a,b))
      | mult → 'f*'(amult(a,b),bmult(a,b))
      | div → 'f/'(adiv(a,b),bdiv(a,b))}}

Compiling the same language into a stack machine (namely, .NET IR) is not much more difficult:

function calc_compile_dotnet(ex)
  visit:calc(expr:ex) {
    deep expr {
       const → ['Ldc_R8'(vconst(v))]
     | var → ['Ldloc'('var'(nmvar(nm)))]
     | let → ['local'(nmlet(nm,val,body), t_Double);
               @vallet(nm,val,body);
               'Stloc'('var'(nmlet(nm,val,body)));
               @bodylet(nm,val,body)]
     | plus → [@aplus(a,b);@bplus(a,b);'Add'()]
     | minus → [@aminus(a,b);@bminus(a,b);'Sub'()]
     | mult → [@amult(a,b);@bmult(a,b);'Mul'()]
  | div → [@adiv(a,b);@bdiv(a,b);'Div'()]}}

And, it's trivial to embed this compiler staight into PFront itself, using its extensible syntax:

syntax in expr, start (calc): ' "calcnet:" [calc]:c '
  { return 'lisp'('n.asm'(,@calc_compile_dotnet(c),'Box'(t_Double))) }

Or, for an immediate interpretation:

syntax in expr, start (calc): ' "calc:" [calc]:c '
  { return 'lisp'(%S<<macro S<< args: 
A short form for [(buildstring ...)]
(calc_eval(,c))) }

And this new syntax can be used straight away, in the same source file, even with syntax highlighing working out of the box:

writelinemacro writeline args: 
Prints a string of arguments into a standard output,
using the [to-string] function to print each value.
(calcnet: let x = 2 in 2*2+x*3)

Of course, the language we just implemented is extremely trivial, and compilation consists of one stage only, so the example above is not very convincing. In a more realistic scenario there must be a long chain of transformations, with multiple different (often just slightly different) ASTs. And this is exactly what PFront is designed for.

You can see a more complex language implementation (a C-like language translated to LLVM directly or an own LLVM-like IR) here (with a literate code here).