## 2010-05-29

### Parser Combinators in Python, Part 3

This is the third and final post in the Parser Combinators in Python series.
Part 1
Part 2

The only remaining part is now to get the parsers to look a bit nicer. This will of necessity be a much less objective post, since this is about aesthetically pleasing syntax, which is a never-ending source for fist fights.

Currently, the parser for multiplication expressions looks like this:
`mul = Choice(Sequence(Sequence(simple, Exact("*")).apply(left), expr_rec).apply(Mul), simple)`

First, let's clean up the `Choice` and `Sequence` parts by using operator overloading on the `Parser` base class, like this:

``class Parser(object):        def __or__(self, other):                return Choice(self, other)        def __and__(self, other):                return Sequence(self, other)``

The multiplication parser can now be written as:
`mul = ((simple & Exact("*")).apply(left) & expr_rec).apply(Mul) | simple`

The same goes for the addition parser. The parenthesis parser can also be simplified, from `Sequence(Sequence(Exact("("), expr_rec).apply(right), Exact(")")).apply(left)` to `((Exact("(") & expr_rec).apply(right) & Exact(")")).apply(left)`. One thing that can be noted here is that there are a lot of `Sequence` combinators with `left` or `right` applied. That is, there are a lot of little things in the grammar (like the "(" and ")" in the parenthesis parser) that is necessary for the parsing but that we don't really care about in the AST. Let's add some syntax to get rid of them. I'll abuse the shift operators for sequences where on ly one result is used, and I'll make them be arrows pointing to the interesting part. That is, `A >> B` means "parse A, then parse B. Throw away the result from A, and return only the result from B". The additions to the `Parser` class are:

``        def __lshift__(self, other):                return Sequence(self, other, False, left)        def __rshift__(self, other):                return Sequence(self, other, False, right)``

The parsers can now be written as:

``expr_rec=Recurse()number = RegEx("0|([1-9][0-9]*)").apply(Number)paren = Exact("(") >> expr_rec << Exact(")")simple = number | parenmul = (simple << Exact("*") & expr_rec).apply(Mul) | simpleadd = (mul << Exact("+") & expr_rec).apply(Add) | mulexpr = addexpr_rec.set(expr)``

The two words that stick out here are `apply` and `Exact`. Let's get rid of `Exact` by special-casing strings in the operators, wrapping the `other` argument in an `Exact` parser as necessary:

``        def __or__(self, other):                if other.__class__==types.StringType:                        return Choice(self, Exact(other))                else:                        return Choice(self, other)        def __and__(self, other):                if other.__class__==types.StringType:                        return Sequence(self, Exact(other))                else:                        return Sequence(self, other)        def __lshift__(self, other):                if other.__class__==types.StringType:                        return Sequence(self, Exact(other), False, left)                else:                        return Sequence(self, other, False, left)        def __rshift__(self, other):                if other.__class__==types.StringType:                        return Sequence(self, Exact(other), False, right)                else:                        return Sequence(self, other, False, right)``

Since the strings can sometimes be to the left of the operator, we need right-hand versions of the operators as well. I'll just show one, since they are all on the same form.

``        def __rlshift__(self, other):                if other.__class__==types.StringType:                        return Sequence(Exact(other), self, False, left)                else:                        return Sequence(other, self, False, left)``

The parsers are now down to
`expr_rec=Recurse()number = RegEx("0|([1-9][0-9]*)").apply(Number)paren = "(" >> expr_rec << ")"simple = number | parenmul = (simple << "*" & expr_rec).apply(Mul) | simpleadd = (mul << "+" & expr_rec).apply(Add) | mulexpr = addexpr_rec.set(expr)`

Which is pretty sweet if you'd ask me. In languages that let you invent new operators, you could avoid overloading "and" etc. with new meaning, and also have something nice for `apply`. If we'd abuse the `==` operator for this, we'd get

``expr_rec=Recurse()number = RegEx("0|([1-9][0-9]*)")   == Numberparen = "(" >> expr_rec << ")"simple = number | parenmul_op = simple << "*" & expr_rec   == Mulmul = mul_op | simpleadd_op = mul << "+" & expr_rec      == Addadd = add_op | mulexpr = addexpr_rec.set(expr)``

This might arguably look better (the AST node creation really stands out), but is just too wrong for my taste.

So, we've come to road's end here. There are limits to how close you get get Python to BNF. Still, you can come close enough to have parsers that look a fair bit like their grammars, and that also are reasonably efficient. We also get the side benefit of being able to handle ambiguous grammars. In conclusion: the next time you write a parser, take a look at your friendly neighborhood parser combinator library. For Java, check out JParsec. For Python, there's PyParsing. There's implementations for most other languages as well. Enjoy!