zoom.nu: Parser Combinators in Python, Part 3

This is the third and final post in the Parser Combinators in Python series.
Part 1
Part 2

The only remaining part is now to get the parsers to look a bit nicer. This will of necessity be a much less objective post, since this is about aesthetically pleasing syntax, which is a never-ending source for fist fights.

Currently, the parser for multiplication expressions looks like this:
mul = Choice(Sequence(Sequence(simple, Exact("*")).apply(left), expr_rec).apply(Mul), simple)

First, let's clean up the Choice and Sequence parts by using operator overloading on the Parser base class, like this:


class Parser(object):
        def __or__(self, other):
                return Choice(self, other)

        def __and__(self, other):
                return Sequence(self, other)

The multiplication parser can now be written as:
mul = ((simple & Exact("*")).apply(left) & expr_rec).apply(Mul) | simple

The same goes for the addition parser. The parenthesis parser can also be simplified, from Sequence(Sequence(Exact("("), expr_rec).apply(right), Exact(")")).apply(left) to ((Exact("(") & expr_rec).apply(right) & Exact(")")).apply(left). One thing that can be noted here is that there are a lot of Sequence combinators with left or right applied. That is, there are a lot of little things in the grammar (like the "(" and ")" in the parenthesis parser) that is necessary for the parsing but that we don't really care about in the AST. Let's add some syntax to get rid of them. I'll abuse the shift operators for sequences where on ly one result is used, and I'll make them be arrows pointing to the interesting part. That is, A >> B means "parse A, then parse B. Throw away the result from A, and return only the result from B". The additions to the Parser class are:


        def __lshift__(self, other):
                return Sequence(self, other, False, left)

        def __rshift__(self, other):
                return Sequence(self, other, False, right)

The parsers can now be written as:


expr_rec=Recurse()
number = RegEx("0|([1-9][0-9]*)").apply(Number)
paren = Exact("(") >> expr_rec << Exact(")")
simple = number | paren
mul = (simple << Exact("*") & expr_rec).apply(Mul) | simple
add = (mul << Exact("+") & expr_rec).apply(Add) | mul
expr = add
expr_rec.set(expr)

The two words that stick out here are apply and Exact. Let's get rid of Exact by special-casing strings in the operators, wrapping the other argument in an Exact parser as necessary:


        def __or__(self, other):
                if other.__class__==types.StringType:
                        return Choice(self, Exact(other))
                else:
                        return Choice(self, other)
        def __and__(self, other):
                if other.__class__==types.StringType:
                        return Sequence(self, Exact(other))
                else:
                        return Sequence(self, other)
        def __lshift__(self, other):
                if other.__class__==types.StringType:
                        return Sequence(self, Exact(other), False, left)
                else:
                        return Sequence(self, other, False, left)
        def __rshift__(self, other):
                if other.__class__==types.StringType:
                        return Sequence(self, Exact(other), False, right)
                else:
                        return Sequence(self, other, False, right)

Since the strings can sometimes be to the left of the operator, we need right-hand versions of the operators as well. I'll just show one, since they are all on the same form.


        def __rlshift__(self, other):
                if other.__class__==types.StringType:
                        return Sequence(Exact(other), self, False, left)
                else:
                        return Sequence(other, self, False, left)

The parsers are now down to


expr_rec=Recurse()
number = RegEx("0|([1-9][0-9]*)").apply(Number)
paren = "(" >> expr_rec << ")"
simple = number | paren
mul = (simple << "*" & expr_rec).apply(Mul) | simple
add = (mul << "+" & expr_rec).apply(Add) | mul
expr = add
expr_rec.set(expr)

Which is pretty sweet if you'd ask me. In languages that let you invent new operators, you could avoid overloading "and" etc. with new meaning, and also have something nice for apply. If we'd abuse the == operator for this, we'd get


expr_rec=Recurse()
number = RegEx("0|([1-9][0-9]*)")   == Number
paren = "(" >> expr_rec << ")"
simple = number | paren
mul_op = simple << "*" & expr_rec   == Mul
mul = mul_op | simple
add_op = mul << "+" & expr_rec      == Add
add = add_op | mul
expr = add
expr_rec.set(expr)

This might arguably look better (the AST node creation really stands out), but is just too wrong for my taste.

So, we've come to road's end here. There are limits to how close you get get Python to BNF. Still, you can come close enough to have parsers that look a fair bit like their grammars, and that also are reasonably efficient. We also get the side benefit of being able to handle ambiguous grammars. In conclusion: the next time you write a parser, take a look at your friendly neighborhood parser combinator library. For Java, check out JParsec. For Python, there's PyParsing. There's implementations for most other languages as well. Enjoy!

zoom.nu

2010-05-29

Parser Combinators in Python, Part 3

No comments:

Post a Comment

Blog Archive