2010-05-20

Parser Combinators in Python, Part 2

This is part two of a three-part series on Parser Combinators in Python

My earlier blog post about parser combinators in Python ended with a real cliffhanger. How can the results be made into an AST instead of a list of tokens? What about the promised BNF-like syntax when building parsers? The arithmetic expression parser used as an example was quite far from that goal. Fear not, I'm back to solve at least the first of those two mysteries in this post.
How can we get a more useful result? Currently, all the parsers return lists of tokens (plain strings, actually), and Sequence and Choice concatenate lists. Ideally, we'd like something like a tree of AST elements, some representing operators, and some representing numbers.
So some parsers will return tokens or other objects representing some kind of terminal symbol, and some will return some kind of structures. The user will likely want to decide what kind of objects are used for the different terminals and which structures are used to combine them, and what parts should be used and what should be ignored. Let's add support for having the parsers apply a user-defined function to the parse result to the Parser base class:


class Parser(object):
        def apply(self, f):
                self.f=f
                return self

...and then add calls to self.f to all parse methods. For Exact it becomes:


def id(x):
        return x

class Exact(Parser):
        def __init__(self, text, f=id):
                Parser.__init__(self)
                self.text=text
                self.f=f

        def parse(self, s, pos):
                if s.startswith(self.text, pos):
                        yield (self.f(self.text), pos+len(self.text))

The user can now write Exact("Hej").apply(SwedishToEnglish), which will return "Hi" given a suitable SwedishToEnglish implementation. RegEx is similarly simple:


class RegEx(Parser):
        def __init__(self, to_match, f=id):
                Parser.__init__(self)
                self.re=re.compile(to_match)
                self.f=f
        def parse(self, s, pos):
                m=self.re.match(s, pos)
                if m:
                        yield (self.f(m.group(0)), m.end())

This allows things like RegEx("\\d+").apply(string.atoi), which is highly useful.
For Choice, it's also simple: just yield what parser A returned with self.f applied, then do the same for parser B:


class Choice(Parser):
        def __init__(self, a, b, f=id):
                Parser.__init__(self)
                self.a=a
                self.b=b
                self.f=f

        def parse(self, s, pos):
                for (a_result, a_pos) in self.a.parse(s, pos):
                        yield (self.f(a_result), a_pos)
                for (b_result, b_pos) in self.b.parse(s, pos):
                        yield (self.f(b_result), b_pos)

So what about Sequence? Since parser functions now return something user-defined as the result instead of a list of tokens, we cannot concatenate the results of parser A and parser B. The Sequence parser combinator is the one that will be used to build structures (e.g. number A followed by operator O, followed by number B becomes a structure saying that operator O should be applied to number A and number B). We're just adding a default here, since the user will override this for all interesting structures. So let's use the simplest possible structure as a default: the tuple.

def pair(a, b):
        return (a,b)

class Sequence(Parser):
        whitespace=re.compile("[ \\t\\n]+")
        def __init__(self, a, b, keep_whitespace=False, f=pair):
                Parser.__init__(self)
                self.a=a
                self.b=b
                self.keep_whitespace=keep_whitespace
                self.f=f

        def parse(self, s, pos):
                for (a_result, a_pos) in self.a.parse(s, pos):
                        if not self.keep_whitespace:
                                m=Sequence.whitespace.match(s, a_pos)
                                if m:
                                        a_pos=m.end()
                        for (b_result, b_pos) in self.b.parse(s, a_pos):
                                yield (self.f(a_result, b_result), b_pos)

I leave the function application changes to the other parsers as an exercise for the reader.
Now, let's use this in the example parser. As I'm sure you all noticed, there was a fault in the operator precedence in the previous post: addition bound harder than multiplication. This is fixed now. I've also added some classes for holding the AST, and they also have an eval method to evaluate the subexpression they represent. The example parser thus becomes:


def left(a, b):
        return a

def right(a, b):
        return b

class Add(object):
        def __init__(self, a, b):
                self.a=a
                self.b=b
        def eval(self):
                return self.a.eval() + self.b.eval()
        def __repr__(self):
                return "Add<%s,%s>" % (self.a, self.b)

class Mul(object):
        def __init__(self, a, b):
                self.a=a
                self.b=b
        def eval(self):
                return self.a.eval() * self.b.eval()
        def __repr__(self):
                return "Mul<%s,%s>" % (self.a, self.b)

class Number(object):
        def __init__(self, s):
                self.n=int(s)
        def eval(self):
                return self.n
        def __repr__(self):
                return str(self.n)

expr_rec=Recurse()
number = RegEx("0|([1-9][0-9]*)").apply(Number)
paren = Sequence(Sequence(Exact("("), expr_rec).apply(right), Exact(")")).apply(left)
simple = Choice(number, paren)
mul = Choice(Sequence(Sequence(simple, Exact("*")).apply(left), expr_rec).apply(Mul), simple)
add = Choice(Sequence(Sequence(mul, Exact("+")).apply(left), expr_rec).apply(Add), mul)
expr = add
expr_rec.set(expr)

s="1  +2\t*(4+3*2 )"
for (result,pos) in expr.parse(s, 0):
        print "Using the first %d characters" % pos
        print s[:pos]
        print result
        print result.eval()

Which, as expected, produces the following results:


Using the first 15 characters
1  +2 *(4+3*2 )
Add<1,Mul<2,Add<4,Mul<3,2>>>>
21
Using the first 5 characters
1  +2
Add<1,2>
3
Using the first 1 characters
1
1
1

That's all for this time. In the next, and final, episode, we'll take a look at how to make the parser look pretty by abusing operator overloading.

3 comments:

  1. Again, thanks a lot. This is very clear.

    Links to previous and next parts:
    Part 1: http://blog.zoom.nu/2010/05/parser-combinators-in-python-part-1.html
    Part 3: http://blog.zoom.nu/2010/05/parser-combinators-in-python-part-3.html

    ReplyDelete
  2. Oops, bad information is worse than no information; apologies for the wrong link to Part 1: it's at
    http://blog.zoom.nu/2010/05/parser-combinators-in-python.html

    (Please feel free to delete my comments.)

    ReplyDelete
  3. Thank you very much for reading it. I've updated part one and two with links to the other posts as you suggested. I hope the series was of help to you!

    ReplyDelete

Note: only a member of this blog may post a comment.