This is part two of a three-part series on Parser Combinators in Python
My earlier blog post about parser combinators in Python ended with a real cliffhanger. How can the results be made into an AST instead of a list of tokens? What about the promised BNF-like syntax when building parsers? The arithmetic expression parser used as an example was quite far from that goal. Fear not, I'm back to solve at least the first of those two mysteries in this post.
How can we get a more useful result? Currently, all the parsers return lists of tokens (plain strings, actually), and Sequence and Choice concatenate lists. Ideally, we'd like something like a tree of AST elements, some representing operators, and some representing numbers.
So some parsers will return tokens or other objects representing some kind of terminal symbol, and some will return some kind of structures. The user will likely want to decide what kind of objects are used for the different terminals and which structures are used to combine them, and what parts should be used and what should be ignored. Let's add support for having the parsers apply a user-defined function to the parse result to the
Parser base class:
class Parser(object): def apply(self, f): self.f=f return self
...and then add calls to
def id(x): return x class Exact(Parser): def __init__(self, text, f=id): Parser.__init__(self) self.text=text self.f=f def parse(self, s, pos): if s.startswith(self.text, pos): yield (self.f(self.text), pos+len(self.text))
The user can now write
Exact("Hej").apply(SwedishToEnglish), which will return
"Hi"given a suitable
RegExis similarly simple:
class RegEx(Parser): def __init__(self, to_match, f=id): Parser.__init__(self) self.re=re.compile(to_match) self.f=f def parse(self, s, pos): m=self.re.match(s, pos) if m: yield (self.f(m.group(0)), m.end())
This allows things like
RegEx("\\d+").apply(string.atoi), which is highly useful.
Choice, it's also simple: just yield what parser A returned with
self.fapplied, then do the same for parser B:
class Choice(Parser): def __init__(self, a, b, f=id): Parser.__init__(self) self.a=a self.b=b self.f=f def parse(self, s, pos): for (a_result, a_pos) in self.a.parse(s, pos): yield (self.f(a_result), a_pos) for (b_result, b_pos) in self.b.parse(s, pos): yield (self.f(b_result), b_pos)
So what about
Sequence? Since parser functions now return something user-defined as the result instead of a list of tokens, we cannot concatenate the results of parser A and parser B. The
Sequenceparser combinator is the one that will be used to build structures (e.g. number A followed by operator O, followed by number B becomes a structure saying that operator O should be applied to number A and number B). We're just adding a default here, since the user will override this for all interesting structures. So let's use the simplest possible structure as a default: the tuple.
def pair(a, b): return (a,b) class Sequence(Parser): whitespace=re.compile("[ \\t\\n]+") def __init__(self, a, b, keep_whitespace=False, f=pair): Parser.__init__(self) self.a=a self.b=b self.keep_whitespace=keep_whitespace self.f=f def parse(self, s, pos): for (a_result, a_pos) in self.a.parse(s, pos): if not self.keep_whitespace: m=Sequence.whitespace.match(s, a_pos) if m: a_pos=m.end() for (b_result, b_pos) in self.b.parse(s, a_pos): yield (self.f(a_result, b_result), b_pos)
I leave the function application changes to the other parsers as an exercise for the reader.
Now, let's use this in the example parser. As I'm sure you all noticed, there was a fault in the operator precedence in the previous post: addition bound harder than multiplication. This is fixed now. I've also added some classes for holding the AST, and they also have an
evalmethod to evaluate the subexpression they represent. The example parser thus becomes:
def left(a, b): return a def right(a, b): return b class Add(object): def __init__(self, a, b): self.a=a self.b=b def eval(self): return self.a.eval() + self.b.eval() def __repr__(self): return "Add<%s,%s>" % (self.a, self.b) class Mul(object): def __init__(self, a, b): self.a=a self.b=b def eval(self): return self.a.eval() * self.b.eval() def __repr__(self): return "Mul<%s,%s>" % (self.a, self.b) class Number(object): def __init__(self, s): self.n=int(s) def eval(self): return self.n def __repr__(self): return str(self.n) expr_rec=Recurse() number = RegEx("0|([1-9][0-9]*)").apply(Number) paren = Sequence(Sequence(Exact("("), expr_rec).apply(right), Exact(")")).apply(left) simple = Choice(number, paren) mul = Choice(Sequence(Sequence(simple, Exact("*")).apply(left), expr_rec).apply(Mul), simple) add = Choice(Sequence(Sequence(mul, Exact("+")).apply(left), expr_rec).apply(Add), mul) expr = add expr_rec.set(expr) s="1 +2\t*(4+3*2 )" for (result,pos) in expr.parse(s, 0): print "Using the first %d characters" % pos print s[:pos] print result print result.eval()
Which, as expected, produces the following results:
Using the first 15 characters 1 +2 *(4+3*2 ) Add<1,Mul<2,Add<4,Mul<3,2>>>> 21 Using the first 5 characters 1 +2 Add<1,2> 3 Using the first 1 characters 1 1 1
That's all for this time. In the next, and final, episode, we'll take a look at how to make the parser look pretty by abusing operator overloading.
Again, thanks a lot. This is very clear.ReplyDelete
Links to previous and next parts:
Part 1: http://blog.zoom.nu/2010/05/parser-combinators-in-python-part-1.html
Part 3: http://blog.zoom.nu/2010/05/parser-combinators-in-python-part-3.html
Oops, bad information is worse than no information; apologies for the wrong link to Part 1: it's atReplyDelete
(Please feel free to delete my comments.)
Thank you very much for reading it. I've updated part one and two with links to the other posts as you suggested. I hope the series was of help to you!ReplyDelete