2010-05-29

Parser Combinators in Python, Part 3

This is the third and final post in the Parser Combinators in Python series.
Part 1
Part 2

The only remaining part is now to get the parsers to look a bit nicer. This will of necessity be a much less objective post, since it deals with aesthetically pleasing syntax, which is a never-ending source of fist fights.

Currently, the parser for multiplication expressions looks like this:
mul = Choice(Sequence(Sequence(simple, Exact("*")).apply(left), expr_rec).apply(Mul), simple)

First, let's clean up the Choice and Sequence parts by using operator overloading on the Parser base class, like this:


class Parser(object):
        def __or__(self, other):
                return Choice(self, other)

        def __and__(self, other):
                return Sequence(self, other)

The multiplication parser can now be written as:
mul = ((simple & Exact("*")).apply(left) & expr_rec).apply(Mul) | simple

The same goes for the addition parser. The parenthesis parser can also be simplified, from Sequence(Sequence(Exact("("), expr_rec).apply(right), Exact(")")).apply(left) to ((Exact("(") & expr_rec).apply(right) & Exact(")")).apply(left). One thing that can be noted here is that there are a lot of Sequence combinators with left or right applied. That is, there are a lot of little things in the grammar (like the "(" and ")" in the parenthesis parser) that are necessary for the parsing but that we don't really care about in the AST. Let's add some syntax to get rid of them. I'll abuse the shift operators for sequences where only one result is used, and make them arrows pointing to the interesting part. That is, A >> B means "parse A, then parse B. Throw away the result from A, and return only the result from B". The additions to the Parser class are:


def __lshift__(self, other):
        return Sequence(self, other, False, left)

def __rshift__(self, other):
        return Sequence(self, other, False, right)
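
To see what the arrows do in isolation, here is a quick check at the interpreter (this assumes the Exact and Sequence classes from part 2 are in scope):


p = Exact("(") >> Exact("a") << Exact(")")
print list(p.parse("( a )", 0))   # [('a', 5)]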

The parsers can now be written as:


expr_rec=Recurse()
number = RegEx("0|([1-9][0-9]*)").apply(Number)
paren = Exact("(") >> expr_rec << Exact(")")
simple = number | paren
mul = (simple << Exact("*") & expr_rec).apply(Mul) | simple
add = (mul << Exact("+") & expr_rec).apply(Add) | mul
expr = add
expr_rec.set(expr)
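
As a quick sanity check that the operator spelling behaves like the original (again assuming the classes from the previous two posts are in scope):


s = "1 + 2*3"
result, pos = expr.parse(s, 0).next()
print result          # Add<1,Mul<2,3>>
print result.eval()   # 7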

The two words that stick out here are apply and Exact. Let's get rid of Exact by special-casing strings in the operators, wrapping the other argument in an Exact parser as necessary:


# (requires "import types" at the top of the module)
def __or__(self, other):
        if other.__class__==types.StringType:
                return Choice(self, Exact(other))
        else:
                return Choice(self, other)

def __and__(self, other):
        if other.__class__==types.StringType:
                return Sequence(self, Exact(other))
        else:
                return Sequence(self, other)

def __lshift__(self, other):
        if other.__class__==types.StringType:
                return Sequence(self, Exact(other), False, left)
        else:
                return Sequence(self, other, False, left)

def __rshift__(self, other):
        if other.__class__==types.StringType:
                return Sequence(self, Exact(other), False, right)
        else:
                return Sequence(self, other, False, right)

Since the strings can sometimes be to the left of the operator, we need right-hand versions of the operators as well. I'll just show one, since they all have the same form.


def __rlshift__(self, other):
        if other.__class__==types.StringType:
                return Sequence(Exact(other), self, False, left)
        else:
                return Sequence(other, self, False, left)

The parsers are now down to

expr_rec=Recurse()
number = RegEx("0|([1-9][0-9]*)").apply(Number)
paren = "(" >> expr_rec << ")"
simple = number | paren
mul = (simple << "*" & expr_rec).apply(Mul) | simple
add = (mul << "+" & expr_rec).apply(Add) | mul
expr = add
expr_rec.set(expr)

Which is pretty sweet if you ask me. In languages that let you invent new operators, you could avoid overloading "and" etc. with new meanings, and also have something nice for apply. If we were to abuse the == operator for this, we'd get:


expr_rec=Recurse()
number = RegEx("0|([1-9][0-9]*)") == Number
paren = "(" >> expr_rec << ")"
simple = number | paren
mul_op = simple << "*" & expr_rec == Mul
mul = mul_op | simple
add_op = mul << "+" & expr_rec == Add
add = add_op | mul
expr = add
expr_rec.set(expr)

This arguably looks better (the AST node creation really stands out), but it's just too wrong for my taste.
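
For reference, the overload behind that trick would be a one-liner on the Parser class, something like this sketch (shown for completeness, not as a recommendation):


def __eq__(self, f):
        # abuse ==: "parser == f" means "apply f to the parse result"
        return self.apply(f)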


So, we've come to the road's end here. There are limits to how close you can get Python to BNF. Still, you can come close enough to have parsers that look a fair bit like their grammars, and that are also reasonably efficient. We also get the side benefit of being able to handle ambiguous grammars. In conclusion: the next time you write a parser, take a look at your friendly neighborhood parser combinator library. For Java, check out JParsec. For Python, there's PyParsing. There are implementations for most other languages as well. Enjoy!

2010-05-21

SCons, half a year later

It's hard to believe, but it's already been half a year since I started using SCons for real. Seems like a good time to follow it up.

SCons is still very impressive. There are so many things you can do with a real language in the build script that there's no way I'll go back, given a choice. Just today, I was writing a little test case for a Cygwin bug where large executables would crash. To test where the limit was, I created a little C file like this:


char BLOAT_NAME[]={
#include "bloat.h"
};

with an entirely uninteresting bloat.h containing one million numbers with commas between them. Then I wrote an SConstruct like this to build and link:


objects=[]
for i in range(100):
        objects.append(StaticObject(
                source="bloat.c",
                target="bloat_%d.o" % i,
                CFLAGS="-DBLOAT_NAME=g_bloat_%d" % i
        ))
Program(target="bloat.exe", source=objects + ["main.c"])

Et voilà, a hundred object files were built from that one source file. With Make (even GNU Make), I don't know how to do this without depending on things like calling seq from $(shell) and other icky platform-dependent stuff. A silly little example, but still.

A more serious use is in my big project (a half-secret reimplementation of the important parts of the system we're working on, which gives me 10-second edit-compile-debug cycles instead of 15-minute ones), where I wrote a little parser in Python for the build configuration files we have and call it from my SConstruct. I loop over the list of source files and options that the parser produces and add the corresponding nodes to the construction environment (i.e. I put StaticObjects in a list and then pass the list to Program, like in the trivial example above). Works perfectly!
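
The shape of that SConstruct is roughly the sketch below, where parse_build_config stands in for my actual config-file parser (the name and the details are made up here):


objects=[]
for (src, cflags) in parse_build_config("build.cfg"):
        objects.append(StaticObject(source=src, CFLAGS=cflags))
Program(target="app.exe", source=objects)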

In a different side project, I also do multi-stage building, where I first build a tool that generates C code, and then build the generated code. The tool is wrapped in a Builder, and the source files that are generated get dependencies on the tool binary, so everything runs in the correct order.
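
In SCons terms, that setup looks roughly like the sketch below. All the names are invented, but builders with a generator function, Depends, and the BUILDERS dictionary are standard SCons machinery:


env = Environment()

# Stage 1: build the code generator itself.
codegen = env.Program("codegen", ["codegen.c"])

# Stage 2: a builder that runs the tool to produce C source.
def run_codegen(source, target, env, for_signature):
        return "./codegen %s > %s" % (source[0], target[0])

env.Append(BUILDERS={"Generate": Builder(generator=run_codegen)})
generated = env.Generate("tables.c", "tables.def")
env.Depends(generated, codegen)   # regenerate when the tool changes
env.Program("app", generated + ["main.c"])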

That said, SCons does have some shortcomings. The biggest one is the lack of detailed user documentation. There is a user guide that gets you started on simple projects, but when it comes to the details of the provided builders, or exactly how SCons interacts with user-defined builders, the manual is silent. The only other documentation is the API reference, but it's very shallow, and you're better off reading the source. That is of course not a proper solution, since the code only tells you what happens to work at the moment, not what you can rely on for the future.

Another problem is release engineering. Until recently, there were snapshots from the source control, and there were releases that happened out of the blue once every year or so, with barely any maintenance done. Now, they seem to have changed a little, as there is an alpha version of 2.0 out. Hopefully, they will sort things out.

So, in conclusion: SCons still zooms. It zooms enough to make this die-hard Make fan convert. Trust me, I'm picky about build systems. Really picky. SCons is the first system I've used that is actually better than Make. It is a proper build system that tracks dependencies and all that, yet it's dead easy to start using. You should definitely use it.

2010-05-20

Parser Combinators in Python, Part 2

This is part two of a three-part series on Parser Combinators in Python

My earlier blog post about parser combinators in Python ended with a real cliffhanger. How can the results be made into an AST instead of a list of tokens? What about the promised BNF-like syntax when building parsers? The arithmetic expression parser used as an example was quite far from that goal. Fear not, I'm back to solve at least the first of those two mysteries in this post.
How can we get a more useful result? Currently, all the parsers return lists of tokens (plain strings, actually), and Sequence and Choice concatenate lists. Ideally, we'd like something like a tree of AST elements, some representing operators, and some representing numbers.
So some parsers will return tokens or other objects representing some kind of terminal symbol, and some will return some kind of structure. The user will likely want to decide which kinds of objects are used for the different terminals, which structures are used to combine them, and which parts should be kept and which should be ignored. Let's add support to the Parser base class for applying a user-defined function to the parse result:


class Parser(object):
        def apply(self, f):
                self.f=f
                return self

...and then add calls to self.f to all parse methods. For Exact it becomes:


def id(x):
        return x

class Exact(Parser):
        def __init__(self, text, f=id):
                Parser.__init__(self)
                self.text=text
                self.f=f

        def parse(self, s, pos):
                if s.startswith(self.text, pos):
                        yield (self.f(self.text), pos+len(self.text))

The user can now write Exact("Hej").apply(SwedishToEnglish), which will return "Hi" given a suitable SwedishToEnglish implementation.
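A suitable implementation could be as trivial as this toy (made up purely for illustration):


def SwedishToEnglish(word):
        return {"Hej": "Hi"}.get(word, word)

print Exact("Hej").apply(SwedishToEnglish).parse("Hej!", 0).next()   # ('Hi', 3)

RegEx is similarly simple: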


class RegEx(Parser):
        def __init__(self, to_match, f=id):
                Parser.__init__(self)
                self.re=re.compile(to_match)
                self.f=f
        def parse(self, s, pos):
                m=self.re.match(s, pos)
                if m:
                        yield (self.f(m.group(0)), m.end())

This allows things like RegEx("\\d+").apply(string.atoi), which is highly useful.
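A quick check at the interpreter (with import re and import string at the top):


p = RegEx("\\d+").apply(string.atoi)
print list(p.parse("42abc", 0))   # [(42, 2)]
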
For Choice, it's also simple: just yield what parser A returned with self.f applied, then do the same for parser B:


class Choice(Parser):
        def __init__(self, a, b, f=id):
                Parser.__init__(self)
                self.a=a
                self.b=b
                self.f=f

        def parse(self, s, pos):
                for (a_result, a_pos) in self.a.parse(s, pos):
                        yield (self.f(a_result), a_pos)
                for (b_result, b_pos) in self.b.parse(s, pos):
                        yield (self.f(b_result), b_pos)

So what about Sequence? Since parser functions now return something user-defined as the result instead of a list of tokens, we cannot concatenate the results of parser A and parser B. The Sequence parser combinator is the one that will be used to build structures (e.g. number A followed by operator O, followed by number B becomes a structure saying that operator O should be applied to number A and number B). We're just adding a default here, since the user will override this for all interesting structures. So let's use the simplest possible structure as a default: the tuple.

def pair(a, b):
        return (a,b)

class Sequence(Parser):
        whitespace=re.compile("[ \\t\\n]+")
        def __init__(self, a, b, keep_whitespace=False, f=pair):
                Parser.__init__(self)
                self.a=a
                self.b=b
                self.keep_whitespace=keep_whitespace
                self.f=f

        def parse(self, s, pos):
                for (a_result, a_pos) in self.a.parse(s, pos):
                        if not self.keep_whitespace:
                                m=Sequence.whitespace.match(s, a_pos)
                                if m:
                                        a_pos=m.end()
                        for (b_result, b_pos) in self.b.parse(s, a_pos):
                                yield (self.f(a_result, b_result), b_pos)

I leave the function application changes to the other parsers as an exercise for the reader.
Now, let's use this in the example parser. As I'm sure you all noticed, there was a fault in the operator precedence in the previous post: addition bound harder than multiplication. This is fixed now. I've also added some classes for holding the AST, and they also have an eval method to evaluate the subexpression they represent. The example parser thus becomes:


def left(a, b):
        return a

def right(a, b):
        return b

class Add(object):
        def __init__(self, a, b):
                self.a=a
                self.b=b
        def eval(self):
                return self.a.eval() + self.b.eval()
        def __repr__(self):
                return "Add<%s,%s>" % (self.a, self.b)

class Mul(object):
        def __init__(self, a, b):
                self.a=a
                self.b=b
        def eval(self):
                return self.a.eval() * self.b.eval()
        def __repr__(self):
                return "Mul<%s,%s>" % (self.a, self.b)

class Number(object):
        def __init__(self, s):
                self.n=int(s)
        def eval(self):
                return self.n
        def __repr__(self):
                return str(self.n)

expr_rec=Recurse()
number = RegEx("0|([1-9][0-9]*)").apply(Number)
paren = Sequence(Sequence(Exact("("), expr_rec).apply(right), Exact(")")).apply(left)
simple = Choice(number, paren)
mul = Choice(Sequence(Sequence(simple, Exact("*")).apply(left), expr_rec).apply(Mul), simple)
add = Choice(Sequence(Sequence(mul, Exact("+")).apply(left), expr_rec).apply(Add), mul)
expr = add
expr_rec.set(expr)

s="1  +2\t*(4+3*2 )"
for (result,pos) in expr.parse(s, 0):
        print "Using the first %d characters" % pos
        print s[:pos]
        print result
        print result.eval()

Which, as expected, produces the following results:


Using the first 15 characters
1  +2 *(4+3*2 )
Add<1,Mul<2,Add<4,Mul<3,2>>>>
21
Using the first 5 characters
1  +2
Add<1,2>
3
Using the first 1 characters
1
1
1

That's all for this time. In the next, and final, episode, we'll take a look at how to make the parser look pretty by abusing operator overloading.

2010-05-10

Random changes

Note to self: if random changes make a problem appear or disappear, maybe the operative word is "random", and not "changes".

2010-05-08

Parser Combinators in Python

This is part one of a three-part series on Parser Combinators in Python

In an earlier post, I mentioned Parser Combinators and that you should use them. In this post, I will explain a bit what they are and how they work. In actual code, I suggest you use a ready-made library like PyParsing or JParsec, but it's nevertheless useful to know how they work.
For some reason, almost every single parsing tool ever written takes as input a file in some strange mixture of a domain-specific language and code snippets in the language you actually want to use, and generates a new file that holds a parsing function. Since I usually do not write parsers, and the domain-specific languages are invariably cryptic, I tend not to write parsers using those tools. I'm sure they work very well if you take the time to learn them, but for those of us only writing a non-trivial parser every other year or so, it doesn't make much sense.
There are alternatives, though. Recursive descent parsers, where you have a bunch of functions representing each rule and the functions call each other like in the example below, work but are a bit tedious to write:


class Term(Parser):
        term_re=re.compile("0|([1-9][0-9]*)")
        def __init__(self, input):
                input.save_mark()
                m=Term.term_re.match(input.buffer())
                if m:
                        input.discard_mark()
                        self.term=m.group(0)
                        input.consume(m.end())
                else:
                        input.restore_mark()
                        raise ParseError()
class Add(Parser):
        def __init__(self, input):
                input.save_mark()
                try:
                        self.term_a=Term(input)
                        self.op=AddOperator(input)
                        self.term_b=Term(input)
                        input.discard_mark()
                except ParseError as e:
                        input.restore_mark()
                        raise e

Back-tracking is here implemented using exceptions and a class that can handle a number of restore points in a string.
A much better way of writing these is to use parser combinators. Parser combinators wrap up the code that combine different simple parsers, so that the previous example can be written to say that an Add is a Term followed by an AddOperator followed by another Term.
There are three things that come together in most parser combinator implementations: letting the parsers return a list of possible parses instead of just a single parse, lazy evaluation, and the use of operator overloading to make the definition of parsers look like BNF. The first feature gives us the ability to handle ambiguous grammars, which can be nice at times, but the main benefit is that when combined with the second, it gives us backtracking more or less for free. The operator overloading is entirely optional, but can make the parsers that are written using the library much easier to read.
So let's get started. Parsers are supposed to return a list of parse results, so a parser for exact matches would look like this:


import re

class Parser(object):
        # minimal base class; the combinators below all subclass it
        pass

class Exact(Parser):
        def __init__(self, text):
                self.text=text

        def parse(self, s, pos):
                if s.startswith(self.text, pos):
                        yield ([self.text], pos+len(self.text))

If the input text (s, at position pos) starts with exactly what we're looking for (self.text), then we yield that result and the position of the next character in the input. We use yield instead of appending to a list and returning it in order to get lazy evaluation (which does not give us anything yet, but is essential when we come to alternatives and sequences).
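Trying it out: a match yields exactly one result, and a mismatch yields nothing at all:


print list(Exact("1").parse("123", 0))   # [(['1'], 1)]
print list(Exact("2").parse("123", 0))   # []
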
Stepping up in abstraction a small step, we can try optional parsing. Given a parser that parses something, Optional gives us a new parser that parses that something, or nothing at all if that fails. Or actually, since we're generating a list of all possible parses, we yield both the empty parse and the something.


class Optional(Parser):
        def __init__(self, p):
                self.p=p

        def parse(self, s, pos):
                yield ([], pos)
                for t in self.p.parse(s, pos):
                        yield t

Now we can do things like list(Optional(Exact("1")).parse("123", 0)) and get [([], 0), (["1"], 1)]: the two ways of parsing an optional "1" from "123" are nothing, with 0 as the next position, or "1", with 1 as the next position.
These higher-order parsers, that is, parser classes that take parser instances as arguments and use them for the actual parsing, are called parser combinators.
Let's try choice next. Either parsing a "1" or parsing a "2" from "123" should be every possible way of parsing "1" from it, followed by every possible way of parsing "2" from it.


class Choice(Parser):
        def __init__(self, a, b):
                self.a=a
                self.b=b

        def parse(self, s, pos):
                for (a_results, a_pos) in self.a.parse(s, pos):
                        yield (a_results, a_pos)
                for (b_results, b_pos) in self.b.parse(s, pos):
                        yield (b_results, b_pos)

It doesn't matter whether self.a succeeds or fails its parsing; we yield every possible parse of self.b from the same position anyway. This would normally be very inefficient: if a succeeds, there's no reason to try b. This is where lazy evaluation comes in. Unless someone actually asks for more results after we have yielded what self.a yielded, self.b will never be used. Choice(Exact("1"), Exact("2")).parse("123", 0).next() will never call the parse method of Exact("2"). If we use list() to get every single result from the parse, it will be called, though. If you ask for only the first parse, the parsing will run until it gets one result. If you ask for every single possible way to parse the input, then the parser will try every single possible way of parsing it.
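To make both the ambiguity and the laziness concrete:


p = Choice(Exact("1"), Exact("12"))
print p.parse("123", 0).next()   # (['1'], 1) -- Exact("12") is never run
print list(p.parse("123", 0))    # [(['1'], 1), (['12'], 2)] -- now it is
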
Next, let's do sequence. For every possible way of using parser A, use parser B where A left off. The result is whatever A produced, followed by whatever B produced, and the ending position is where B left off:


class Sequence(Parser):
        def __init__(self, a, b):
                self.a=a
                self.b=b

        def parse(self, s, pos):
                for (a_results, a_pos) in self.a.parse(s, pos):
                        for (b_results, ab_pos) in self.b.parse(s, a_pos):
                                yield (a_results + b_results, ab_pos)

In many cases, we want to ignore whitespace between any parts of the grammar. This can be done in the Sequence combinator by skipping whitespace after parsing with A, but before parsing with B, like this:


class Sequence(Parser):
        whitespace=re.compile("[ \\t\\n]+")
        def __init__(self, a, b, keep_whitespace=False):
                self.a=a
                self.b=b
                self.keep_whitespace=keep_whitespace

        def parse(self, s, pos):
                for (a_results, a_pos) in self.a.parse(s, pos):
                        if not self.keep_whitespace:
                                m=Sequence.whitespace.match(s, a_pos)
                                if m:
                                        a_pos=m.end()
                        for (b_results, ab_pos) in self.b.parse(s, a_pos):
                                yield (a_results + b_results, ab_pos)
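
For example, with the whitespace skipping in place:


p = Sequence(Exact("1"), Exact("2"))
print list(p.parse("1 \t 2", 0))   # [(['1', '2'], 5)]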

The final piece needed to support the usual BNF operators is something that lets us match a parser multiple times. This is almost exactly like the Sequence parser, except that parser B is replaced by the combinator itself:


class Many(Parser):
        def __init__(self, p):
                self.p=p

        def parse(self, s, pos):
                yield ([], pos)
                for (p_results, p_pos) in self.p.parse(s, pos):
                        for (self_results, self_pos) in self.parse(s, p_pos):
                                yield (p_results + self_results, self_pos)
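
As with Optional, every possible repetition count is produced, shortest first:


print list(Many(Exact("a")).parse("aab", 0))
# [([], 0), (['a'], 1), (['a', 'a'], 2)]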

For most non-trivial grammars, there will be some rules that refer to themselves, either directly or via some other rule. Since Python does not have forward declarations, we can use a little proxy class to solve this:


class Recurse(Parser):
        def set(self, p):
                self.p=p
        def parse(self, s, pos):
                for result in self.p.parse(s, pos):
                        yield result

This way, we can get the parser object in advance, refer to it in a bunch of rules, and then define the real recursively used parser and call set on the proxy. Everything will then work.
In order to skip writing the boring parts of matching numbers or identifiers using sequences of choices of exact parsers, I usually use a regex parser, like so:


class RegEx(Parser):
        def __init__(self, to_match):
                self.re=re.compile(to_match)
        def parse(self, s, pos):
                m=self.re.match(s, pos)
                if m:
                        yield ([m.group(0)], m.end())

Putting all this together, we can write a parser for simple arithmetic expressions:


expr_rec = Recurse()
number = RegEx("0|([1-9][0-9]*)")
paren = Sequence(Exact("("), Sequence(expr_rec, Exact(")")))
simple = Choice(number, paren)
add = Choice(Sequence(simple, Sequence(Exact("+"), expr_rec)), simple)
mul = Choice(Sequence(add, Sequence(Exact("*"), expr_rec)), add)
expr = mul
expr_rec.set(expr)
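
A quick test drive, asking for the first parse of a small input:


print expr.parse("1+2*3", 0).next()
# (['1', '+', '2', '*', '3'], 5)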

There are two problems with this code: the first is that it's pretty hard to read with all the Sequence and Choice calls, and the second is that no AST is built: it's all just returned as a flat list of tokens. Since this post is long enough already, these problems will be solved in the next post.