This is part one of a three-part series on Parser Combinators in Python
In an earlier post, I mentioned Parser Combinators and that you should use them. In this post, I will explain a bit what they are and how they work. In actual code, I suggest you use a ready-made library like PyParsing or JParsec, but it's nevertheless useful to know how they work.
For some reason, almost every single parsing tool ever written takes as input a file in some strange mixture of a domain-specific language and code snippets in the language you actually want to use, and generates a new file that holds a parsing function. Since I usually do not write parsers, and the domain-specific languages are invariably cryptic, I tend not to write parsers using those tools. I'm sure they work very well if you take the time to learn them, but for those of us only writing a non-trivial parser every other year or so, it doesn't make much sense.
There are alternatives, though. Recursive descent parsers, where you have a bunch of functions representing each rule, and have the functions calling each other like in the below example, works but are a bit tedious to write:
class Term(Parser): term_re=re.compile("0|([1-9][0-9]*)") def __init__(self, input): input.save_mark() m=Term.term_re.match(input.buffer()) if m: input.discard_mark() self.term=m.group(0) input.consume(m.end()) else: input.restore_mark() raise ParseError() class Add(Parser): def __init__(self, input): input.save_mark() try: self.term_a=Term(input) self.op=AddOperator(input) self.term_b=Term(input) input.discard_mark() except ParseError as e: input.restore_mark() raise e
Back-tracking is here implemented using exceptions and a class that can handle a number of restore points in a string.
A much better way of writing these is to use parser combinators. Parser combinators wrap up the code that combine different simple parsers, so that the previous example can be written to say that an Add is a Term followed by an AddOperator followed by another Term.
There are three things that come together in most parser combinator implementations: letting the parsers return a list of possible parses instead of just a single parse, lazy evaluation, and the use of operator overloading to make the definition of parsers look like BNF. The first feature gives us the ability to handle ambiguous grammars, which can be nice at times, but the main benefit is that when combined with the second, it gives us backtracking more or less for free. The operator overloading is entirely optional, but can make the parsers that are written using the library much easier to read.
So let's get started. Parsers are supposed to return a list of parse results, so a parser for exact matched would look like this:
class Exact(Parser): def __init__(self, text): self.text=text def parse(self, s, pos): if s.startswith(self.text, pos): yield ([self.text], pos+len(self.text))
If the input text (
s, at position
pos) is exactly what we're looking for (
self.text), then we yield that result and the position of the next character in the input. We use
yieldinstead of appending to a list and retuning it to get lazy evaluation (which does not give us anything yet, but is essential when we come to alternatives and sequences)
Stepping up in abstraction a small step, we can try optional parsing. Given a parser that parses something,
Optionalgives us a new parser that parses that something, or nothing at all if that fails. Or actually, since we're generating a list of all possible parses, we yield both the empty parse and the something.
class Optional(Parser): def __init__(self, p): self.p=p def parse(self, s, pos): yield (, pos) for t in self.p.parse(s, pos): yield t
Now we can do things like
[(, 0), (["1"], 1)]: the two ways of parsing an optional "1" from "123" is nothing with 0 as the next position, or "1" with 1 as the next position.
These higher-order parsers, that is, parser classes that take parser instances as arguments and use them for the actual parsing, are called parser combinators.
Let's try choice next. Either parsing a "1" or parsing a "2" from "123" should be every possible way of parsing "1" from it, followed by every possible way of parsing "2" from it.
class Choice(Parser): def __init__(self, a, b): self.a=a self.b=b def parse(self, s, pos): for (a_results, a_pos) in self.a.parse(s, pos): yield (a_results, a_pos) for (b_results, b_pos) in self.b.parse(s, pos): yield (b_results, b_pos)
It doesn't matter if
self.asucceeds or fails its parsing, we yield every possible parse of
self.bfrom the same position anyway. This would normally be very inefficient: if a succeeds, there's no reason to try b. This is where lazy evaluation comes in. Unless someone actually asks for more results after we yielded what
self.bwill never be used.
Choice(Extact("1"), Exact("2)).parse("123").next()will never call the
Exact("2"). If we use
list()to get every single result from the parse, it will be called though. If you ask for only the first parse, the parsing will run until it gets one result. If you ask for every single possible way to parse the input, then the parser will try every single possible way of parsing it.
Next, let's do sequence. For every possible way of using parser A, use parser B where A left off. The result is whatever A produced, followed by whatever B produced, and the ending position is where B left off:
class Sequence(Parser): def __init__(self, a, b): self.a=a self.b=b def parse(self, s, pos): for (a_results, a_pos) in self.a.parse(s, pos): for (b_results, ab_pos) in self.b.parse(s, a_pos): yield (a_results + b_results, ab_pos)
In many cases, we want to ignore whitespace between any parts of the grammar. This can be done in the Sequence combinator by skipping whitespace after parsing with A, but before parsing with B, like this:
class Sequence(Parser): whitespace=re.compile("[ \\t\\n]+") def __init__(self, a, b, keep_whitespace=False): self.a=a self.b=b self.keep_whitespace=keep_whitespace def parse(self, s, pos): for (a_results, a_pos) in self.a.parse(s, pos): if not self.keep_whitespace: m=Sequence.whitespace.match(s, a_pos) if m: a_pos=m.end() for (b_results, ab_pos) in self.b.parse(s, a_pos): yield (a_results + b_results, ab_pos)
The final part to having support for the BNF operators is something that allows us to match a parser multiple times. This is almost exactly like the
Sequenceparser, except that parser B is replaced with the combinator itself:
class Many(Parser): def __init__(self, p): self.p=p def parse(self, s, pos): yield (, pos) for (p_results, p_pos) in self.p.parse(s, pos): for (self_results, self_pos) in self.parse(s, p_pos): yield (p_results + self_results, self_pos)
For most non-trivial grammars, there will be some rules that refer themselves, either directly or via some other rule. Since Python does not have forward declarators, we can use a little proxy class to solve it:
class Recurse(Parser): def set(self, p): self.p=p def parse(self, s, pos): for result in self.p.parse(s, pos): yield result
This way, we can get the parser object in advance, refer to it in a bunch of rules, and then define the real recursively used parser and call the
setmethod on the proxy, and everything will work.
In order to skip writing the boring parts of matching numbers or identifiers using sequences of choices of exact parsers, I usually use a regex parser, like so:
class RegEx(Parser): def __init__(self, to_match): self.re=re.compile(to_match) def parse(self, s, pos): m=self.re.match(s, pos) if m: yield ([m.group(0)], m.end())
Putting all this together, we can write a parser for simple arithmetic expressions:
expr_rec = Recurse() number = RegEx("0|([1-9][0-9]*)") paren = Sequence(Exact("("), Sequence(expr_rec, Exact(")"))) simple = Choice(number, paren) add = Choice(Sequence(simple, Sequence(Exact("+"), expr_rec)), simple) mul = Choice(Sequence(add, Sequence(Exact("*"), expr_rec)), add) expr = mul expr_rec.set(expr)
There are two problems with this code: the first is that it's pretty hard to read with all the
Choicecalls, and the second is that no AST is built: it's all just returned as a flat list of tokens. Since this post is long enough already, these problems will be solved in the next post.