2009-06-02

Indexed regex searches

Just did a little Python implementation of the ideas in A Fast Regular Expression Indexing Engine. I only did the naïve thing and search the whole corpus each time in sel, but it works and it's cool.


Since mucking around with the regex library to get it to search an index for documents that an expression might match seemed to be too big a task (and I'm afraid of big, scary real-world regex engine implementations) I wrote a little parsing library modeled around Parsing Combinators. Easy to implement, easy to use, and very powerful. You should definitely use them the next time you need an ad-hoc parser.


Anyway, back to the searching. The proper way of implementing sel is probably to compute all the sel results for all k-grams for a given k in one swell foop. This would reduce the number of whole-corpus searches from approximately 100000 times in my current toy example to around 20 for any corpus. That would even make it usable for real-world applications. Maybe it's time to play with Lucene again...