Since mucking around with the regex library to get it to search an index for documents that an expression might match seemed to be too big a task (and I'm afraid of big, scary real-world regex engine implementations) I wrote a little parsing library modeled around Parsing Combinators. Easy to implement, easy to use, and very powerful. You should definitely use them the next time you need an ad-hoc parser.
Anyway, back to the searching. The proper way of implementing sel is probably to compute all the sel results for all k-grams for a given k in one swell foop. This would reduce the number of whole-corpus searches from approximately 100000 times in my current toy example to around 20 for any corpus. That would even make it usable for real-world applications. Maybe it's time to play with Lucene again...
Post a Comment
Note: only a member of this blog may post a comment.