Pythonic custom multi-line parsers


Hi list,

I'm looking for ideas for a pretty, Pythonic solution to a specific
problem that I keep solving over and over without ever being happy with
the result. It always works, but it is never pretty. So see this as an
open-ended brainstorming question.

Here's the task: There's a custom file format. Each line can be parsed
individually and, given the current context, the meaning of each
individual line is always clearly distinguishable. I'll give an easy
example to demonstrate:


moo = koo
bar = foo
foo :=
   abc
   def
baz = abc

Let's say the root context knows only two regexes, which I'll name:

keyvalue: \w+ = \w+
start-multiblock: \w+ :=

The keyvalue is self-contained: once the line is successfully parsed,
all the information is present. The start-multiblock, however, gives us
only part of the puzzle, namely the name of the following block. In the
multiblock context, a different set of regexes can match (in this
example only one):

multiblock-item: \s\w+

Now obviously, when the block is finished, there's no delimiter. The end
is implied by the multiblock-item regex not matching, so we backtrack
to the previous parser (the root parser), which can successfully parse
the last line, baz = abc.
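
To make the fallback concrete, here is a minimal sketch of the line
classification described above. It is only an illustration: the
CONTEXTS and OPENS tables and the classify() function are hypothetical
names, and the multiblock-item pattern uses \s+ (not \s) on the
assumption that items may be indented by more than one character.

```python
import re

# Hypothetical rule tables: one list of (name, regex) pairs per context.
CONTEXTS = {
    "root": [
        ("keyvalue", re.compile(r"(\w+) = (\w+)")),
        ("start-multiblock", re.compile(r"(\w+) :=")),
    ],
    "multiblock": [
        # Assumption: \s+ so that any indentation depth is accepted.
        ("multiblock-item", re.compile(r"\s+(\w+)")),
    ],
}

# Rules that open a nested context when they match (assumption of this sketch).
OPENS = {"start-multiblock": "multiblock"}


def classify(lines):
    """Yield (context, rule_name, groups) for every input line."""
    stack = ["root"]
    for line in lines:
        line = line.rstrip("\n")
        while True:
            context = stack[-1]
            for name, regex in CONTEXTS[context]:
                match = regex.fullmatch(line)
                if match:
                    break
            else:
                match = None
            if match:
                yield context, name, match.groups()
                if name in OPENS:
                    stack.append(OPENS[name])
                break
            if len(stack) == 1:
                raise ValueError(f"unparseable line: {line!r}")
            # Implicit end of block: retry the same line in the enclosing context.
            stack.pop()
```

Because the contexts live on a stack, nested blocks would fall back one
level at a time until some enclosing context accepts the line.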

Keep in mind that this is a deliberately simple example. In general
you'll have multiple contexts, many more regexes and, above all, nesting
inside these contexts.

I'd like to avoid a parser generator (for the formats I deal with,
those are usually too much overhead), so what I usually end up doing is
building a state machine by hand. That is, I keep track of the current
context, try its regexes against the line and, when none matches,
manually hand the line back to the matchers of the enclosing context.
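
For illustration, a hand-built state machine for the example format
might look roughly like the sketch below. The names (parse, state,
block) are made up for this sketch, the result layout (a dict whose
multiblock values are lists) is an assumption, and \s+ is again used
for the item indentation.

```python
import re

KEYVALUE = re.compile(r"(\w+) = (\w+)")
START_MULTIBLOCK = re.compile(r"(\w+) :=")
MULTIBLOCK_ITEM = re.compile(r"\s+(\w+)")  # assumption: \s+ instead of \s


def parse(lines):
    """Parse the example format into a dict (hypothetical result layout)."""
    result = {}
    state = "root"
    block = None  # the list being filled while state == "multiblock"
    for line in lines:
        line = line.rstrip("\n")
        if state == "multiblock":
            match = MULTIBLOCK_ITEM.fullmatch(line)
            if match:
                block.append(match.group(1))
                continue
            # No match: the block ended implicitly, so fall back to root
            # and hand the same line to the root matchers below.
            state = "root"
            block = None
        match = KEYVALUE.fullmatch(line)
        if match:
            result[match.group(1)] = match.group(2)
            continue
        match = START_MULTIBLOCK.fullmatch(line)
        if match:
            block = result[match.group(1)] = []
            state = "multiblock"
            continue
        raise ValueError(f"unparseable line: {line!r}")
    return result
```

With only two contexts this is still readable, but every additional
context or nesting level adds another branch and another piece of
manually managed state.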

This results in AWFULLY ugly code. I'm wondering what your ideas are to
solve this neatly in a Pythonic fashion without having to rely on
third-party dependencies.

Cheers,
Joe