Library for parsing binary structures


On Fri, 29 Mar 2019 at 16:16, Peter J. Holzer <hjp-python at hjp.at> wrote:

> Obviously you need some way to describe the specific binary format you
> want to parse - in other words, a grammar. The library could then use
> the grammar to parse the input - either by interpreting it directly, or
> by generating (Python) code from it. The latter has the advantage that
> it has to be done only once, not every time you want to parse a file.
>
> If that sounds familiar, it's what yacc does. Except that it does it for
> text files, not binary files. I am not aware of any generic binary
> parser generator for Python. I have read research papers about such
> generators for (I think) C and Java, but I don't remember the names and
> I'm not sure if the generators got beyond the proof of concept stage.

That's precisely what I'm looking at. The construct library
(https://pypi.org/project/construct/) basically does that, but using a
DSL implemented in Python rather than generating Python code from a
grammar. In fact, the problem I had with my recursive data structure
turned out to be solvable in construct - as the DSL effectively builds
a data structure describing the grammar, I was able to convert the
problem of writing a recursive grammar into one of writing a recursive
data structure:

type_layouts = {}
layout1 = <whatever>
layout2 = <something recursive referring to type_layouts>
type_layouts[1] = layout1
type_layouts[2] = layout2
data_layout = <get a typecode, parse the rest based on type_layouts[typecode]>
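For illustration only, here is a minimal sketch of the same dispatch-table idea using just the standard library's struct module. It does not use construct's API. The format (a 1-byte typecode, with typecode 1 as a 32-bit big-endian integer and typecode 2 as a counted list of nested values) and all function names are assumptions invented for the example.

```python
import struct

# Hypothetical format, invented for this sketch:
#   every value starts with a 1-byte typecode
#   typecode 1: unsigned 32-bit big-endian integer
#   typecode 2: 1-byte count, followed by that many nested values

# Filled in after the parsers are defined, so a parser can refer to the
# table (and therefore to itself) before the table is complete.
type_layouts = {}


def parse_value(data, offset=0):
    """Read a typecode, then parse the rest based on type_layouts[typecode]."""
    typecode = data[offset]
    return type_layouts[typecode](data, offset + 1)


def parse_int(data, offset):
    """Return (value, new_offset) for a 32-bit big-endian integer."""
    (value,) = struct.unpack_from(">I", data, offset)
    return value, offset + 4


def parse_list(data, offset):
    """Return (items, new_offset); each item is parsed recursively."""
    count = data[offset]
    offset += 1
    items = []
    for _ in range(count):
        item, offset = parse_value(data, offset)
        items.append(item)
    return items, offset


type_layouts[1] = parse_int
type_layouts[2] = parse_list

# A list holding the integer 7 and a nested list holding the integer 256.
sample = bytes([2, 2, 1, 0, 0, 0, 7, 2, 1, 1, 0, 0, 1, 0])
value, end = parse_value(sample)
```

The recursion works because parse_list only looks up type_layouts when it is called, by which time both entries have been registered.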

The resulting parser works, but it gives horrible error messages.
This is a common problem with generated parsers; there are plenty of
books and articles covering how to persuade tools like yacc to
produce usable error reports on parse failures. There don't seem to
be any particularly good error reporting features in construct
(although I haven't looked closely), so I'm actually now looking at
writing a hand-crafted parser, just to control the error reporting[1].
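
As a rough sketch of what controlling the error reporting could look like in a hand-written parser, the example below tracks the current byte offset and a field path, and puts both into the exception message. The Reader and ParseError classes, the field names and the two-field header are all hypothetical, chosen only to show the idea.

```python
import struct


class ParseError(Exception):
    """Parse failure that records where in the input it happened."""

    def __init__(self, message, offset, path):
        location = "/".join(path) or "<root>"
        super().__init__(f"{location} at offset {offset}: {message}")
        self.offset = offset
        self.path = tuple(path)


class Reader:
    """Cursor over a bytes object that knows which field it is reading."""

    def __init__(self, data):
        self.data = data
        self.offset = 0
        self.path = []

    def unpack(self, fmt, name):
        """Read one struct-format value, naming the field in any error."""
        self.path.append(name)
        try:
            size = struct.calcsize(fmt)
            remaining = len(self.data) - self.offset
            if remaining < size:
                raise ParseError(
                    f"need {size} bytes, only {remaining} left",
                    self.offset,
                    self.path,
                )
            (value,) = struct.unpack_from(fmt, self.data, self.offset)
            self.offset += size
            return value
        finally:
            self.path.pop()


def parse_header(reader):
    """Hypothetical header: 16-bit magic number, then 32-bit length."""
    reader.path.append("header")
    try:
        magic = reader.unpack(">H", "magic")
        length = reader.unpack(">I", "length")
    finally:
        reader.path.pop()
    return magic, length


# Truncated input: the magic number is present, the length is cut short.
error = None
try:
    parse_header(Reader(b"\xca\xfe\x00\x01"))
except ParseError as exc:
    error = exc
```

Here str(error) names the field that failed and the offset where reading stopped, which is the kind of detail a generic parser's error often leaves out.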

I don't know which solution I'll ultimately use, but it's an
interesting exercise doing it both ways. And parsing binary data,
unlike parsing text, is actually easy enough that hand crafting a
parser isn't that much of a bother - maybe that's why there's less
existing work in this area.

Paul

[1] The errors I'm reporting on are likely to be errors in my parsing
code at this point, rather than errors in the data, but the problem is
pretty much the same either way ;-)