
Library for parsing binary structures

On 2019-03-29 16:34:35 +0000, Paul Moore wrote:
> On Fri, 29 Mar 2019 at 16:16, Peter J. Holzer <hjp-python at> wrote:
> > Obviously you need some way to describe the specific binary format you
> > want to parse - in other words, a grammar. The library could then use
> > the grammar to parse the input - either by interpreting it directly, or
> > by generating (Python) code from it. The latter has the advantage that
> > it has to be done only once, not every time you want to parse a file.
> >
> > If that sounds familiar, it's what yacc does. Except that it does it for
> > text files, not binary files. I am not aware of any generic binary
> > parser generator for Python. I have read research papers about such
> > generators for (I think) C and Java, but I don't remember the names and
> > I'm not sure if the generators got beyond the proof of concept stage.
> That's precisely what I'm looking at. The construct library
> ( basically does that, but using a
> DSL implemented in Python rather than generating Python code from a
> grammar.

Good to know. I'll add that to my list of Tools Which I'm Not Likely To
Use Soon But Which May Be Useful Some Day.

> However, the resulting parser works, but it gives horrible error
> messages. This is a normal problem with generated parsers, there are
> plenty of books and articles covering how to persuade tools like yacc
> to produce usable error reports on parse failures.

Yeah, that still seems to be an unsolved problem.

> I don't know which solution I'll ultimately use, but it's an
> interesting exercise doing it both ways. And parsing binary data,
> unlike parsing text, is actually easy enough that hand crafting a
> parser isn't that much of a bother - maybe that's why there's less
> existing work in this area.

I'm a bit sceptical about that. Writing a hand-crafted parser for most
text-based grammars isn't that hard either, but there are readily
available tools (like yacc), so people use them (despite problems like
horrible error messages). For binary protocols, such tools are much less
well-known. It may be true that binary grammars seem simpler. But in
practice there are lots and lots of security holes because hand-crafted
parsers tend to take unwarranted shortcuts (see Heartbleed or the JPEG
parsing bug of the week), which an automatically generated parser would
not take.
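
To make the point about shortcuts concrete, here is a minimal sketch of
the "interpret a description" approach, using only the standard struct
module. The record layout (a 1-byte type, a 2-byte big-endian length,
then a payload) and the names HEADER_FIELDS, ParseError and parse_record
are made up for illustration; this is not how construct or any real
generator works, just the general idea.

```python
import struct

# Hypothetical record layout, declared as data: (field name, struct code).
# All names and the layout itself are assumptions for this sketch.
HEADER_FIELDS = [
    ("type", "B"),    # 1-byte message type
    ("length", "H"),  # 2-byte payload length
]


class ParseError(ValueError):
    """Raised when the input does not match the declared layout."""


def parse_record(data: bytes) -> dict:
    """Interpret HEADER_FIELDS against data, then read the payload."""
    record = {}
    offset = 0
    for name, code in HEADER_FIELDS:
        fmt = ">" + code  # big-endian, no padding
        size = struct.calcsize(fmt)
        # Every field read is bounds-checked in one place.
        if offset + size > len(data):
            raise ParseError(
                f"truncated input while reading {name!r} at offset {offset}"
            )
        (record[name],) = struct.unpack_from(fmt, data, offset)
        offset += size

    # Check the declared length against the bytes actually present.
    # In C, skipping this is an over-read; in Python the slice below
    # would silently come back short instead.
    end = offset + record["length"]
    if end > len(data):
        raise ParseError(
            f"declared payload length {record['length']} exceeds "
            f"the {len(data) - offset} bytes available"
        )
    record["payload"] = data[offset:end]
    return record
```

With this, parse_record(b"\x01\x00\x03abc") yields type 1, length 3 and
payload b"abc", while an input that claims a 65535-byte payload but
carries only three bytes is rejected with a ParseError. The checks live
in the interpreter rather than in each format's code, which is the
property I would expect from a generated parser as well.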


   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp at         | management tools.
__/   | | -- Ross Anderson <>