python3, regular expression and bytes text

On Sun, Oct 13, 2019 at 6:54 AM Eko palypse <ekopalypse at> wrote:
> Thank you very much for your answer.
> > You have to be able to match bytes, not strings.
> May I ask you to elaborate on this, sorry non-native English speaker.
> The buffer I receive is a byte-like buffer.

When you're matching text (the normal way you use a regular
expression), every element in the RE matches a character (or
emptiness). For instance, the regular expression "^[bc]at$" has these
elements:

"^" matches emptiness at the start
"[bc]" matches either "b" or "c", nothing else
"a" matches "a", exactly
"t" matches "t"
"$" matches emptiness at the end.
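
As a quick sketch of the element-by-element matching described above
(the candidate words are just made-up test inputs, not anything from
your buffer):

```python
import re

# Text pattern: each element matches one character or an empty position.
pattern = re.compile(r"^[bc]at$")

# Hypothetical sample words, chosen only to exercise each element.
candidates = ["bat", "cat", "rat", "cats"]
results = {word: bool(pattern.match(word)) for word in candidates}
print(results)
```

"rat" fails at the "[bc]" element, and "cats" fails at "$" because
there is still a character left after the "t".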

When you're working with bytes, the same has to be true of the bytes.
But that means you can't match against b"\xc3\x84" - you have to match
b"\xc3" and b"\x84" separately. That's fine if all you need to do is
match a single specific character (in the same way that the "at" in my
text example will match "a" followed by "t"), but you can't ask
questions like "do the next few bytes represent a word character".
All you can do is ask "does the next ONE BYTE represent a word
character".
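
Here is a small illustration of that difference. It assumes the
buffer is UTF-8, and uses the single character U+00C4 (capital A with
diaeresis) as a stand-in for your data:

```python
import re

raw = "\u00c4".encode("utf-8")   # one character, two bytes
assert raw == b"\xc3\x84"

# Bytes pattern: \w is ASCII-only by default and tests one byte at a time,
# so neither 0xC3 nor 0x84 counts as a word character.
bytes_hits = re.findall(rb"\w", raw)

# Text pattern: after decoding, \w sees a single word character.
text_hits = re.findall(r"\w", raw.decode("utf-8"))

# Matching the exact byte sequence still works, one byte per element.
literal_hit = re.search(rb"\xc3\x84", raw)

print(bytes_hits, text_hits, literal_hit is not None)
```

So a literal search survives on bytes, but a character class like \w
only gives a sensible answer once the bytes have been decoded.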

> > I don't think you'll be able to 100% reliably match bytes in this way.
> > You're asking it to make analysis of multiple bytes and to interpret
> > them according to which character they would represent if decoded from
> > UTF-8.
> >
> > My recommendation: Even if your buffer is multiple gigabytes, just
> > decode it anyway. Maybe you can decode your buffer in chunks, but
> > otherwise, just bite the bullet and do the decode. You may be
> > pleasantly surprised at how little you suffer as a result; Python is
> > quite decent at memory management, and even if you DO get pushed into
> > the swapper by this, it's still likely to be faster than trying to
> > code around all the possible problems that come from mismatching your
> > text search.
> >
> > ChrisA
> That's what I was afraid of.
> It would be nice if the "world" could commit itself to one standard,
> but I'm afraid that won't happen in my life anymore, I guess. :-(

The world HAS committed to one standard - or one that matters. That
standard is Unicode text. Not UTF-8, but Unicode text. Decode your
bytes to text, *then* do your searches.
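
If decoding everything at once really is too much, one possible way
to decode in chunks is the standard library's incremental decoder.
This is only a sketch: the helper name decode_in_chunks, the sample
text and the chunk boundaries are all made up for illustration, and
it assumes the buffer is UTF-8.

```python
import codecs
import re


def decode_in_chunks(chunks, encoding="utf-8"):
    """Yield decoded text for each bytes chunk (hypothetical helper).

    The incremental decoder holds back a partial character at the end
    of one chunk and completes it when the next chunk arrives.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    # final=True raises UnicodeDecodeError if the data ends mid-character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# Made-up sample data, split on purpose in the middle of a two-byte character.
data = "\u00c4pfel und \u00d6l".encode("utf-8")
chunks = [data[:1], data[1:7], data[7:]]

text = "".join(decode_in_chunks(chunks))
words = re.findall(r"\w+", text)
print(words)
```

Note that this joins the chunks back together before searching. If
you search each decoded chunk separately, a match that straddles a
chunk boundary can still be missed, so that case needs extra care.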