python3, regular expression and bytes text

On 10/12/19 3:46 PM, Eko palypse wrote:
> Thank you very much for your answer.
>> You have to be able to match bytes, not strings.
> May I ask you to elaborate on this, sorry non-native English speaker.
> The buffer I receive is a byte-like buffer.
>> I don't think you'll be able to 100% reliably match bytes in this way.
>> You're asking it to make analysis of multiple bytes and to interpret
>> them according to which character they would represent if decoded from
>> UTF-8.
>> My recommendation: Even if your buffer is multiple gigabytes, just
>> decode it anyway. Maybe you can decode your buffer in chunks, but
>> otherwise, just bite the bullet and do the decode. You may be
>> pleasantly surprised at how little you suffer as a result; Python is
>> quite decent at memory management, and even if you DO get pushed into
>> the swapper by this, it's still likely to be faster than trying to
>> code around all the possible problems that come from mismatching your
>> text search.
>> ChrisA
> That's what I was afraid of. 
> It would be nice if the "world" could commit itself to one standard, 
> but I'm afraid that won't happen in my life anymore, I guess. :-(
> Thx
> Eren

Current 'best practices' are, in my opinion, to convert data (if needed)
to some form of Unicode (UTF-8, UTF-16, or UCS-4) at input and process
in that domain. You do need to be prepared to run into files which are
encoded in some locally defined 8-bit code page. In Python 3, strings are
Unicode, and you don't need to worry about the details of which encoding
is used internally; Python will deal with that itself.
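
As a rough sketch of what "decode at input" could look like (the helper
name decode_input and the cp1252 fallback are only assumptions for
illustration; your data may need a different code page), something along
these lines tries UTF-8 first, falls back to an 8-bit code page, and then
runs the regular expression on the resulting str:

```python
import re


def decode_input(raw: bytes, fallback: str = "cp1252") -> str:
    """Decode raw bytes to str: UTF-8 first, then an 8-bit code page.

    Hypothetical helper; the cp1252 default is an assumption, not
    something the original buffer is known to use.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume a legacy 8-bit code page.
        # Bytes that are undefined in that code page become U+FFFD.
        return raw.decode(fallback, errors="replace")


# The same text arriving in two different encodings.
utf8_buffer = "naïve café".encode("utf-8")
legacy_buffer = "naïve café".encode("cp1252")

text_a = decode_input(utf8_buffer)
text_b = decode_input(legacy_buffer)

# On str, \w is Unicode-aware, so it matches the accented letter.
pattern = re.compile(r"caf\w")
match_a = pattern.search(text_a)
match_b = pattern.search(text_b)

# On bytes, \w only covers ASCII [a-zA-Z0-9_], so the two bytes that
# encode the accented letter in UTF-8 do not match.
bytes_match = re.search(rb"caf\w", utf8_buffer)
```

Once both buffers are decoded, the same pattern finds the word in each,
while the bytes pattern on the undecoded UTF-8 buffer finds nothing.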

Richard Damon