python3, regular expression and bytes text
On 2019-10-12 20:57, Eko palypse wrote:
>> You cannot. First, \w in re.LOCALE works only when the text is encoded
>> with the locale encoding (cp1252 in your case). Second, re.LOCALE
>> supports only 8-bit charsets. So even if you set the utf-8 locale, it
>> would not help.
>> Regular expressions with re.LOCALE are slow. It may be more efficient to
>> decode the text and use a Unicode regular expression.
> Thank you. I guess I'm convinced to always decode everything (re pattern and text) to utf8 internally and then do the re search. But then I would need to figure out the correct position, hmm. Some ongoing investigation needed, I guess.
You don't _decode_ to UTF-8, you _decode_ to Unicode and _encode_ to UTF-8:
Decode: UTF-8 (bytes) => Unicode (str)
Encode: Unicode (str) => UTF-8 (bytes)
How the Unicode text is stored internally is an implementation detail.
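
As for figuring out the correct position: a match on the decoded str reports character offsets, not byte offsets. One way to get back to a byte offset is to re-encode the part of the text that precedes the match and count its bytes. The following is only a minimal sketch, assuming the data is valid UTF-8; the helper name byte_span and the sample text are made up for illustration.

```python
import re

# Sample data, assumed to arrive as UTF-8 encoded bytes.
data = "na\u00efve caf\u00e9 w\u00f6rter".encode("utf-8")

# Decode: UTF-8 bytes => str (Unicode).
text = data.decode("utf-8")

def byte_span(text, match, encoding="utf-8"):
    """Map a match on the decoded str to (start, end) byte offsets.

    Hypothetical helper. It assumes an encoding that does not prepend
    a BOM when encoding (true for "utf-8", not for "utf-16").
    """
    start = len(text[:match.start()].encode(encoding))
    end = start + len(match.group().encode(encoding))
    return start, end

# With a str pattern, \w matches Unicode word characters by default.
match = re.search(r"caf\w", text)

print(match.span())            # character offsets: (6, 10)
print(byte_span(text, match))  # byte offsets:      (7, 12)

# Encode: str (Unicode) => UTF-8 bytes; the byte slice round-trips.
assert data[7:12] == match.group().encode("utf-8")
```

Re-encoding the prefix costs time proportional to the match position, so for many matches in a large text it may be worth building the offset table once instead.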