python3, regular expression and bytes text


On 2019-10-12 20:48, Serhiy Storchaka wrote:
> 12.10.19 21:08, Eko palypse wrote:
>> So how can I make it work with utf8 encoded text?
> 
> You cannot. First, \w in re.LOCALE works only when the text is encoded
> with the locale encoding (cp1252 in your case). Second, re.LOCALE
> supports only 8-bit charsets. So even if you set the utf-8 locale, it
> would not help.
> 
> Regular expressions with re.LOCALE are slow. It may be more efficient to
> decode text and use Unicode regular expression.
> 
+1

It's best to treat re.LOCALE as being only for legacy encodings that
use 8 bits per character. Wherever possible, decode to Unicode and
work with that instead.
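
A minimal sketch of the difference, assuming the input is valid UTF-8.
The sample string and the variable names are made up for illustration:

```python
import re

# Hypothetical sample: UTF-8 encoded bytes containing non-ASCII letters.
data = "Größe naïve café".encode("utf-8")

# With a bytes pattern and no flags, \w matches only ASCII word
# characters, so each word is cut at its non-ASCII bytes.
bytes_words = re.findall(rb"\w+", data)

# Decode first, then match with a str pattern. For str patterns \w is
# Unicode-aware by default, so whole words are found.
text = data.decode("utf-8")
str_words = re.findall(r"\w+", text)

print(bytes_words)  # [b'Gr', b'e', b'na', b've', b'caf']
print(str_words)    # ['Größe', 'naïve', 'café']
```

If you need bytes again afterwards, encode the matches back with
.encode("utf-8"). Note that match offsets from the str pattern are
character positions, not byte positions.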