
python3, regular expression and bytes text

On 2019-10-12 20:48, Serhiy Storchaka wrote:
> 12.10.19 21:08, Eko palypse wrote:
>> So how can I make it work with utf8 encoded text?
> You cannot. First, \w in re.LOCALE works only when the text is encoded
> with the locale encoding (cp1252 in your case). Second, re.LOCALE
> supports only 8-bit charsets. So even if you set the utf-8 locale, it
> would not help.
> Regular expressions with re.LOCALE are slow. It may be more efficient to
> decode the text and use a Unicode regular expression.

It's best to treat re.LOCALE as being only for legacy encodings that
use 8 bits per character. Wherever possible, decode to Unicode and
work with that instead.
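
As a rough sketch of what that looks like in practice (the sample text and
variable names here are made up for illustration, and it assumes the input
really is valid UTF-8):

```python
import re

# Hypothetical sample: UTF-8 encoded bytes containing non-ASCII letters.
raw = "caf\u00e9 na\u00efve".encode("utf-8")

# A bytes pattern without re.LOCALE treats only ASCII letters, digits and
# underscore as \w, so the multi-byte UTF-8 sequences split the words.
bytes_words = re.findall(rb"\w+", raw)

# Decode first, then match with a str pattern; \w is Unicode-aware by
# default for str patterns in Python 3.
text = raw.decode("utf-8")
str_words = re.findall(r"\w+", text)

# If bytes are needed afterwards, encode the matches again.
encoded_words = [w.encode("utf-8") for w in str_words]
```

Here bytes_words comes back in ASCII-only fragments, while str_words holds
the whole words. If the data might not be valid UTF-8, the decode step
would need an errors= policy chosen to suit the data.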