python3, regular expression and bytes text


What needs to be set in order to run an re search on
UTF-8 encoded bytes?


My test, run on a Windows PC with a cp1252 locale, looks like this:

import re
import locale


cp1252 = 'Ärger im Paradies'.encode('cp1252')
utf8 = 'Ärger im Paradies'.encode('utf-8')

print('cp1252:', cp1252)
print('utf8  :', utf8)
print('-'*80)
print("search for 'Ärger'.encode('cp1252') in cp1252 encoded text")
for m in re.finditer('Ärger'.encode('cp1252'), cp1252):
    print(m)

print('-'*80)
print("search for 'Ärger'.encode('') in utf8 encoded text")
for m in re.finditer('Ärger'.encode(), utf8):
    print(m)


print('-'*80)
print("search for '\\w+'.encode('cp1252') in cp1252 encoded text")
for m in re.finditer('\\w+'.encode('cp1252'), cp1252):
    print(m)

print('-'*80)
print("search for '\\w+'.encode('') in utf8 encoded text")
for m in re.finditer('\\w+'.encode(), utf8):
    print(m)

locale.setlocale(locale.LC_ALL, '')
print('-'*80)
print("search for '\\w+'.encode('cp1252') using re.LOCALE in cp1252 encoded text")
for m in re.finditer('\\w+'.encode('cp1252'), cp1252, re.LOCALE):
    print(m)

print('-'*80)
print("search for '\\w+'.encode('') using ??? in utf8 encoded text")
for m in re.finditer('\\w+'.encode(), utf8):
    print(m)



If you run this, you will get something like:



cp1252: b'\xc4rger im Paradies'
utf8  : b'\xc3\x84rger im Paradies'
--------------------------------------------------------------------------------
search for 'Ärger'.encode('cp1252') in cp1252 encoded text
<re.Match object; span=(0, 5), match=b'\xc4rger'>
--------------------------------------------------------------------------------
search for 'Ärger'.encode('') in utf8 encoded text
<re.Match object; span=(0, 6), match=b'\xc3\x84rger'>
--------------------------------------------------------------------------------


These two are OK, BUT the results for \w+ differ from what I want:


search for '\w+'.encode('cp1252') in cp1252 encoded text
<re.Match object; span=(1, 5), match=b'rger'>
<re.Match object; span=(6, 8), match=b'im'>
<re.Match object; span=(9, 17), match=b'Paradies'>
--------------------------------------------------------------------------------
search for '\w+'.encode('') in utf8 encoded text
<re.Match object; span=(2, 6), match=b'rger'>
<re.Match object; span=(7, 9), match=b'im'>
<re.Match object; span=(10, 18), match=b'Paradies'>
--------------------------------------------------------------------------------


It doesn't match the Ä, which is expected according to the documentation,
and the documentation hints at using the locale. So let's do that; the results are:


search for '\w+'.encode('cp1252') using re.LOCALE in cp1252 encoded text
<re.Match object; span=(0, 5), match=b'\xc4rger'>
<re.Match object; span=(6, 8), match=b'im'>
<re.Match object; span=(9, 17), match=b'Paradies'>
--------------------------------------------------------------------------------


This works for cp1252 BUT does not work for utf8:


search for '\w+'.encode('') using ??? in utf8 encoded text
<re.Match object; span=(2, 6), match=b'rger'>
<re.Match object; span=(7, 9), match=b'im'>
<re.Match object; span=(10, 18), match=b'Paradies'>
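
To illustrate why the utf8 case behaves this way, here is a minimal sketch. It assumes nothing beyond the standard re module: a bytes pattern is matched byte by byte, and the UTF-8 encoding of 'Ä' is two bytes, neither of which is in the ASCII range that \w covers by default. The variable names are only for illustration.

```python
import re

utf8 = 'Ärger im Paradies'.encode('utf-8')

# A bytes pattern is matched one byte at a time, so the two-byte
# UTF-8 sequence for 'Ä' is seen as two separate non-ASCII bytes.
first_char_bytes = re.findall(rb'.', 'Ä'.encode('utf-8'))
print(first_char_bytes)   # [b'\xc3', b'\x84']

# Without flags, \w in a bytes pattern only covers [a-zA-Z0-9_],
# so the leading two bytes are skipped.
ascii_words = re.findall(rb'\w+', utf8)
print(ascii_words)        # [b'rger', b'im', b'Paradies']
```

Since re.LOCALE classifies single bytes according to an 8-bit locale, it presumably cannot treat a multi-byte UTF-8 sequence as one word character.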


So how can I make it work with UTF-8 encoded text?
Note that decoding it to a string is not preferred, as this would mean
allocating the buffer a second time, and a buffer might be
several hundred MB or even GB in size.
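
For illustration, one possible workaround is sketched below. It is not a full equivalent of Unicode-aware \w: it assumes that every non-ASCII character may be treated as a word character, and it spells out the UTF-8 byte structure (a lead byte followed by continuation bytes) by hand. The name UTF8_WORD is hypothetical.

```python
import re

# Assumption: every non-ASCII character counts as a "word" character.
# In UTF-8 such a character is a lead byte (0xC2-0xF4) followed by
# one to three continuation bytes (0x80-0xBF).
UTF8_WORD = re.compile(rb'(?:[A-Za-z0-9_]|[\xc2-\xf4][\x80-\xbf]{1,3})+')

utf8 = 'Ärger im Paradies'.encode('utf-8')

# The bytes buffer is searched in place; only the small matches are
# decoded, never the whole buffer.
words = [m.group().decode('utf-8') for m in UTF8_WORD.finditer(utf8)]
print(words)   # ['Ärger', 'im', 'Paradies']
```

The obvious drawback is that non-ASCII punctuation, such as typographic quotes or dashes, would also be matched as part of a word, so this only fits data where that is acceptable.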

Thank you
Eren