
python3, regular expression and bytes text

First of all, many thanks to everyone for the active participation.

@Chris Angelico
I think I understand what you illustrated with the bytes example;
it makes sense. As it was developed for 8-bit encodings only,
it cannot be used for multibyte encodings.

@Richard Damon and @MRAB
thank you very much for the information too, very much appreciated.

I think I understand what you all mean but I am not sure
how to put this all together.

Maybe a little bit more information about what I wanted to do.

I am using Notepad++ and Scintilla.
Scintilla passes me a read-only pointer to the current buffer
via SCI_GETCHARACTERPOINTER.
The problem is that the buffer can be in any encoding
(cp1251, cp1252, utf8, ucs-2, ...), but Scintilla tells me
which encoding is currently in use.

I wanted to implement a regular expression tester in Python3
and mark the text that has been matched by the regular expressions.

After testing the approach of treating everything as a Python3 str,
I found out that the positions of the matched text are not reported
correctly.
E.g., say I want to find the word "Ärger", assumed to be encoded in
utf8, with the regex \w+.
If I decode it, the match has a length of 5 (characters), whereas the
word has a length of 6 (bytes) within the document, so marking the
match would be wrong, wouldn't it?
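
To make the mismatch concrete, here is a minimal sketch of what I see.
It assumes the buffer content is utf8 and uses a literal string in
place of the real Scintilla buffer:

```python
import re

text = "Ärger"                 # decoded str, as seen by the re module
raw = text.encode("utf-8")     # stand-in for the bytes in the document buffer

match = re.search(r"\w+", text)

char_len = len(text)           # length counted in characters
byte_len = len(raw)            # length counted in bytes ("Ä" takes 2 bytes)
span = match.span()            # offsets are character offsets, not byte offsets

print(char_len, byte_len, span)
```

The span reported by the match is in characters, while the positions
in the document are in bytes.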

I understand the reason for the difference.

If I use the built-in find dialog of Notepad++, which internally uses
the boost::regex engine, I can use \w+ to find the word.

So that's where I'm stuck at the moment.
How can I find and mark those matches correctly?
Wrapping boost::regex with ICU support?
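
Or would it be enough to stay in Python and convert the character
offsets back to byte offsets by re-encoding the text in front of each
match? A rough sketch of what I have in mind; the helper name
char_span_to_byte_span is made up by me, and I have only thought it
through for encodings without a BOM (e.g. "utf-8" or "utf-16-le"):

```python
import re

def char_span_to_byte_span(text, start, end, encoding="utf-8"):
    """Convert a (start, end) span in characters to a span in encoded bytes.

    Hypothetical helper: re-encodes the prefix and the match itself, so it
    costs O(len(text)) per match. Use a BOM-less codec name, otherwise
    each encode() call would add its own byte order mark.
    """
    byte_start = len(text[:start].encode(encoding))
    byte_end = byte_start + len(text[start:end].encode(encoding))
    return byte_start, byte_end

text = "viel Ärger heute"
byte_spans = [
    char_span_to_byte_span(text, *m.span())
    for m in re.finditer(r"\w+", text)
]
print(byte_spans)
```

The byte spans could then be used for marking in the document instead
of the spans reported by the match objects.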