python3, regular expression and bytes text
First of all many thanks to everyone for the active participation.
I think I understand what you illustrated with the byte example;
it makes sense. As it was developed for 8-bit encodings only,
it cannot be used for multibyte encodings.
@Richard Damon and @MRAB
thank you very much for the information too, very much appreciated.
I think I understand what you all mean but I am not sure
how to put this all together.
Maybe a little bit more information about what I wanted to do.
I am using notepad++ and scintilla.
Scintilla gives me a read-only pointer to the current buffer
via SCI_GETCHARACTERPOINTER.
The problem is that the buffer can be in any of several encodings
(cp1251, cp1252, utf8, ucs-2, ...), but scintilla tells me
which encoding is currently used.
I wanted to implement a regular expression tester with Python3
that marks the text matched by a regular expression.
After trying to treat everything as a python3 str, I found out that
the positions of the matched text are not reported correctly.
E.g., say I want to find the word "Ärger", assumed to be encoded in utf8, with
the regex \w+.
If I decode it, the match has a length of 5 characters, whereas the word is 6 bytes long within the document, so marking the match would be wrong, wouldn't it?
I understand the reason for the difference.
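To make the mismatch concrete, here is a small sketch. The sample word and
the variable names are only my assumptions for illustration; it compares
matching on the decoded str with matching directly on the utf8 bytes.

```python
import re

text = "Ärger"                      # assumed sample word
data = text.encode("utf-8")         # what the buffer would contain as utf8

# Matching on the decoded str: offsets are counted in characters.
str_match = re.search(r"\w+", text)
print(str_match.span())             # (0, 5)

# Matching on the raw bytes: \w is ASCII-only in a bytes pattern,
# so the two bytes of "Ä" are skipped and only b"rger" is found.
bytes_match = re.search(rb"\w+", data)
print(bytes_match.span())           # (2, 6)

print(len(text), len(data))         # 5 6
```

Neither span is the byte range 0..6 that I would need in order to mark the
whole word in the document.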
If I use the built-in find dialog of notepad++, which internally uses the boost::regex engine, I can use \w+ to find the word.
So that's where I'm stuck at the moment.
How can I find and mark those matches correctly?
Wrapping boost::regex with ICU support?
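One alternative I could imagine, purely as an untested idea on my side:
match on the decoded str and translate the character offsets back to byte
offsets by re-encoding the text between matches. The helper name byte_spans
is hypothetical, and the sketch assumes an encoding name that does not emit
a BOM on every encode call (e.g. "utf-8", "cp1252", "utf-16-le").

```python
import re

def byte_spans(pattern, data, encoding):
    """Yield (start, end) byte offsets of regex matches in an encoded buffer.

    Hypothetical helper: decodes the buffer, matches on the str, and maps
    each character span back to byte offsets by re-encoding the pieces.
    """
    text = data.decode(encoding)
    char_pos = 0   # character offset up to which bytes are already counted
    byte_pos = 0   # byte offset corresponding to char_pos
    for m in re.finditer(pattern, text):
        start, end = m.span()
        # Bytes taken by the unmatched text before this match.
        byte_pos += len(text[char_pos:start].encode(encoding))
        byte_start = byte_pos
        # Bytes taken by the match itself.
        byte_pos += len(text[start:end].encode(encoding))
        char_pos = end
        yield byte_start, byte_pos

utf8_buffer = "x Ärger ö".encode("utf-8")
print(list(byte_spans(r"\w+", utf8_buffer, "utf-8")))    # [(0, 1), (2, 8), (9, 11)]

cp1252_buffer = "Ärger".encode("cp1252")
print(list(byte_spans(r"\w+", cp1252_buffer, "cp1252")))  # [(0, 5)]
```

Each piece of text is encoded only once, so the cost should stay roughly
linear in the buffer size, but I have not checked how this behaves with
large documents.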