Handle foreign character web input


On 6/29/19 3:19 AM, Thomas Jollans wrote:
> On 28/06/2019 22:25, Tobiah wrote:
>> A guy comes in and enters his last name as RÖnngren.
> With a capital Ö in the middle? That's unusual.
>>
>> So what did the browser really give me; is it encoded
>> in some way, like latin-1?  Does it depend on whether
>> the name was cut and pasted from a Word doc. etc?
>> Should I handle these internally as unicode?  Right
>> now my database tables are latin-1 and things seem
>> to usually work, but not always.
>
>
> If your database is using latin-1, German and French names will work,
> but Croatian and Polish names often won't. Not to mention people using
> other writing systems.
>
> So Günther and François are ok, but Bolesław turns into Boles?aw and
> don't even think about anybody called ???????? or ????. 

I would say that currently, the only real reason to use an encoding
other than a Unicode one (normally UTF-8) would be historical inertia.
Maybe a field that will only ever hold plain ASCII characters could use
ASCII (such a field would never contain real natural-language words,
only computer-generated codes). All the various 'codepages' were useful
in their day, when machines were less capable and Unicode hadn't been
invented, wasn't well supported, or was too expensive to use.
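
To make the difference concrete, here is a small sketch using the names
from the quoted message. The helper fits_latin1 is a hypothetical name
of my own, not part of any library; it only checks whether a string can
be stored in a latin-1 column without loss.

```python
names = ["Günther", "François", "Bolesław"]

def fits_latin1(name):
    """Return True if every character of name exists in latin-1."""
    try:
        name.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False

# German and French names fit; the Polish one does not.
results = {name: fits_latin1(name) for name in names}

# Forcing the issue with errors="replace" silently loses the character.
lossy = "Bolesław".encode("latin-1", errors="replace").decode("latin-1")

# UTF-8 can represent all of them and round-trips without loss.
utf8_roundtrip = [n.encode("utf-8").decode("utf-8") for n in names]

print(results)
print(lossy)
```

The lossy replacement is what produces the "Boles?aw" mentioned above;
with UTF-8 the question never comes up.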

Now (as I understand it), all Python 3 strings (str) are Unicode
internally; if you need text in some particular encoding, it has to be
held as bytes.
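
A rough sketch of what that means for web input follows. It assumes
the browser submitted the form as UTF-8 (usual when the page itself is
served as UTF-8, but an assumption all the same), and the
percent-encoded value below is a made-up example, not something taken
from the original report.

```python
from urllib.parse import unquote_to_bytes

# Hypothetical percent-encoded form value as it might arrive on the wire.
wire_value = "R%C3%96nngren"

# What the browser actually sent is bytes, not text.
raw = unquote_to_bytes(wire_value)

# Decode once at the boundary; from here on it is a Unicode str.
text = raw.decode("utf-8")

# Decoding the same bytes with the wrong codec does not fail,
# it just yields garbage (mojibake).
wrong = raw.decode("latin-1")

# Encode only when leaving the program, e.g. for a legacy latin-1 column.
for_legacy_db = text.encode("latin-1")

print(repr(raw), repr(text), repr(wrong), repr(for_legacy_db))
```

The idea is to decode at the edge, work with str everywhere inside the
program, and encode again only when writing out.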

-- 
Richard Damon