git.net

Handle foreign character web input


On 6/28/19 4:25 PM, Tobiah wrote:
> A guy comes in and enters his last name as Rönngren.
> 
> So what did the browser really give me; is it encoded
> in some way, like latin-1? Does it depend on whether
> the name was cut and pasted from a Word doc, etc.?
> Should I handle these internally as Unicode? Right
> now my database tables are latin-1 and things seem
> to usually work, but not always.
> 
> Also, what do people do when searching for a record?
> Is there some way to get 'Ronngren' to match the other
> possible foreign spellings?

The first thing I'd want to do is add a front-end step that detects the 
incoming character set (latin-1, whatever) and converts the data to one 
standard encoding, UTF-8.  For example, if the input is latin-1 bytes:

    data.decode('latin1').encode('utf8')

That gets rid of character set variations in the data, simplifying 
things before any of the hard work has to be done.
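
A rough sketch of that front-end step, assuming the form data arrives 
as raw bytes and that anything that is not valid UTF-8 is latin-1 (the 
function name to_text and the fallback choice are illustrative 
assumptions, not something the browser guarantees):

```python
def to_text(raw: bytes, fallback: str = "latin-1") -> str:
    """Decode submitted bytes as UTF-8, else fall back to a single-byte charset."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: non-UTF-8 input is latin-1. latin-1 maps every byte
        # to a code point, so this decode cannot fail, but it can guess wrong.
        return raw.decode(fallback)


# The same name submitted in two encodings ends up as one Unicode string.
from_utf8 = to_text(b"R\xc3\xb6nngren")   # UTF-8 bytes
from_latin1 = to_text(b"R\xf6nngren")     # latin-1 bytes
utf8_for_storage = from_latin1.encode("utf-8")
```

Trying UTF-8 first is reasonably safe because accented latin-1 text is 
rarely valid UTF-8 by accident; the reverse order would silently 
accept everything.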

Then you have a choice - store and index everything as UTF-8, or 
transliterate some or all strings to 7-bit US ASCII.  If you 
transliterate, you have to perform the same processing on input search 
strings.

I have not used it myself, but there is a Python port of a Perl module 
by Sean M. Burke called Unidecode.  It transliterates non-ASCII strings 
into ASCII, substituting a reasonable ASCII equivalent for each 
non-ASCII character.  I believe that there are other packages that can 
also do this.
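
For a feel of what such transliteration does, here is a minimal sketch 
that uses only the standard library's unicodedata module rather than 
Unidecode itself.  It is a narrower technique: it only strips accents 
that Unicode can decompose, and the name ascii_fold is made up for 
this example.

```python
import unicodedata


def ascii_fold(text: str) -> str:
    """Strip accents by decomposing characters and dropping combining marks."""
    # NFKD splits 'ö' into 'o' plus a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Anything still outside ASCII has no decomposition and is dropped here.
    return no_marks.encode("ascii", "ignore").decode("ascii")


folded = ascii_fold("Rönngren")       # 'Ronngren'
dessert = ascii_fold("Crème brûlée")  # 'Creme brulee'
lossy = ascii_fold("Søren")           # 'Sren': 'ø' does not decompose
```

The last line shows the limit of this approach: letters such as 'ø' 
are simply lost, which is the kind of case a table-driven package like 
Unidecode is meant to handle with a substitution.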

The easy way to use packages like this is to transliterate entire 
records before putting them into your database, but then you may perplex 
or even offend some users who will look at a record and say "What's 
this?  That's not French!"  You'll also have to transliterate all input 
search strings.

A more sophisticated way is to leave the records in Unicode, but add 
transliterated index strings for those index strings that contain 
non-ASCII characters.

There are various ways to do this that trade off time, space, and 
programming effort.  You can store two versions of each record, search 
one and display the other.  You can just process index strings and add 
the transliterations to the record.  What to choose depends on your 
needs and resources.
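
One way the "search one version, display the other" idea might look, 
as an in-memory sketch.  The class NameIndex and the helper search_key 
are hypothetical names; in a real database the key would more likely 
be an extra indexed column.  The folding uses the standard library 
only, so it shares the limits of plain accent stripping.

```python
import unicodedata
from collections import defaultdict


def search_key(text: str) -> str:
    """Build an ASCII, case-insensitive key from a Unicode string."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return no_marks.encode("ascii", "ignore").decode("ascii").casefold()


class NameIndex:
    """Keep names as entered for display; look them up by folded key."""

    def __init__(self) -> None:
        self._by_key = defaultdict(list)

    def add(self, name: str) -> None:
        self._by_key[search_key(name)].append(name)

    def find(self, query: str) -> list:
        # The query gets the same processing as the stored names.
        return list(self._by_key.get(search_key(query), []))


index = NameIndex()
index.add("Rönngren")
index.add("Ronngren")
matches = index.find("ronngren")  # both spellings, as originally entered
```

The stored names are never altered, so nobody sees a transliterated 
record; only the lookup key is ASCII.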

And of course all bets are off if some of your data is Chinese, 
Japanese, Hebrew, or maybe even Russian or Greek.

Sometimes I think, Why don't we all just learn Esperanto?  But we all 
know that that isn't going to happen.

     Alan