Subject: Unicode normalization on MacOS/OSX
(internationalization and interoperability issue)



To Adam Fisk (LimeWire),
To other LimeWire GUI/Core developers,
For information, to the GDF subscribers too,

This email is quite long, sorry, but this is a complex issue which may
require immediate changes in LimeWire (or in other servents as well): first
because of a bug in Java for MacOS (and in MacOS itself for servents written
in C/C++), but also because it raises immediate internationalization issues
and already existing interoperability problems. It will also discuss some
future evolutions in a critical part of the search algorithm used on
Gnutella.

If you have followed the discussions related to the problem found in
MacOS/OSX when HFS+ filenames are exposed to the application as decomposed
Unicode strings (in fact in FCD form, as I found after looking into the Apple
VFS sources, and not, as I initially thought, in NFD form), you already know
the context.

We are concerned by the fact that MANY filename strings exposed by MacOS do
NOT even use simple ISO-8859-1 codes for ISO-8859-1 composite characters, and
by the fact that some of these decomposed characters are displayed correctly
on the Mac but will never be displayed correctly on other systems.

This raises many interoperability problems in LimeWire, even among MacOS
users themselves, because this behavior depends on the filesystem used on
each Mac (it may be HFS+, encoded and exposed with Unicode FCD; or HFS,
encoded with a legacy reduced 8-bit Mac character set and exposed as Unicode
NFC; or UFS/NTFS/FAT32, encoded with Unicode NFC).

So when reading filenames from filesystems, we really are affected by these
differences, and by the propagation on the network of different binary
encodings of Unicode strings, even though they are all valid and canonically
equivalent according to the NFC or NFD encoding forms.

MacOS/OSX filesystems all accept storing files with names in NFC, NFD, NFKD
or NFKC form, but all of them will be forced into the denormalized FCD form.
Sharing FCD strings on the network is supported nowhere else than on
MacOS/OSX, and only for file storage purposes on HFS+ volumes.

We are also concerned by the fact that Asian input methods or keyboard
drivers expose strings to Java in other denormalized forms, which may be
interoperable only within these systems, mainly on Unix and Windows.

Given that the vast majority of servents on the network expect the NFC form,
which is canonically equivalent to the normalized NFD and FCD forms but
offers much broader support than any other encoding form, it seems that we
cannot avoid, even in the case of Western European languages, the need for a
canonicalizing NFC converter for filenames read from the filesystem.

On MacOS/OSX, using such a converter will always be safe, given that HFS+
will always force its own "canonicalization" (which does not conform to any
Unicode standard, except for a related technical note discussing the "fast
decomposition" techniques used to perform some internal processing of
strings, in which the FCC and FCD forms are described). It will also enhance
the user experience on the Mac, because the Mac system cannot even display
correctly filenames using Apple's FCD encoding form, due to limitations in
font renderers that cannot compose characters to find the appropriate glyph
in fonts. This is true even for French or German MacOS users!

We will then probably need to proceed in two steps to integrate a Unicode
NFC normalizer in LimeWire, for example the one in the ICU4J package
(open-sourced by IBM under an X-licence and already licensed by Sun in Java),
or a reduced version of ICU4J (where we would just import the Unicode
normalizer functions and their NFC/NFD conversion tables).

Note that character composition in the current Unicode 3.2 version only
concerns fewer than 2000 pairs of base+combining characters (if performed
recursively to handle the case of multiple compositions) and the 11172
compositions of Korean Hangul syllables (performed algorithmically, without a
table).
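
To make the algorithmic Hangul part concrete, here is a minimal sketch of the
composition rule for conjoining jamos, following the constants published in
the Unicode standard (class and method names here are mine, not existing
LimeWire code):

public final class HangulComposer {
    private static final int S_BASE = 0xAC00, L_BASE = 0x1100,
                             V_BASE = 0x1161, T_BASE = 0x11A7;
    private static final int L_COUNT = 19, V_COUNT = 21, T_COUNT = 28;
    private static final int N_COUNT = V_COUNT * T_COUNT;   // 588
    private static final int S_COUNT = L_COUNT * N_COUNT;   // 11172 syllables

    // Composes a leading consonant (L) and a vowel (V) into an LV syllable,
    // or an LV syllable and a trailing consonant (T) into an LVT syllable.
    // Returns -1 when the pair does not compose.
    static int composeHangul(int first, int second) {
        if (first >= L_BASE && first < L_BASE + L_COUNT
                && second >= V_BASE && second < V_BASE + V_COUNT) {
            return S_BASE
                + ((first - L_BASE) * V_COUNT + (second - V_BASE)) * T_COUNT;
        }
        if (first >= S_BASE && first < S_BASE + S_COUNT
                && (first - S_BASE) % T_COUNT == 0
                && second > T_BASE && second < T_BASE + T_COUNT) {
            return first + (second - T_BASE);
        }
        return -1; // not a composable Hangul pair
    }
}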

There are some tricky cases when performing any string normalization in
Unicode: the case of multiple diacritics that must be reordered after first
converting the input string to NFD; the case of multiple diacritics with the
same combining class, which must not be reordered and whose composition must
be tried in their existing order; the more complex case of intermediate
diacritics with different combining classes that may not combine with the
base character but do not block combining with a further diacritic; and the
associated case where two remaining diacritics may combine with each other if
not blocked by another intermediate diacritic.
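
To illustrate just the reordering rule (not the full composition logic), here
is a sketch of the canonical reordering pass; getCombiningClass() is a
placeholder for a lookup into the Unicode Character Database, and the sketch
is simplified to BMP characters:

public final class CanonicalOrder {
    // Stable insertion sort by combining class: a mark only moves past marks
    // with a STRICTLY higher class, so marks sharing a class keep their order.
    static void reorder(char[] text, int len) {
        for (int i = 1; i < len; i++) {
            int cc = getCombiningClass(text[i]);
            if (cc == 0) continue;                    // starters never move
            int j = i;
            while (j > 0 && getCombiningClass(text[j - 1]) > cc) {
                char tmp = text[j]; text[j] = text[j - 1]; text[j - 1] = tmp;
                j--;
            }
        }
    }

    static int getCombiningClass(char c) {
        return 0; // placeholder: would read the canonical combining class (ccc)
    }
}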

In a first attempt, I looked at an Apple technote related to the HFS+
internal encoding. Then I found that this technote was really old and only
considered the case of Unicode 2.0 rules, also omitting many combinations
that were not documented (notably the Japanese voice marks applied to
Hiragana/Katakana syllables to modify their leading consonant sound, which
one of our users reported as a bug in LimeWire).

So I looked at another associated document, related to the integration of
HFS+ into the Apple VFS (Virtual File System), which allows MacOS/OSX to use
many Apple and non-Apple filesystems through a common Mac API in
applications. However, the VFS design overlooked this interoperability case;
even in MacOSX 10.2 such support is still missing, and its conversion
routines lack important Unicode additions.

The current Apple FCD algorithm is flawed, because it is not stable and does
not work well with all other GUI elements, including the Finder, and Apple
expects to change its policy regarding the support of Unicode strings in
HFS+, preferring an NFC or NFD composition form, whose stability is
guaranteed across versions of Unicode and which has far fewer
interoperability issues. The initial "performance" gains when handling
Unicode strings in filenames for operations like ordering and B-tree storage
are now irrelevant, as the B-tree just needs a coherent and stable form for
equivalent strings, and logical ordering (in the GUI) is locale dependent and
thus does not have to match the binary ordering of strings on HFS+ volumes.

This issue is critical for many reasons: the QRP algorithm does not match
canonically equivalent filenames across different platforms (for example, a
French, German or Spanish Windows user cannot exchange files whose names
contain French, German or Spanish accents with a French, German or Spanish
Macintosh user); and the GUI font renderer appropriate for the users' locales
most often cannot display correctly strings that are not normalized to their
canonical NFC form.

So the idea would be to force the NFC composition of strings both in the
GUI, and in all strings used in the protocol. This issue is *independent* of
the fact that we use UTF-8, ISO-8859-1 or other legacy encodings in Gnutella
messages. It is also *independent* of whether other servents can handle
messages using Unicode: for example, if you look at the FCD encoding form
{e; acute} used on MacOS to store a simple "é" (U+00E9), if we send it with
ISO-8859-1 we will send the {e} (U+0065) but not the {acute} (U+0301), which
is not convertible to ISO-8859-1, or we will send a wrong character such as a
{SOH control character} (U+0001). This will potentially break some messages,
and will definitely not match in QRP or with other servents that expect to
see the single "é" character (U+00E9).
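
To make the mismatch concrete, here is a small demonstration written against
the java.text.Normalizer class of modern Java (which did not exist in the
Java 1.1.8 / MRJ 2.5 runtimes discussed here, where our own converter would
be used instead):

import java.text.Normalizer;

public class AccentDemo {
    public static void main(String[] args) throws Exception {
        String decomposed = "e\u0301";  // {e, combining acute}, as exposed by HFS+
        String composed = Normalizer.normalize(decomposed,
                                               Normalizer.Form.NFC); // single U+00E9

        // ISO-8859-1 has no code for U+0301: the unmappable combining mark
        // is silently replaced by '?', so the keyword no longer matches.
        System.out.println(new String(decomposed.getBytes("ISO-8859-1"),
                                      "ISO-8859-1")); // prints "e?"
        System.out.println(new String(composed.getBytes("ISO-8859-1"),
                                      "ISO-8859-1")); // prints "é"
    }
}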

So I propose this migration scheme:

1) First, integrate and test an NFC/NFD converter in LimeWire. Its
compliance with Unicode can be tested using the test file provided on the
Unicode.org web site (a small conformance checker is sketched just below). I
have started to perform this job, looking at possible issues, because the
ICU4J module will not be easy to adapt to Java 1.1.8 (on MacOS with MRJ 2.5),
and because a full integration of ICU4J would import many classes that
LimeWire currently does not need (notably those related to NFKD/NFKC
composition, charset converters, and String parsing utilities) or that
already have a partial implementation in Java 1.1 (extended in 1.4, and that
may be extended in the future to include some features found in ICU4J,
already licensed by Sun and Apple in Java).
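
Purely as an illustration of how that compliance test could be run, here is a
sketch that checks the NFC constraints on the first three columns of the
NormalizationTest.txt file published on Unicode.org; UnicodeString.NFC()
stands for the converter to be integrated, not an existing class:

import java.io.BufferedReader;
import java.io.FileReader;

public class NormalizationTestRunner {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(
            new FileReader("NormalizationTest.txt"));
        String line;
        int failures = 0;
        while ((line = in.readLine()) != null) {
            int hash = line.indexOf('#');                 // strip trailing comments
            if (hash >= 0) line = line.substring(0, hash);
            line = line.trim();
            if (line.length() == 0 || line.startsWith("@")) continue; // part markers
            String[] cols = line.split(";");
            String c1 = parse(cols[0]), c2 = parse(cols[1]), c3 = parse(cols[2]);
            // conformance rule: c2 == NFC(c1) == NFC(c2) == NFC(c3)
            if (!c2.equals(UnicodeString.NFC(c1))
                    || !c2.equals(UnicodeString.NFC(c2))
                    || !c2.equals(UnicodeString.NFC(c3))) {
                failures++;
            }
        }
        in.close();
        System.out.println(failures + " NFC failures");
    }

    // each column is a space-separated list of hexadecimal code points
    private static String parse(String column) {
        StringBuilder sb = new StringBuilder();
        for (String cp : column.trim().split(" ")) {
            sb.appendCodePoint(Integer.parseInt(cp, 16));
        }
        return sb.toString();
    }
}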

2) Then, on MacOS/MacOSX *only*, force all strings returned by
File.getName() to their NFC form. For example:
String name = file.getName();
would be followed by:
if (CommonUtils.isAnyMac()) name = UnicodeString.NFC(name);
This change would be performed throughout the code. An alternative approach
would be to add a method in CommonUtils, and instead replace the above first
line by:
String name = CommonUtils.getFileName(file);
where the new method would call File.getName() and use the UnicodeString.NFC
converter (a sketch follows right after this step). This will solve most of
the issues related to the MacOS behavior (which neither MRJ 2.5 for MacOS 8/9
nor Java2 for MacOSX corrects for now). But there will still be
interoperability issues with other systems.
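
A sketch of that wrapper, assuming the existing CommonUtils.isAnyMac() helper
and the (still hypothetical) UnicodeString.NFC() converter from step 1:

// in CommonUtils:
public static String getFileName(java.io.File file) {
    String name = file.getName();
    // HFS+ exposes decomposed (FCD) names: recompose them to NFC on Mac only
    if (CommonUtils.isAnyMac()) {
        name = UnicodeString.NFC(name);
    }
    return name;
}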

3) Change the way we store and display the filenames in the library, so that
the physical filename reported by the filesystem can be distinct from the NFC
form we use to display and share them; or ensure that possible conflicts
(mainly on Unix, where string normalization is not performed accurately) will
be handled gracefully, to avoid the case of distinct files with the same
canonically equivalent filenames (see the sketch just after this step). In
that case, all strings received from the network would first need to be
canonicalized to their NFC form, as well as all strings entered in the Search
form or when renaming a file by user input in the Library.
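
Purely as an illustration (these classes do not exist in LimeWire, and
UnicodeString.NFC() is again the converter from step 1), the separation and
the conflict check could take this shape:

import java.io.File;
import java.util.HashMap;
import java.util.Map;

// Keeps the physical name (exactly as reported by the filesystem) separate
// from the canonical NFC form used for display, sharing and QRP hashing.
class SharedFileEntry {
    final File physicalFile;
    final String sharedName;

    SharedFileEntry(File f) {
        this.physicalFile = f;
        this.sharedName = UnicodeString.NFC(f.getName());
    }
}

class SharedLibrary {
    // Two physical files whose names are canonically equivalent map to the
    // same NFC key, so the conflict can be detected and handled gracefully.
    private final Map<String, SharedFileEntry> byCanonicalName =
        new HashMap<String, SharedFileEntry>();

    boolean addFile(File f) {
        SharedFileEntry entry = new SharedFileEntry(f);
        if (byCanonicalName.containsKey(entry.sharedName)) {
            return false; // a canonically equivalent name is already shared
        }
        byCanonicalName.put(entry.sharedName, entry);
        return true;
    }
}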

4) Discuss with the GDF the way to specify that QRP tables will be created
by hashing Unicode strings in a normalized form. Here we have 4 choices, with
distinct semantics with respect to search operations:

- a.1) The NFC form is the simplest to implement, but it does not provide
the often desirable feature for users who would like to find "cafe", "café",
"CAFE" or "CAFÉ" when searching for any of these keywords. Note that
supporting the NFC form already implies supporting the NFD form.

- a.2) The NFD form (with decomposed accents) can first be used to detect
and remove diacritics (i.e. all Unicode characters that have a non-zero
combining class in Unicode, i.e. non-starter characters) before converting
keywords to lowercase. Supporting it is good only in the case where all
combining characters are removed from the decomposed string (in that case,
the filtered NFD string also becomes a simpler NFC string).

- b.1) The NFKC form has some merits (as it creates compatibility
equivalences for minor distinctions such as A-ring and the Angstrom sign, or
the full-width/half-width variants of Japanese Hiragana/Katakana or Korean
Jamo characters). Note that supporting the NFKC form already implies
supporting the NFKD form.

- b.2) The NFKD form is probably the best form, as it combines the
advantages of NFD and NFKC for search operations. Supporting it is good only
in the case where all combining characters are removed from the decomposed
string (in that case, the filtered NFKD string also becomes a simpler NFD
string, and a simpler NFC string which is also an NFKC string).

I would advocate solution b.2, but this requires a second, larger conversion
table for NFKD (also needed for solution b.1), in addition to the conversion
table for NFD (also needed for NFC).

However, if we want to mask case differences, then the case-folding
operation needed in all 4 forms would require a large table too. For QRP and
search string matching, if case folding is preferable, then combining the
case folding conversion table with the NFKD table will not create a much
larger table.
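
For illustration, here is a minimal sketch of the b.2 pipeline (decompose
with NFKD, strip all combining marks, then fold case), written against the
java.text.Normalizer of modern Java; the real implementation would use the
reduced NFKD tables discussed above and a proper case-folding table instead
of toLowerCase():

import java.text.Normalizer;
import java.util.Locale;

public class KeywordCanonicalizer {
    public static String canonicalize(String keyword) {
        String nfkd = Normalizer.normalize(keyword, Normalizer.Form.NFKD);
        StringBuilder sb = new StringBuilder(nfkd.length());
        for (int i = 0; i < nfkd.length(); i++) {
            char c = nfkd.charAt(i);
            // drop combining marks (approximated by the Mark general categories)
            int type = Character.getType(c);
            if (type == Character.NON_SPACING_MARK
                    || type == Character.COMBINING_SPACING_MARK
                    || type == Character.ENCLOSING_MARK) continue;
            sb.append(c);
        }
        return sb.toString().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(canonicalize("CAFÉ")); // -> "cafe"
        System.out.println(canonicalize("café")); // -> "cafe"
    }
}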

Fast conversion is possible using tables compacted with a "Trie" (with
Unicode 3.2, such a table requires only about 8000 useful entries out of the
1.1 million possible Unicode code points and sequences, compacted in a Trie
with a table of about 10000 ints).
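
A minimal sketch of such a two-stage lookup (the index and data arrays here
are hypothetical and would be generated from the Unicode files described in
the next paragraph; each block covers 32 consecutive code points, so all
unassigned ranges can share a single block):

public final class CompactTrie {
    private final char[] index; // one entry per 32-code-point block
    private final int[] data;   // packed normalization data, blocks may be shared

    public CompactTrie(char[] index, int[] data) {
        this.index = index;
        this.data = data;
    }

    // Returns the normalization datum attached to the given code point.
    public int lookup(int codePoint) {
        int block = index[codePoint >> 5];            // which shared block
        return data[(block << 5) | (codePoint & 31)]; // offset within the block
    }
}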

Of course, any implementation of a Trie table will not be computed by hand.
It must be generated, either with a spreadsheet or with a program, by parsing
the "Unicode Character Database" text file (which contains all the
decompositions and their canonical/compatible status), the combining class
data (needed for correct canonical reordering of diacritics in decomposed
strings), and the "Composition Exclusion" text file (described in UTR #15,
and needed for the stability of normalized strings across versions of
Unicode).

I have already computed such a table and compared it with the ICU4J
implementation (which for now only complies with Unicode 3.1, already
integrating the post-publication corrections, but does not handle characters
added in Unicode 3.2; supporting those in applications would require an
updated and compliant ICU4J implementation). However, I note that ICU4J was
designed to use the same tables as ICU4C (they are imported from it), which
perform better in C/C++ than in 100% pure Java (where their construction time
is significant). It also has too many classes for our purposes.

I think that we should first concentrate on implementing NFC/NFD and, later,
NFKD/NFKC for search enhancements, with an updated QRP table exchange format
where the UltraPeer and the Leaf node can negotiate which format they both
support and can agree upon, relative to their common hashing algorithm
(truncation compatible with ISO-8859-1, Unicode NFD packed to NFC, Unicode
NFKD packed to NFKC)...

It is also notable that a case-folding algorithm will need to be defined
more formally for QRP, as this is a more complex issue that other servents
may not want to discuss or implement for now; but they will change their
minds in the future, as support for normalization forms and case-folding will
be integrated in all OS'es and in most C and Java libraries: complete support
for Unicode 3.2 is already required by JIS X standards in Japan or GB
standards in Taiwan, and is mandatory for ALL systems sold in China to
conform to the GB18030 standard that has superseded the previous GBK and
older GB2312 standards.

Many other national standardization bodies will also require it in operating
systems (notably in Europe, where support for all official languages of the
European Union is needed in most applications, including languages with
non-Latin scripts such as Greek, or future members using a Cyrillic alphabet,
or rarer Latin letters such as those used in Turkish, Romanian, or Maltese).
The consequence of these requirements is that the Unicode normalization forms
will soon be available everywhere, and there will be no good reason not to
support them.

-- Philippe.

