
Re: [idn] Comments on protocol drafts



Magnus (and others),

Since I've been pulled into a formal management role with this
proto-WG ("technical advisor", or something), I'm going to slip
into my usual mode with such things, which is watching but
mostly being silent except to try to clarify and focus things.
That said,...

--On Monday, 07 February, 2000 07:01 +0100 C C Magnus Gustavsson 
<mag@lysator.liu.se> wrote:

> Suppose the organisation "ishockeyförening" ("ishockeyf" + "o"
> with umlaut + "rening") owns a domain. The name contains 16
> characters which, if my quick and dirty implementation is
> correct, will be 18 bytes when compressed with the algorithm
> in 2.4 (Hoffman). Encoded with base32 it's 30 bytes and with
> "ph6" prepended, we're up from 16 to 33 bytes just because one
> of the characters is not ASCII.
>...
> Therefore, I propose the following requirement:
>
>    If an encoding is used, the ASCII characters in a string
>    must not be encoded in different ways depending on what
>    other characters the string contains.
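[The expansion described in the quoted message can be reproduced in rough outline.  A minimal sketch, using unpadded base32 over plain UTF-8 bytes -- the draft's compression in section 2.4 gives slightly different byte counts -- with the "ph6" prefix from the quoted message:]

```python
import base64

def encode(name: str) -> str:
    """Hypothetical ACE for illustration: "ph6" prefix plus unpadded,
    lowercased base32 of the UTF-8 bytes.  (The actual draft compresses
    before base32, so its counts differ slightly from these.)"""
    raw = base64.b32encode(name.encode("utf-8")).decode("ascii")
    return "ph6" + raw.rstrip("=").lower()

# One non-ASCII character makes the whole name subject to encoding:
print(len(encode("ishockeyförening")))  # 16 characters -> 31 here
print(len(encode("ishockeyforening")))  # all-ASCII variant -> 29
```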

There is a tradeoff here, and it is imposed by the architecture
of UTF-8 and any of its close relatives.  Since "bytes" come in
integral quantities, there are obviously step functions
involved, but these things all provide worse encodings the
further out (in numerical character code) in the code set one
goes.  Given that, one can draw two possible inferences from
your proposed rule:

(i) The only permitted encoding should assign exactly the same
width to each "character", no matter what that character is.
That, of course, keeps ASCII characters (and everything else
that doesn't require multiple-code-point composition) coded the
same way regardless of context.  The standard length could be 20 
or 24 or 32 bits, depending on one's assumptions and religion.

(ii) We believe that some set of languages, or character sets,
_deserve_ to be treated very badly by the encoding scheme, just
because they came along a little late in the Unicode design
process (or for worse reasons).  If you preserve "ASCII is
always coded the same way, and that way is always as short as
possible" by, e.g., using UTF-8 always, then you are saying, in
essence, that Roman-derived alphabets are so important that it
is just as reasonable to always force, e.g., CJK characters
--even for extremely common words-- into three or four (or
worse) octet codings.
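To make the asymmetry concrete, here are the per-character widths
UTF-8 assigns (these follow directly from the UTF-8 definition):

```python
# UTF-8 bytes per character: ASCII stays at one byte, a Latin letter
# with a diacritic takes two, and a common CJK character takes three.
for ch in ("a", "ö", "中"):
    print(ch, len(ch.encode("utf-8")))
```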

Now, it is possible to advocate the latter position, and I think 
anyone who is convinced it is reasonable should do so.  I
personally find it a little offensive and have tended to adopt a 
rule of thumb for reasonableness that goes something like this...

       * character sets and codings that are incorporated into
       the basic Internet infrastructure should be as language
       and culturally neutral as possible.

       * "bit reversibility" is a simple test for part of such
       reasonableness.  I.e., for "char-coded-text" representing
       a reasonable selection of names or name-like strings in
       one language, and if "EncodingX" is suggested, then if
            EncodingX ( char-coded-text )
       is believed to yield reasonable results, then
            EncodingX ( Size-of-coding - char-coded-text )
       should yield results that are roughly equally reasonable.

I.e., for traditional 16-bit Unicode, you subtract the coding of 
each "character" from 2**16 and then apply the UTF-8
transformation.
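That rule of thumb can be sketched directly.  This assumes a plain
16-bit code subtracted from 2**16, and ignores the surrogate range:

```python
def utf8_len(cp: int) -> int:
    # Bytes UTF-8 assigns to a code point, per the UTF-8 definition.
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4

def reversal_test(text: str) -> tuple[int, int]:
    """UTF-8 size of a string versus its 2**16-complement mirror."""
    direct = sum(utf8_len(ord(ch)) for ch in text)
    mirrored = sum(utf8_len(2**16 - ord(ch)) for ch in text)
    return direct, mirrored

# ASCII mirrors to the top of the 16-bit space, so one byte per
# character becomes three -- the test fails badly for UTF-8.
print(reversal_test("name"))  # (4, 12)
```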

If the test fails, then it seems to me that one is essentially
making a "my language is more important than his language, and
will be forever" argument.   Again, that might be a plausible
argument (although I'd be embarrassed to make it in public),
but, if people want to make it, let's not try to hide it
(however unintentionally) behind "minimal or constant codings
for ASCII" arguments.

       john

p.s. I don't believe that the internationalization problems
require us to discard the existing infrastructure, nor that
there is any possible way we could get "flag day" changes as
sweeping as this would be deployed in this century.  That makes
all of this quite challenging, but worth getting right.   And we 
need to figure out what "right" means, ideally in a way that
doesn't force us to re-do it [again] in a half-dozen years.