Re: [idn] Requirements I-D



comments below
----- Original Message -----
From: "Paul Hoffman / IMC" <phoffman@imc.org>
To: <idn@ops.ietf.org>
Sent: Wednesday, November 01, 2000 20:23
Subject: Re: [idn] Requirements I-D


> At 8:45 AM +0800 11/2/00, Mark Davis wrote:
> >  I was trying to figure out RACE compression, and wanted to make sure my
> >  understanding is correct. I also have some suggestions (I don't know how
> >  cast in stone RACE is...):
>
> It is not cast in stone, and neither are any of the other ACE proposals.
>
> >  A. RACE
> >
> >  As far as I can tell, this is what happens in RACE compression as
> >  currently written:
> >
> >  Determine the set of all high octets (first of pairs).
> >  If that set has more than 2 members, or if 00 is not in the set, the
> >  output is D8 + input.
> >  Call the largest element of the set U1.
>
> If by "largest", you mean "with the highest value", that is correct;
> if you meant "with the most characters with this as the first octet",
> incorrect.

Yes, I mean largest, not most frequent.

>
> >  If U1 = D8..DC, return error.*
> >  If <00, 99> is in the input pairs, return an error.*
> >  //  U1 may be zero
> >  Otherwise the output is U1 then the following encoding of pairs:
> >  - <U1, FF> => FF 99
> >  - <U1, XX> => XX
> >  - <00, XX> => FF XX
> >
> >  * I'd suggest adding explicit statements to the
> >  compression/decompression process to return errors.
>
> Will do!
>
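
For reference, the steps quoted above can be sketched in Python roughly as
follows. This is only an illustration of the procedure as summarized in this
message, not the text of the RACE draft and not the RACE.js code. The function
name race_compress and the use of ValueError for the error cases are
assumptions made here. The input is taken to be a byte string of big-endian
16-bit pairs.

```python
def race_compress(pairs: bytes) -> bytes:
    """Compress <high, low> octet pairs following the steps quoted above."""
    if len(pairs) % 2:
        raise ValueError("input must be a whole number of octet pairs")

    highs = set(pairs[0::2])  # the set of all high octets
    # More than two rows, or row 00 absent: emit D8 + input, uncompressed.
    if len(highs) > 2 or 0x00 not in highs:
        return b"\xd8" + pairs

    u1 = max(highs)  # highest value, not most frequent; may be zero
    if 0xD8 <= u1 <= 0xDC:
        raise ValueError("U1 in D8..DC")

    out = bytearray([u1])
    for hi, lo in zip(pairs[0::2], pairs[1::2]):
        if hi == 0x00 and lo == 0x99:
            raise ValueError("<00, 99> is in the input pairs")
        if hi == u1:
            # <U1, FF> => FF 99, <U1, XX> => XX
            out += b"\xff\x99" if lo == 0xFF else bytes([lo])
        else:
            # <00, XX> => FF XX
            out += bytes([0xFF, lo])
    return bytes(out)


if __name__ == "__main__":
    # U+0430 followed by "a": rows 04 and 00, so U1 = 04.
    print(race_compress("\u0430a".encode("utf-16-be")).hex())  # 0430ff61
```

Because the <U1, XX> rules are tested first, an all-row-zero input has U1 = 00
and every pair is written as its low octet, except <00, FF>, which becomes
FF 99.
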
> >  This mechanism does do one odd thing with an all Latin-1 string:
> >  >.. FF.. => ..FF 99..
>
> I needed to choose one special case character, and U+xxFF was the
> least-commonly appearing character in all rows. In an all-Latin-1
> string, this is "small letter y with diaeresis". I think that even
> French folks would say that this falls into the category of
> "uncommon".

I understand the reason for it. That was really a lead-in to the following
suggestion:

>
> >  I would suggest a slight change to make Latin-1 simple, as follows
> >  (marked with ***)
> >
> >  B. RACE+LATIN1
> >  Determine the set of all high octets (first of pairs).
> >  If that set has more than 2 members, or if 00 is not in the set, the
> >  output is D8 + input.
> >  *** If the set has 1 member (i.e. is {00}), the output is 00 + input
> >  low octets.*** // latin-1 exactly
> >  Call the largest element of the set U1.
> >  If U1 = D8..DC, return error.
> >  If <00, 99> is in the input pairs, return an error.
> >  Otherwise the output is U1 then the following encoding of pairs:
> >  - <U1, FF> => FF 99
> >  - <U1, XX> => XX
> >  - <00, XX> => FF XX
>
> One change to this would have to be made: U+0099 MUST NOT ever be
> encoded, or you will have to put another step in the decoding process
> to check for it.

Well, for consistency you would check both cases, although it round-trips
with no problem in the *** case.

>
> In summary, you are making a special case for all-row-zero names.
> That comes at the expense of added complexity: two more conditions in
> the compression scheme. I would prefer simplicity, but am happy to
> make this change if other folks think the additional complexity is
> worth the space savings for Latin-1 names.

If you look at the code, it is really just one test, then copying all the
low bytes. And by making the test, all Latin-1 strings become simpler. The
code I used is at http://www.macchiato.com/unicode/RACE.js; the change for
Latin-1 is under the fixLatin1 flag.
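
A rough Python sketch of that one test, with a matching decoder, might look
like the following. It is not a transcription of RACE.js. The names
compress_latin1_variant and decompress_latin1_variant are hypothetical, and
errors are reported with ValueError as an assumption. The <00, 99> check is
placed where listing B puts it, after the all-row-zero test, so 99 is copied
through unchanged in that case. Checking it in both cases, for consistency,
would only mean moving the check ahead of the early return.

```python
def compress_latin1_variant(pairs: bytes) -> bytes:
    """Listing B: RACE as quoted above, plus the all-row-zero shortcut."""
    if len(pairs) % 2:
        raise ValueError("input must be a whole number of octet pairs")
    highs = set(pairs[0::2])
    if len(highs) > 2 or 0x00 not in highs:
        return b"\xd8" + pairs
    if highs == {0x00}:
        # The one extra test: header 00, then the low octets copied as-is.
        return b"\x00" + pairs[1::2]
    u1 = max(highs)
    if 0xD8 <= u1 <= 0xDC:
        raise ValueError("U1 in D8..DC")
    out = bytearray([u1])
    for hi, lo in zip(pairs[0::2], pairs[1::2]):
        if hi == 0x00 and lo == 0x99:
            raise ValueError("<00, 99> is in the input pairs")
        if hi == u1:
            out += b"\xff\x99" if lo == 0xFF else bytes([lo])
        else:
            out += bytes([0xFF, lo])
    return bytes(out)


def decompress_latin1_variant(data: bytes) -> bytes:
    """Inverse of compress_latin1_variant."""
    if not data:
        raise ValueError("empty input")
    u1, body = data[0], data[1:]
    if u1 == 0xD8:
        return bytes(body)  # stored uncompressed
    out = bytearray()
    if u1 == 0x00:
        # Header 00 only arises from the shortcut: every octet is <00, XX>.
        for lo in body:
            out += bytes([0x00, lo])
        return bytes(out)
    i = 0
    while i < len(body):
        if body[i] != 0xFF:
            out += bytes([u1, body[i]])  # XX => <U1, XX>
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("dangling FF escape")
        nxt = body[i + 1]
        # FF 99 => <U1, FF>, FF XX => <00, XX>
        out += bytes([u1, 0xFF]) if nxt == 0x99 else bytes([0x00, nxt])
        i += 2
    return bytes(out)
```

With this shortcut, an all-Latin-1 name containing U+00FF is written as FF
rather than FF 99, and the decoder needs one extra branch on the 00 header.
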

>
> >  C. RACE+RUNS
> >  Determine the set of all high octets (first of pairs).
> >  If that set has more than 2 members, or if 00 is not in the set, the
> >  output is D8 + input.
> >  If the set has 1 member (i.e. is {00}), the output is 00 + input low
> >  octets. // **** latin-1 exactly
> >  Call the largest element of the set U1.
> >  If U1 = D8..DC, return error.
> >  Otherwise output is U1, then repeat the following 4 steps until done.
> >  1 Find the number of pairs having U1 as the first octets.
> >  2 Output that number, followed by all those low octets.
> >  3 Find the number of pairs having 00 as the first octets.
> >  4 Output that number, followed by all those low octets.
> >  5 Go back to #1
> >
> >  Cost: 1 byte per character + 1 byte per transition.
> >  The numbers in 2,4 will always fit in a byte, since we have less than
> >  256 chars possible.
> >  Worst case is 2 * number of chars (plus 1 for header).
>
> This is much more complex than RACE or your first proposal. It seems
> to only be of value in names which have substrings of more than two
> characters that switch back and forth between rows. It seems like it
> has a high likelihood to be mis-implemented for a small value. Are
> there really many names for which this is useful? Examples would be
> helpful here.

As I said, the code is actually not much more complex, if at all. The advantage
of run-length encoding is that you don't have to futz around with special
quoting mechanisms, like FF and FF 99. The advantage would be for
Latin-based languages that do not have all their characters in the first
couple of blocks, or Latin-based languages that want to include a
punctuation character (up in 2000) or symbol.
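
To make the comparison concrete, here is a minimal Python sketch of the
run-length variant (proposal C above) with a decoder. It follows the steps as
listed in this thread and is not taken from RACE.js. The function names, the
ValueError error handling, and the cap of 255 on a single run are assumptions
for illustration.

```python
def race_runs_compress(pairs: bytes) -> bytes:
    """Proposal C: alternating runs of row-U1 and row-00 low octets."""
    if len(pairs) % 2:
        raise ValueError("input must be a whole number of octet pairs")
    highs = set(pairs[0::2])
    if len(highs) > 2 or 0x00 not in highs:
        return b"\xd8" + pairs
    if highs == {0x00}:
        return b"\x00" + pairs[1::2]  # latin-1 exactly
    u1 = max(highs)
    if 0xD8 <= u1 <= 0xDC:
        raise ValueError("U1 in D8..DC")
    out = bytearray([u1])
    n = len(pairs) // 2
    i, row = 0, u1  # the first run is always a (possibly empty) U1 run
    while i < n:
        j = i
        # Extend the run while the high octet matches; cap it at one byte.
        while j < n and pairs[2 * j] == row and j - i < 255:
            j += 1
        out.append(j - i)                    # run length
        out += pairs[2 * i + 1 : 2 * j : 2]  # the low octets of that run
        i = j
        row = 0x00 if row == u1 else u1      # switch rows for the next run
    return bytes(out)


def race_runs_decompress(data: bytes) -> bytes:
    """Inverse of race_runs_compress."""
    if not data:
        raise ValueError("empty input")
    u1, body = data[0], data[1:]
    if u1 == 0xD8:
        return bytes(body)
    out = bytearray()
    if u1 == 0x00:
        for lo in body:
            out += bytes([0x00, lo])
        return bytes(out)
    i, row = 0, u1
    while i < len(body):
        count = body[i]
        run = body[i + 1 : i + 1 + count]
        if len(run) != count:
            raise ValueError("truncated run")
        for lo in run:
            out += bytes([row, lo])
        i += 1 + count
        row = 0x00 if row == u1 else u1
    return bytes(out)
```

No octet value is reserved here, so there is no FF or FF 99 quoting and no
<00, 99> restriction. A name that starts in row 00 costs one extra byte for
the empty leading U1 run.
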

I'm not pushing hard on any of this. RACE is sufficient as it stands to get
the job done, and it may not be worth changing it.

>
> --Paul Hoffman, Director
> --Internet Mail Consortium
>
>