
Re: [idn] Requirements I-D



At 8:45 AM +0800 11/2/00, Mark Davis wrote:
>  I was trying to figure out RACE compression, and wanted to make sure my
>  understanding
>  is correct. I also have some suggestions (I don't know how cast in stone
>  RACE is...):

It is not cast in stone, and neither are any of the other ACE proposals.

>  A. RACE
>
>  As far as I can tell, this is what happens in RACE compression as currently
>  written:
>
>  Determine the set of all high octets (first of pairs).
>  If that set has more than 2 members, or if 00 is not in the set, the
>  output is D8 + input.
>  Call the largest element of the set U1.

If by "largest", you mean "with the highest value", that is correct; 
if you meant "with the most characters with this as the first octet", 
incorrect.

>  If U1 = D8..DC, return error.*
>  If <00, 99> is in the input pairs, return an error.*
>  //  U1 may be zero
>  Otherwise the output is U1 then the following encoding of pairs:
>  - <U1, FF> => FF 99
>  - <U1, XX> => XX
>  - <00, XX> => FF XX
>
>  * I'd suggest adding explicit statements to the
>  compression/decompression process to return errors.

Will do!
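
For readers following along, the steps quoted above can be sketched in Python. This is only an illustration of the procedure as Mark describes it in this message, not the text of the RACE draft; the function name race_compress and the representation of the input as a list of (high, low) octet tuples are assumptions made for the sketch.

```python
def race_compress(pairs):
    """Sketch of the compression steps quoted above.

    `pairs` is a list of (high_octet, low_octet) tuples, one per
    16-bit character. The name and input shape are hypothetical.
    """
    highs = {high for high, _ in pairs}

    # More than two distinct high octets, or row 00 absent:
    # no compression, output D8 followed by the raw input.
    if len(highs) > 2 or 0x00 not in highs:
        return bytes([0xD8]) + bytes(o for pair in pairs for o in pair)

    u1 = max(highs)  # "largest" = highest value; may be zero
    if 0xD8 <= u1 <= 0xDC:
        raise ValueError("U1 in D8..DC is not allowed")
    if (0x00, 0x99) in pairs:
        raise ValueError("<00, 99> cannot be encoded")

    out = bytearray([u1])
    for high, low in pairs:
        if high == u1:
            if low == 0xFF:
                out += b"\xff\x99"       # <U1, FF> => FF 99
            else:
                out.append(low)          # <U1, XX> => XX
        else:
            out += bytes([0xFF, low])    # <00, XX> => FF XX
    return bytes(out)


# An all-row-zero string: FF is escaped as FF 99, as discussed below.
print(race_compress([(0x00, 0x61), (0x00, 0xFF)]).hex())
```

Note that when U1 is zero every pair takes the first branch, which is where the FF => FF 99 behaviour for all-Latin-1 strings comes from.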

>  This mechanism does do one odd thing with an all Latin-1 string:
>  .. FF .. => .. FF 99 ..

I needed to choose one special case character, and U+xxFF was the 
least-commonly appearing character in all rows. In an all-Latin-1 
string, this is "small letter y with diaeresis". I think that even 
French folks would say that this falls into the category of 
"uncommon".

>  I would suggest a slight change to make Latin-1
>  simple, as
>  follows (marked with ***)
>
>  B. RACE+LATIN1
>  Determine the set of all high octets (first of pairs).
>  If that set has more than 2 members, or if 00 is not in the set, the
>  output is D8 + input.
>  *** If the set has 1 member (e.g. is {00}), the output is 00 + input low
>  octets.*** // latin-1 exactly
>  Call the largest element of the set U1.
>  If U1 = D8..DC, return error.
>  If <00, 99> is in the input pairs, return an error.
>  Otherwise the output is U1 then the following encoding of pairs:
>  - <U1, FF> => FF 99
>  - <U1, XX> => XX
>  - <00, XX> => FF XX

One change to this would have to be made: U+0099 MUST NOT ever be 
encoded, or you will have to put another step in the decoding process 
to check for it.

In summary, you are making a special case for all-row-zero names. 
That comes at the expense of added complexity: two more conditions in 
the compression scheme. I would prefer simplicity, but am happy to 
make this change if other folks think the additional complexity is 
worth the space savings for Latin-1 names.
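
To make the trade-off concrete, here is a rough Python sketch of the RACE+LATIN1 variant. It follows Mark's steps as quoted, with one assumption of mine: the check for U+0099 is moved ahead of the Latin-1 shortcut, which is one possible reading of the "MUST NOT ever be encoded" comment above. The function name and the (high, low) tuple input are hypothetical.

```python
def race_latin1_compress(pairs):
    """Sketch of the RACE+LATIN1 variant quoted above.

    `pairs` is a list of (high_octet, low_octet) tuples. Rejecting
    <00, 99> before the Latin-1 shortcut is an assumption here.
    """
    highs = {high for high, _ in pairs}

    if len(highs) > 2 or 0x00 not in highs:
        return bytes([0xD8]) + bytes(o for pair in pairs for o in pair)

    # Assumed placement: refuse U+0099 before any compressed form.
    if (0x00, 0x99) in pairs:
        raise ValueError("<00, 99> cannot be encoded")

    # The added special case: only row 00 is present, so emit
    # 00 followed by the low octets unchanged (Latin-1 exactly).
    if len(highs) == 1:
        return bytes([0x00]) + bytes(low for _, low in pairs)

    u1 = max(highs)
    if 0xD8 <= u1 <= 0xDC:
        raise ValueError("U1 in D8..DC is not allowed")

    out = bytearray([u1])
    for high, low in pairs:
        if high == u1:
            out += b"\xff\x99" if low == 0xFF else bytes([low])
        else:
            out += bytes([0xFF, low])    # <00, XX> => FF XX
    return bytes(out)


# With the shortcut, FF in an all-row-zero name is no longer escaped.
print(race_latin1_compress([(0x00, 0x61), (0x00, 0xFF)]).hex())
```

Compared with the unmodified steps, the only saving is one octet per U+00FF in an all-row-zero name; names that mix row 00 with another row are encoded exactly as before.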

>  C. RACE+RUNS
>  Determine the set of all high octets (first of pairs).
>  If that set has more than 2 members, or if 00 is not in the set, the
>  output is D8 + input.
>  If the set has 1 member (e.g. is {00}), the output is 00 + input low
>  octets. // **** latin-1 exactly
>  Call the largest element of the set U1.
>  If U1 = D8..DC, return error.
>  Otherwise output is U1, then repeat the following 4 steps until done.
>  1 Find the number of pairs having U1 as the first octets.
>  2 Output that number, followed by all those low octets.
>  3 Find the number of pairs having 00 as the first octets.
>  4 Output that number, followed by all those low octets.
>  5 Go back to #1
>
>  Cost: 1 byte per character + 1 byte per transition.
>  The numbers in 2,4 will always fit in a byte, since we have less than 256
>  chars possible.
>  Worst case is 2 * number of chars (plus 1 for header).

This is much more complex than RACE or your first proposal. It seems 
to only be of value in names which have substrings of more than two 
characters that switch back and forth between rows. It seems to have 
a high likelihood of being mis-implemented for little benefit. Are 
there really many names for which this is useful? Examples would be 
helpful here.
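
As a starting point for comparing sizes, the RACE+RUNS steps could be sketched in Python as follows. This is my reading of the quoted proposal, not a tested design: I take "the number of pairs" in steps 1 and 3 to mean the length of the next consecutive run, and a name that begins in row 00 produces a zero-length first run. The function name and the input values in the example are made up for illustration.

```python
def race_runs_compress(pairs):
    """Sketch of the RACE+RUNS proposal quoted above.

    `pairs` is a list of (high_octet, low_octet) tuples. Assumes
    fewer than 256 characters, so each run length fits in one octet.
    """
    highs = {high for high, _ in pairs}

    if len(highs) > 2 or 0x00 not in highs:
        return bytes([0xD8]) + bytes(o for pair in pairs for o in pair)
    if len(highs) == 1:
        # Only row 00 present: Latin-1 exactly.
        return bytes([0x00]) + bytes(low for _, low in pairs)

    u1 = max(highs)
    if 0xD8 <= u1 <= 0xDC:
        raise ValueError("U1 in D8..DC is not allowed")

    out = bytearray([u1])
    i = 0
    row = u1                      # runs alternate: U1, 00, U1, ...
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == row:
            j += 1
        out.append(j - i)         # run length (may be 0 for the first run)
        out += bytes(low for _, low in pairs[i:j])
        i = j
        row = 0x00 if row == u1 else u1
    return bytes(out)


# Hypothetical input: two row-30 characters, one row-00, one row-30.
name = [(0x30, 0x42), (0x30, 0x44), (0x00, 0x2D), (0x30, 0x46)]
print(race_runs_compress(name).hex())
```

For this made-up input the output is 8 octets (header, three run lengths, four low octets). The nested loop and the alternating row state are the parts that strike me as easy to get wrong.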

--Paul Hoffman, Director
--Internet Mail Consortium