
Re: [idn] question about cidnuc




>i made up two examples for the first two cases.
>are they correct?
>
>   1) no compression: 0x0061 1100 1162
>   2) compressed/one-octet header : 0x1100 1162 -> 0x 11 00 62
>   3) compressed/two-octet header:  examples???

This is not correct. There is only one way to encode any input, as required 
by the IDN requirements document. In cidnuc, section 2.4.1, Step 1 says 
that all the upper octets *must* match in order to use the greater 
compression. In the case above, the upper octet of the first character 
(0x00) does not match that of the other two (0x11). Thus, the output of the 
compression step is 0xD8 followed by the unchanged input octets, or 
0xD8006111001162.

Yes, that's not a compression, but a slight expansion. It is not expected 
that many names will contain letters from widely-disparate scripts, but 
even those that do only suffer a one-octet expansion for the whole string. 
If the example above were instead 0x1161 1100 1162 (that is, all from Korean 
Hangul Jamo), which is more likely, the compressed string would be the 
shared upper octet 0x11 followed by the three lower octets, or 0x11610062.
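
To make the two cases above concrete, here is a minimal Python sketch of 
the compression step as described in this message only. It assumes the 
input is a list of 16-bit code units, that a string whose upper octets all 
match is emitted as that one octet followed by the lower octets, and that 
anything else is emitted as 0xD8 followed by the unchanged octets. The 
function name compress_sketch is made up for illustration, and the 
two-octet header case and any other rules in the cidnuc draft are not 
covered.

```python
def compress_sketch(code_units):
    """Illustrative only: covers just the two cases discussed in this message.

    code_units is a list of 16-bit integers (one per character).
    """
    upper_octets = {unit >> 8 for unit in code_units}

    if len(upper_octets) == 1:
        # Every character shares the same upper octet: emit it once as a
        # one-octet header, then only the lower octet of each character.
        header = upper_octets.pop()
        return bytes([header]) + bytes(unit & 0xFF for unit in code_units)

    # Upper octets differ: emit 0xD8, then both octets of every character
    # unchanged (a one-octet expansion of the input).
    out = bytearray([0xD8])
    for unit in code_units:
        out += unit.to_bytes(2, "big")
    return bytes(out)


mixed = compress_sketch([0x0061, 0x1100, 0x1162])   # Latin 'a' plus Jamo
jamo = compress_sketch([0x1161, 0x1100, 0x1162])    # all Hangul Jamo
print(mixed.hex().upper())  # D8006111001162
print(jamo.hex().upper())   # 11610062
```

The sketch does not handle an input whose shared upper octet is itself 
0xD8; this message does not discuss that case.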

Just to be clear, the compression algorithm doesn't do much for short 
strings. Its purpose is to help long strings that might otherwise hit the 
63-character limit after encoding with Base64. The script you gave, Hangul 
Jamo, is a prime example of where cidnuc's compression helps. In downcasing 
UTF-8, the limit for Hangul Jamo is 8 characters; in UTF-5, it is 15 
characters; in cidnuc, it is 37 characters.

--Paul Hoffman, Director
--Internet Mail Consortium