Re: [idn] question about cidnuc

To: idn@ops.ietf.org
Subject: Re: [idn] question about cidnuc
From: Paul Hoffman / IMC <phoffman@imc.org>
Date: Fri, 10 Mar 2000 07:46:54 -0800
Delivery-date: Fri, 10 Mar 2000 07:47:13 -0800
Envelope-to: idn-data@psg.com


>i made up two examples for the first two cases.
>are they correct?
>
>   1) no compression: 0x0061 1100 1162
>   2) compressed/one-octet header : 0x1100 1162 -> 0x 11 00 62
>   3) compressed/two-octet header:  examples???

This is not correct. There is only one way to encode any input, as required 
by the IDN requirements document. In cidnuc, section 2.4.1, Step 1 says 
that all the upper octets *must* match in order to use the greater 
compression. In the case above, the upper octet of the first character 
(0x00) does not match that of the other two (0x11). Thus, the output of 
the compression step is 0xD8006111001162.

Yes, that's not a compression, but a slight expansion. It is not expected 
that many names will contain letters from widely disparate scripts, but 
even those that do suffer only a one-octet expansion for the whole string. 
If the example above were instead 0x1161 1100 1162 (that is, all from Korean 
Hangul Jamo), which is more likely, the compressed string would be 0x11610062.
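
To make the two cases above concrete, here is a minimal sketch in Python. 
It is reconstructed only from the two examples in this message, not from 
the full text of the draft: the function name compress_step and the 
constant name are made up for illustration, and the two-octet header case 
from the question is not covered at all.

```python
# Sketch of the compression step as illustrated by the two examples in
# this message. Names are hypothetical; only the byte values come from
# the examples. The two-octet header case is NOT modeled here.

NO_COMPRESSION_MARKER = 0xD8  # first octet of the uncompressed example output


def compress_step(units):
    """Take a list of 16-bit code units and return the encoded octets."""
    if not units:
        raise ValueError("empty input")
    if any(not 0 <= u <= 0xFFFF for u in units):
        raise ValueError("code units must fit in 16 bits")

    upper_octets = {u >> 8 for u in units}
    if len(upper_octets) == 1:
        # All upper octets match: emit the shared upper octet once,
        # followed by the lower octet of each code unit.
        shared = units[0] >> 8
        return bytes([shared]) + bytes(u & 0xFF for u in units)

    # Upper octets differ: emit the marker, then every code unit in full
    # (big-endian), which costs one extra octet overall.
    out = bytearray([NO_COMPRESSION_MARKER])
    for u in units:
        out += u.to_bytes(2, "big")
    return bytes(out)


if __name__ == "__main__":
    print(compress_step([0x0061, 0x1100, 0x1162]).hex())  # mixed upper octets
    print(compress_step([0x1161, 0x1100, 0x1162]).hex())  # all upper octets 0x11
```

Because the choice between the two forms depends only on whether the upper 
octets match, a given input always produces exactly one output, which is 
the point of the "only one way to encode" requirement.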

Just to be clear, the compression algorithm doesn't do much for short 
strings. Its purpose is to help long strings that might otherwise hit the 
63-character limit after encoding with Base64. The script you gave, Hangul 
Jamo, is a prime example of where cidnuc's compression helps. In downcasing 
UTF-8, the limit for Hangul Jamo is 8 characters; in UTF-5, it is 15 
characters; in cidnuc, it is 37 characters.

--Paul Hoffman, Director
--Internet Mail Consortium
