Re: [idn] Fw: Moving Towards UTF8 vs ASCII(ACE) Forever



> Donald Eastlake 3rd <dee3@torque.pothole.com> wrote on the IETF list:
>
> > There is now a standard way to encode URIs containing arbitrary
> > UNICODE characters. This is described in RFC 3275 (which is
> > currently a Draft Standard), in Section 4.3.3.1, and in the
> > corresponding W3C document and has appeared in other W3C documents,
> > for example XML Base.
>
> So U+00E1 LATIN SMALL LETTER A WITH ACUTE (á), which is 0xC3 0xA1 in
> UTF-8, is encoded as
> "%C3%A1" (six bytes) according to RFC 3275.  All BMP characters above
> U+07FF, including all CJK characters, take three UTF-8 bytes and thus
> nine RFC 3275 bytes.
>
> I thought CJK users and others wanted *better* compression.
>
> (No, David, I know you're not all the same person.  I heard lots of
> voices saying the same thing.)

% is not among the characters previously allowed in domain names anyway, so
why turn each CJK character into 9 bytes with %-escaping instead of just
using the 3 bytes of UTF-8?
(This is only my own opinion, not that of all CJK users :>. I wish my voice
could represent all CJK users and be agreed with by them.)
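
The byte counts quoted above can be checked with a small sketch. It assumes that Python's urllib.parse.quote, which %-escapes the UTF-8 bytes of a string, is a fair stand-in for the escaping described in RFC 3275 as far as non-ASCII characters go. The helper name `sizes` and the sample CJK character U+4E2D are illustrative choices only, not part of any specification.

```python
from urllib.parse import quote

def sizes(ch):
    """Return (raw UTF-8 byte count, %-escaped byte count) for a string."""
    utf8 = ch.encode("utf-8")
    # safe="" so that nothing is left unescaped by default; non-ASCII
    # characters are converted to UTF-8 and each byte becomes %HH.
    escaped = quote(ch, safe="")
    return len(utf8), len(escaped)

# U+00E1 (a with acute) and U+4E2D (an arbitrary CJK character above U+07FF)
for ch in ("\u00e1", "\u4e2d"):
    raw, esc = sizes(ch)
    print(f"U+{ord(ch):04X}: {raw} UTF-8 bytes, {esc} escaped bytes")
```

Under these assumptions, every character that needs three UTF-8 bytes triples in size when %-escaped, which is the overhead being questioned here.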