Re: [idn] FACE: Friendly ASCII-Compatible Encoding



Great. :-) It would be better if you could write this up as an I-D and submit
it as a WG doc.

-James Seng

"Adam M. Costello" wrote:
> 
> Please forgive me for jumping in with no background, but I just stumbled
> across this working group's web page, found some of the internet drafts
> interesting, and whipped up this idea, which you all may or may not have
> any use for.
> 
> AMC
> 
> Friendly ASCII-Compatible Encoding (FACE)
> version 0.0.0 (2000-Sep-04-Mon)
> Adam M. Costello <amc@cs.berkeley.edu>
> 
> Goals:
> 
>  1) To encode Unicode text as an ASCII string in such a way that
>     substrings that were already ASCII to begin with remain visible, for
>     the benefit of users whose software does not understand the Unicode
>     text.
> 
>  2) To achieve reasonable efficiency for non-ASCII characters.
> 
>  3) To require only the characters [A-Z0-9-] (like DNS labels).
> 
>  4) To be simple to describe and implement.
> 
> Notation:  Let the symbol # denote any of the characters from the set
> [0-9A-V], which represent quintet values in that order:
> 
>     "0" =  0 = 00000
>     "1" =  1 = 00001
>     ...
>     "9" =  9 = 01001
>     "A" = 10 = 01010
>     "B" = 11 = 01011
>     ...
>     "V" = 31 = 11111
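
As a rough illustration of the notation above, the quintet table can be written as a 32-character string indexed by value. This is only a sketch, not part of the proposal; the names ALPHABET, quintet_char, and quintet_value are invented here.

```python
# Digits 0-9 followed by letters A-V: 32 symbols, listed in value order.
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUV"


def quintet_char(value):
    """Return the symbol for a quintet value in the range 0..31."""
    if not 0 <= value < 32:
        raise ValueError("quintet value out of range")
    return ALPHABET[value]


def quintet_value(ch):
    """Return the 0..31 value of a symbol, accepting either letter case."""
    return ALPHABET.index(ch.upper())  # raises ValueError for other characters
```

For instance, quintet_char(10) gives "A" and quintet_value("V") gives 31, matching the table.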
> 
> To encode a sequence of Unicode characters as a sequence of ASCII
> characters:
> 
>     A maximal nonempty subsequence of ASCII characters is encoded
>     literally, except that any instances of "-" are replaced by "--". If
>     the result does not begin with "-", then "-" is prepended.  If the
>     result does not end with "-", then "-" is appended, except at the
>     very end of the whole sequence.
> 
>     A Unicode character in the range [0x80, 0x3ff] is encoded as "##" in
>     base 32 (most significant quintet first).
> 
>     A Unicode character in the range [0x400, 0x7fff] is encoded as
>     "W###" in base 32.
> 
>     A Unicode character in the range [0x8000, 0xffff] is encoded as
>     "X###", where the base 32 number is the offset from 0x8000.
> 
>     A Unicode character in the range [0x10000, 0x10ffff] is encoded as
>     "Y####", where the base 32 number is the offset from 0x10000.
> 
>     There aren't ever supposed to be any Unicode characters beyond that
>     (because they couldn't be represented in UTF-16), but we still have
>     "Z" unused in case we need an escape hatch.
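
The encoding rules above could be sketched in Python roughly as follows. This is one reading of the description, not a reference implementation: the function names (base32, encode_char, face_encode) are made up for illustration, and the code follows the rules literally without trying to resolve cases the text leaves open.

```python
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUV"


def base32(value, width):
    """Render value as exactly `width` quintets, most significant first."""
    digits = []
    for _ in range(width):
        digits.append(ALPHABET[value & 31])
        value >>= 5
    return "".join(reversed(digits))


def encode_char(cp):
    """Encode a single non-ASCII code point."""
    if cp < 0x80:
        raise ValueError("ASCII characters are encoded as literal runs")
    if cp <= 0x3FF:
        return base32(cp, 2)
    if cp <= 0x7FFF:
        return "W" + base32(cp, 3)
    if cp <= 0xFFFF:
        return "X" + base32(cp - 0x8000, 3)
    if cp <= 0x10FFFF:
        return "Y" + base32(cp - 0x10000, 4)
    raise ValueError("code point beyond U+10FFFF")


def face_encode(text):
    out = []
    i, n = 0, len(text)
    while i < n:
        if ord(text[i]) < 0x80:
            # Collect the maximal run of ASCII characters starting here.
            j = i
            while j < n and ord(text[j]) < 0x80:
                j += 1
            run = text[i:j].replace("-", "--")
            if not run.startswith("-"):
                run = "-" + run
            # Append "-" unless the run already ends with one or ends the input.
            if not run.endswith("-") and j < n:
                run += "-"
            out.append(run)
            i = j
        else:
            out.append(encode_char(ord(text[i])))
            i += 1
    return "".join(out)


print(face_encode("champs-elys\u00e9e"))  # -champs--elys-79-e
```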
> 
> To decode a sequence of ASCII characters into a sequence of Unicode
> characters, make one pass from the beginning:
> 
>     Start in base-32 mode.
> 
>     In base-32 mode, decode the various sizes of base-32 numbers
>     depending on whether the first character is #, W, X, or Y.  Allow
>     both upper and lower case letters.
> 
>     In ASCII mode, all characters are literal except for "-".
> 
>     "--" encountered in either mode decodes as "-" and sets the decoder
>     to ASCII mode.
> 
>     A "-" followed by something other than "-" toggles between ASCII
>     mode and base-32 mode (and does not consume the character following
>     the "-").
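
A single-pass decoder along the lines just described might look like the sketch below. The names PREFIXES and face_decode are hypothetical, and the only validation is rejecting truncated input and characters outside the quintet alphabet; anything stricter would be an addition to the rules as posted.

```python
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUV"

# Prefix letter -> (number of quintets that follow, offset added to the value).
PREFIXES = {"W": (3, 0), "X": (3, 0x8000), "Y": (4, 0x10000)}


def face_decode(s):
    out = []
    ascii_mode = False  # the decoder starts in base-32 mode
    i = 0
    while i < len(s):
        ch = s[i]
        if ch == "-":
            if s[i + 1:i + 2] == "-":
                # "--" is a literal hyphen and forces ASCII mode.
                out.append("-")
                ascii_mode = True
                i += 2
            else:
                # A lone "-" toggles the mode; the next character is not consumed.
                ascii_mode = not ascii_mode
                i += 1
        elif ascii_mode:
            out.append(ch)
            i += 1
        else:
            upper = ch.upper()  # both letter cases are accepted
            if upper in PREFIXES:
                width, offset = PREFIXES[upper]
                i += 1
            else:
                width, offset = 2, 0
            digits = s[i:i + width].upper()
            if len(digits) < width:
                raise ValueError("truncated base-32 number")
            value = 0
            for d in digits:
                value = value * 32 + ALPHABET.index(d)  # ValueError if invalid
            out.append(chr(value + offset))
            i += width
    return "".join(out)


print(face_decode("-champs--elys-79-e") == "champs-elys\u00e9e")  # True
```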
> 
> Examples:
> 
>     Suppose the string we wish to encode is
>     "AMURONAMIE-with-super-monkeys", where AMURONAMIE refers to a
>     particular sequence of five Japanese characters, whose iso-2022-jp
>     encoding is:
> 
>         $B0B<<F`H~7C(B
> 
>     The corresponding Unicode values are:
> 
>         U+5B89 U+5BA4 U+5948 U+7F8E U+6075.
> 
>     The encoded string is:
> 
>         WMS9WMT4WMA8WVSEWO3L--with--super--monkeys
> 
>     The encoding of "champs-elysee", with an acute accent over the
>     second-last "e", is:
> 
>         -champs--elys-79-e
> 
>     Notice how the hyphens help humans pick out the readable ASCII parts
>     and ignore the base-32 gibberish.
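
To make the arithmetic behind the examples concrete, the snippet below splits a code point into 5-bit groups and maps each group to its symbol. It is only an illustration; the helper name quintets is invented here.

```python
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUV"


def quintets(value, width):
    """Split value into `width` 5-bit groups, most significant first."""
    return [(value >> (5 * k)) & 31 for k in reversed(range(width))]


# U+5B89 lies in [0x400, 0x7fff], so it takes the "W" prefix and 3 quintets.
q = quintets(0x5B89, 3)
print(q, "W" + "".join(ALPHABET[v] for v in q))  # [22, 28, 9] WMS9

# U+00E9 (e with acute accent) lies in [0x80, 0x3ff]: 2 quintets, no prefix.
q = quintets(0xE9, 2)
print(q, "".join(ALPHABET[v] for v in q))  # [7, 9] 79
```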
> 
> Use with DNS:
> 
>     It is recommended that a standard prefix (such as "u--") be chosen
>     for all domain labels that use this encoding, so that they can be
>     distinguished from ASCII labels, and so that they never begin with a
>     hyphen.  A 3-character prefix leaves room for fifteen 16-bit Unicode
>     characters.
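
The figure of fifteen characters follows from the 63-octet limit on a DNS label (RFC 1035): a 3-character prefix leaves 60 characters, and each 16-bit character costs four ("W###" or "X###"). A small sketch of that arithmetic, with the constant and function names chosen here purely for illustration:

```python
MAX_LABEL = 63   # maximum length of one DNS label, in octets (RFC 1035)
PREFIX = "u--"   # the example prefix suggested in the text


def capacity(chars_per_symbol, prefix=PREFIX):
    """How many encoded symbols of the given width fit after the prefix."""
    return (MAX_LABEL - len(prefix)) // chars_per_symbol


print(capacity(4))  # 15  (16-bit characters, "W###" or "X###")
print(capacity(2))  # 30  (characters in [0x80, 0x3ff], "##")
print(capacity(5))  # 12  (characters above 0xffff, "Y####")
```

These counts assume a label with no ASCII runs; each run also spends characters on its delimiting hyphens.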
> 
>     Hostnames are case insensitive, and that goes for the base-32 parts
>     as well as the ASCII parts.  However, since existing ASCII domain
>     names are usually stored in lower case, it is recommended that the
>     base-32 portions of encoded names be stored in upper case, to help
>     humans with old software distinguish the ASCII from the base-32.
>     Humans with new software that interprets the encoding will, of
>     course, see the Unicode characters rather than the base-32 encoding.
> 
> Acknowledgements:
> 
>     Some ideas for FACE were taken from UTF-5, RACE, and SACE.