[idn] FACE: Friendly ASCII-Compatible Encoding



Please forgive me for jumping in with no background, but I just stumbled
across this working group's web page, found some of the internet drafts
interesting, and whipped up this idea, which you all may or may not have
any use for.

AMC


Friendly ASCII-Compatible Encoding (FACE)
version 0.0.0 (2000-Sep-04-Mon)
Adam M. Costello <amc@cs.berkeley.edu>


Goals:

 1) To encode Unicode text as an ASCII string in such a way that
    substrings that were already ASCII to begin with remain visible, for
    the benefit of users whose software does not understand the Unicode
    text.

 2) To achieve reasonable efficiency for non-ASCII characters.

 3) To require only the characters [A-Z0-9-] (like DNS labels).

 4) To be simple to describe and implement.

Notation:  Let the symbol # denote any of the characters from the set
[0-9A-V], which represent quintet values in that order:

    "0" =  0 = 00000
    "1" =  1 = 00001
    ...
    "9" =  9 = 01001
    "A" = 10 = 01010
    "B" = 11 = 01011
    ...
    "V" = 31 = 11111
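
The digit table above can be sketched in Python as follows.  The names
DIGITS, to_base32, and from_base32 are made up for this illustration and
are not part of the proposal.

```python
# Quintet alphabet: a character's index in this string is its 5-bit value.
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUV"

def to_base32(value, width):
    """Return value as exactly `width` base-32 digits, most significant first."""
    if not 0 <= value < 32 ** width:
        raise ValueError("value does not fit in the requested width")
    out = []
    for shift in range(5 * (width - 1), -1, -5):
        out.append(DIGITS[(value >> shift) & 0x1F])
    return "".join(out)

def from_base32(text):
    """Inverse of to_base32; accepts upper or lower case ASCII digits."""
    value = 0
    for ch in text:
        # str.index raises ValueError for anything outside [0-9A-V]
        value = value * 32 + DIGITS.index(ch.upper())
    return value
```

For instance, to_base32(0xE9, 2) splits 233 into the quintets 7 and 9.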

To encode a sequence of Unicode characters as a sequence of ASCII
characters:

    A maximal nonempty subsequence of ASCII characters is encoded
    literally, except that any instances of "-" are replaced by "--". If
    the result does not begin with "-", then "-" is prepended.  If the
    result does not end with "-", then "-" is appended, except at the
    very end of the whole sequence.

    A Unicode character in the range [0x80, 0x3ff] is encoded as "##" in
    base 32 (most significant quintet first).

    A Unicode character in the range [0x400, 0x7fff] is encoded as
    "W###" in base 32.

    A Unicode character in the range [0x8000, 0xffff] is encoded as
    "X###", where the base 32 number is the offset from 0x8000.

    A Unicode character in the range [0x10000, 0x10ffff] is encoded as
    "Y####", where the base 32 number is the offset from 0x10000.

    There aren't ever supposed to be any Unicode characters beyond that
    (because they couldn't be represented in UTF-16), but we still have
    "Z" unused in case we need an escape hatch.
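
    The encoding rules above might be implemented along these lines.
    This is only a sketch:  face_encode and its helpers are hypothetical
    names, and the input is assumed to be a Python str of Unicode code
    points.

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUV"

def _b32(value, width):
    """Fixed-width base-32 digits, most significant quintet first."""
    return "".join(DIGITS[(value >> s) & 0x1F]
                   for s in range(5 * (width - 1), -1, -5))

def _encode_char(cp):
    """Encode one non-ASCII code point according to its range."""
    if 0x80 <= cp <= 0x3FF:
        return _b32(cp, 2)
    if 0x400 <= cp <= 0x7FFF:
        return "W" + _b32(cp, 3)
    if 0x8000 <= cp <= 0xFFFF:
        return "X" + _b32(cp - 0x8000, 3)
    if 0x10000 <= cp <= 0x10FFFF:
        return "Y" + _b32(cp - 0x10000, 4)
    raise ValueError("code point out of range: %#x" % cp)

def face_encode(text):
    out = []
    i, n = 0, len(text)
    while i < n:
        if ord(text[i]) < 0x80:
            # Collect a maximal run of ASCII characters.
            j = i
            while j < n and ord(text[j]) < 0x80:
                j += 1
            run = text[i:j].replace("-", "--")
            if not run.startswith("-"):
                run = "-" + run
            # No trailing "-" at the very end of the whole sequence.
            if not run.endswith("-") and j < n:
                run += "-"
            out.append(run)
            i = j
        else:
            out.append(_encode_char(ord(text[i])))
            i += 1
    return "".join(out)
```

    Calling face_encode("champs-elys\u00e9e") walks through one ASCII
    run, one two-digit group, and a final ASCII run.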

To decode a sequence of ASCII characters into a sequence of Unicode
characters, make one pass from the beginning:

    Start in base-32 mode.

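    In base-32 mode, read one group at a time, its size determined by
    its first character:  a # digit begins a two-digit group, W and X
    are each followed by three digits, and Y is followed by four.  Add
    the offset 0x8000 for X and 0x10000 for Y.  Allow both upper and
    lower case letters.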

    In ASCII mode, all characters are literal except for "-".

    "--" encountered in either mode decodes as "-" and sets the decoder
    to ASCII mode.

    A "-" followed by something other than "-" toggles between ASCII
    mode and base-32 mode (and does not consume the character following
    the "-").
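
    A one-pass decoder following these rules could look like the sketch
    below.  The name face_decode is hypothetical.  Two things are
    assumptions of mine rather than part of the rules above:  malformed
    groups raise ValueError, and a lone "-" at the very end is treated
    as a toggle.

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUV"

# Digit values, accepting both upper and lower case.
VALUE = {c: v for v, c in enumerate(DIGITS)}
VALUE.update({c.lower(): v for c, v in list(VALUE.items())})

# Prefix -> (number of digits that follow, offset added to the number).
FORMS = {"W": (3, 0), "X": (3, 0x8000), "Y": (4, 0x10000)}
FORMS.update({k.lower(): v for k, v in list(FORMS.items())})

def face_decode(encoded):
    out = []
    ascii_mode = False          # the decoder starts in base-32 mode
    i, n = 0, len(encoded)
    while i < n:
        ch = encoded[i]
        if ch == "-":
            if i + 1 < n and encoded[i + 1] == "-":
                out.append("-")             # "--" is a literal hyphen
                ascii_mode = True
                i += 2
            else:
                ascii_mode = not ascii_mode  # toggle; next char not consumed
                i += 1
        elif ascii_mode:
            out.append(ch)
            i += 1
        else:
            if ch in FORMS:
                width, offset = FORMS[ch]
                start = i + 1
            else:
                width, offset, start = 2, 0, i
            digits = encoded[start:start + width]
            if len(digits) != width or any(d not in VALUE for d in digits):
                raise ValueError("malformed base-32 group at index %d" % i)
            value = 0
            for d in digits:
                value = value * 32 + VALUE[d]
            out.append(chr(value + offset))
            i = start + width
    return "".join(out)
```

    For example, face_decode("-champs--elys-79-e") toggles into ASCII
    mode at the first hyphen, back into base-32 mode before "79", and
    into ASCII mode again for the final "e".  One caveat:  read
    literally, the rules do not appear to round-trip an ASCII run that
    ends in "-" and is immediately followed by a non-ASCII character,
    so this sketch does not try to handle that case.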

Examples:

    Suppose the string we wish to encode is
    "AMURONAMIE-with-super-monkeys", where AMURONAMIE refers to a
    particular sequence of five Japanese characters, whose iso-2022-jp
    encoding is:

        $B0B<<F`H~7C(B

    The corresponding Unicode values are:

        U+5B89 U+5BA4 U+5948 U+7F8E U+6075.

    The encoded string is:

        WMS9WMT4WMA8WVSEWO3L--with--super--monkeys

    The encoding of "champs-elysee", with an acute accent over the
    second-last "e", is:

        -champs--elys-79-e

    Notice how the hyphens help humans pick out the readable ASCII parts
    and ignore the base-32 gibberish.

Use with DNS:

    It is recommended that a standard prefix (such as "u--") be chosen
    for all domain labels that use this encoding, so that they can be
    distinguished from ASCII labels, and so that they never begin with a
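    hyphen.  Since a DNS label is limited to 63 characters, a 3-character
    prefix leaves room for fifteen 16-bit Unicode characters in the range
    [0x400, 0xffff], at four encoded characters each.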
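
    The capacity figure can be checked with a little arithmetic.  The
    sketch below assumes the usual 63-character limit on a DNS label
    and uses "u--" only because it is the example prefix given above;
    the name capacity is made up for illustration.

```python
MAX_LABEL = 63   # maximum length of a single DNS label
PREFIX = "u--"   # example prefix from the text, not a settled choice

def capacity(chars_per_symbol, prefix=PREFIX):
    """How many encoded characters of a given size fit after the prefix."""
    return (MAX_LABEL - len(prefix)) // chars_per_symbol

# "W###" and "X###" groups take four characters each.
print(capacity(4))
# "##" groups take two characters, "Y####" groups take five.
print(capacity(2), capacity(5))
```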

    Hostnames are case insensitive, and that goes for the base-32 parts
    as well as the ASCII parts.  However, since existing ASCII domain
    names are usually stored in lower case, it is recommended that the
    base-32 portions of encoded names be stored in upper case, to help
    humans with old software distinguish the ASCII from the base-32.
    Humans with new software that interprets the encoding will, of
    course, see the Unicode characters rather than the base-32 encoding.

Acknowledgements:

    Some ideas for FACE were taken from UTF-5, RACE, and SACE.