[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[idn] Re: FACE: Friendly ASCII-Compatible Encoding



Okay, forget that last idea.  The first time I read the SACE draft, I
didn't understand it.  Now I see the efficiency of keeping more state in
the encoder/decoder.  Below is a new version that encodes each non-ASCII
character as a signed difference from the previous non-ASCII character,
represented as a variable-length base-32 integer.

I suspect that this will have an efficiency similar to SACE (which
is much more efficient than UTF-5, and more efficient than RACE for
Japanese and Korean and strings containing outlier characters).

Preserving the readability of ASCII comes at no cost to text that
doesn't use ASCII.  It comes at a small cost to text that uses ASCII
sporadically, as is of course efficient for text that is mostly ASCII.

Is there a list of test-strings that I should encode to see how large
they get?

I'll turn this proposal into an internet draft if anyone requests it,
otherwise I'm content to drop it.  Or feel free to dissect it and use
the parts you like.

AMC


Friendly ASCII-Compatible Encoding (FACE)
version 0.2.0 (2000-Sep-04-Mon)
Adam M. Costello <amc@cs.berkeley.edu>


Goals:

 1) To achieve efficient encoding.

 2) To require only the characters [A-Z0-9-] (like DNS labels).

 3) To be simple to describe and implement.

 4) To encode Unicode text as an ASCII string in such a way that
    substrings that were already ASCII to begin with remain visible, for
    the benefit of users whose software does not understand the Unicode
    text.

Notation:  Let the symbol # denote any of the characters from the set
[2-9A-KMNP-Z], which represent quintet values in that order:

    "2" =  0 = 00000
    "3" =  1 = 00001
    "4" =  2 = 00010
    "5" =  3 = 00011
    "6" =  4 = 00100
    "7" =  5 = 00101
    "8" =  6 = 00110
    "9" =  7 = 00111
    "A" =  8 = 01000
    "B" =  9 = 01001
    "C" = 10 = 01010
    "D" = 11 = 01011
    "E" = 12 = 01100
    "F" = 13 = 01101
    "G" = 14 = 01110
    "H" = 15 = 01111
    "I" = 16 = 10000
    "J" = 17 = 10001
    "K" = 18 = 10010
    "M" = 19 = 10011
    "N" = 20 = 10100
    "P" = 21 = 10101
    "Q" = 22 = 10110
    "R" = 23 = 10111
    "S" = 24 = 11000
    "T" = 25 = 11001
    "U" = 26 = 11010
    "V" = 27 = 11011
    "W" = 28 = 11100
    "X" = 29 = 11101
    "Y" = 30 = 11110
    "Z" = 31 = 11111

The digits "0" and "1" and the letters "O" and "L" ("l") are not used,
to avoid transcription errors.

To encode a sequence of Unicode characters as a sequence of ASCII
characters:

    There are two modes, ASCII mode and base-32 mode.  Begin in base-32
    mode.  There is also a "previous non-ASCII character".  Initially it
    is character U+1A0.

    In base-32 mode non-ASCII characters are encoded as differences
    from the previous non-ASCII character.  The difference is first
    represented as a two's complement binary integer.  Then the most
    significant bits are discarded to bring the length down to one of
    the lengths in the table below.  As many bits are discarded as
    possible without changing the two's complement value, except that
    bit 31 may always be discarded.  The resulting bit string is then
    encoded as a sequence of two to seven characters, depending on its
    length, as follows:

        length  encoding
        ------  ---------------------------------------------------
           9         ## = 0xxxx xxxxx
          13        ### = 10xxx xxxxx xxxxx
          17       #### = 110xx xxxxx xxxxx xxxxx
          21      ##### = 1110x xxxxx xxxxx xxxxx xxxxx
          31    ####### = 1111x xxxxx xxxxx xxxxx xxxxx xxxxx xxxxx

    Bit-order and quintet order are both big-endian (most significant
    first).

    In ASCII mode, characters in the range U+0..U+7F, except for U+2D
    ("-"), are encoded literally as themselves.

    In either mode, character U+2D ("-") is encoded as "--".

    If the current mode is incapable of encoding the next character,
    output "-" and switch modes.

    The previous non-ASCII character is not forgotten during excursions
    into ASCII mode.

To decode a sequence of ASCII characters into a sequence of Unicode
characters:

    Start in base-32 mode, with a previous non-ASCII character of U+1A0.

    In base-32 mode, decode the various sizes of base-32 numbers as
    indicated in the table above (the format is implied by the first
    character).  Sign-extend the value to 32 bits and add it to the
    previous non-ASCII character.  If bit 31 is set in the result, clear
    it.  Output the result, and remember it for the next non-ASCII
    character.

    In ASCII mode, decode all characters as themselves, except for 2D
    ("-").

    In either mode, decode "--" as "-".

    In either mode, "-" not followed by "-" switches modes.  This
    requires one character lookahead.

    The previous non-ASCII character is not forgotten during excursions
    into ASCII mode.

Examples:

    Suppose the string we wish to encode is
    "AMURONAMIE-with-super-monkeys", where AMURONAMIE refers to a
    particular sequence of five Japanese characters, whose iso-2022-jp
    encoding is:

        $B0B<<F`H~7C(B

    The corresponding Unicode values are:

        U+5B89 U+5BA4 U+5948 U+7F8E U+6075.

    The 32-bit differences are:

        0000 0000 0000 0000 0101 1001 1110 1001
        0000 0000 0000 0000 0000 0000 0001 1011
        1111 1111 1111 1111 1111 1101 1010 0100
        0000 0000 0000 0000 0010 0110 0100 0110
        1111 1111 1111 1111 1110 0000 1110 0111

    The truncated bit strings are:

                          0 0101 1001 1110 1001
                                    0 0001 1011
                               1 1101 1010 0100
                          0 0010 0110 0100 0110
                          1 1110 0000 1110 0111

    The quintet encodings are:

        11000 10110 01111 01001
        00000 11011
        10111 01101 00100
        11000 01001 10010 00110
        11011 11000 00111 00111

    The final encoded string is:

        SQHB2VRF6SBK8VS99---with--super--monkeys

    The encoding of "champs-elysee", with an acute accent over the
    second-last "e", is:

        -champs--elys-CB-e

    Notice how the hyphens help humans pick out the readable ASCII parts
    and ignore the base-32 gibberish.

Use with DNS:

    It is recommended that a standard prefix (such as "u--") be chosen
    for all domain labels that use this encoding, so that they can
    be distinguished from ASCII labels, and so that they never begin
    with a hyphen.  A 3-character prefix leaves room for about thirty
    characters in a small script like Cyrillic or Hebrew, or about
    sixteen Han ideographs.

    Hostnames are case insensitive, and that goes for the base-32 parts
    as well as the ASCII parts.  However, since existing ASCII domain
    names are usually stored in lower case, it is recommended that the
    base-32 portions of encoded names be stored in upper case, to help
    humans with old software distinguish the ASCII from the base-32.
    Humans with new software that interprets the encoding will, of
    course, see the Unicode characters rather than the base-32 encoding.

Comparisons with other schemes:

    FACE probably has similar efficiency to SACE, but seems conceptually
    simpler.  FACE is a little more human-friendly because gibberish and
    ASCII are always separated by a hyphen.

    FACE and SACE are both more efficient and much more human-friendly
    than UTF-5.

    RACE is quite efficient for strings that are limited to Latin-1 plus
    one other cluster of 256 characters, but one deviation from this set
    causes the compression to break down for the entire string.  FACE
    and SACE are more tolerant of outliers, more efficient for Japanese
    and Korean (which use small clusters of syllabic characters in
    addition to thousands of Han ideographs), and more human-friendly.

References:

    FACE borrows heavily from UTF-5, SACE, and RACE.

    http://www.i-d-n.net/