[idn] Re: IDN requirements doc



Kent,

This is probably better discussed on the mailing list than among just a few
of us. I hope you don't mind if I cc: the IDN mailing list.

> Karlsson Kent - keka wrote:
>         Finally I've done the changes I promised months ago...
> (but maybe that was good, since UTR 17, from which I have
> quoted heavily, has changed since then).  I'm not sure I got
> the CM and TES bits right.  Would UTF-7 be a CM or a TES?
> Is iso-2022-jp also a CES?  Ken, Mark?
>
>         See the new text for 1.1 of the requirements document
> below, plus the changes to the references.  I also have some
> questions about some of the requirements:

Any comments from others?

> > [12.5] IDN MUST NOT return illegal code points in responses, SHOULD
> > reject queries with illegal codepoints. (one request to add; one request
> > to remove)
> 
>         Undefined here is "illegal code point"; which ones are illegal?
>         Should talk about malformed CEFs, not so much about "illegal" code
>         points.  Both UTF-8 and UTF-16 can be "malformed".  Still not
>         fully defined though.

The 'illegal' codepoints here do not refer to invalid or malformed UTF-8. They
refer to codepoints which we feel should not be part of, or be used in, a
hostname, for example, punctuation.

What this statement says is that, regardless of which codepoints we decide
are illegal, the protocol should not make sure restriction on the wire.
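
To make the distinction above concrete, here is a minimal sketch in Python.
It treats "malformed" as a property of the byte sequence and "illegal" as a
policy decision about code points. The policy used here (any punctuation
other than hyphen is disallowed) and both function names are assumptions for
illustration only; the working group has not defined the illegal set.

```python
import unicodedata
from typing import List


def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if the octets decode as UTF-8 (an encoding-level check)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def disallowed_code_points(label: str) -> List[str]:
    """List code points rejected by a hypothetical policy.

    Assumption: punctuation (Unicode general category P*) other than
    HYPHEN-MINUS is treated as illegal. This is not a decided rule.
    """
    return [
        f"U+{ord(ch):04X}"
        for ch in label
        if unicodedata.category(ch).startswith("P") and ch != "-"
    ]


# Well-formed UTF-8 that still contains a disallowed code point.
label_bytes = "a!b".encode("utf-8")
print(is_well_formed_utf8(label_bytes))                       # encoding check
print(disallowed_code_points(label_bytes.decode("utf-8")))    # policy check
```

The two checks are independent: a label can pass the first and fail the
second, which is the case requirement [12.5] is concerned with.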

> > [13] CES(s) chosen SHOULD NOT encode ASCII characters differently
> 
> I guess that should be:
> [13] Any TES(s) chosen SHOULD NOT encode ASCII characters differently

Correct. :-)
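
As a rough illustration of what [13] asks for, the sketch below checks whether
a transform leaves an all-ASCII name unchanged. The helper name
preserves_ascii and the sample name are assumptions; Quoted-Printable and
BASE64 are used only because they appear as TES examples in the quoted text,
not because either is a candidate for IDN.

```python
import base64
import quopri


def preserves_ascii(encode, sample: bytes) -> bool:
    """Return True if the transform maps this ASCII sample to itself."""
    return encode(sample) == sample


sample = b"www.example.com"

# Quoted-Printable passes these printable ASCII octets through as-is.
print(preserves_ascii(quopri.encodestring, sample))

# BASE64 re-encodes every octet, so the ASCII name is altered.
print(preserves_ascii(base64.b64encode, sample))
```

A single sample does not prove the property for every ASCII string, so a real
check would need to cover the full set of characters allowed in hostnames.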

-James Seng

>                         Kind regards
>                         /kent k
> 
> -------------------------------new 1.1 text-----------------------------
> 
> 1.1 Definitions and Conventions
> 
> The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
> "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
> document are to be interpreted as described in [RFC2119].
> 
> A language is a way that humans interact. In computerised form, a text
> in a written language can be expressed as a string of characters.
> The same set of characters can often be used for many written languages,
> and many written languages can be expressed using different scripts.
> The same characters MAY use somewhat different glyphs (shapes) for
> display of a text depending on the language being used.
> 
> A character is a member of a set of elements used for organization,
> control, or representation of textual data.
> 
> A graphic character is a character, other than a control
> function, that has a visual representation normally handwritten,
> printed, or displayed.
> 
> Characters mentioned in this document are identified by their position
> in the Unicode [UNICODE] character set.  This character set is also
> known as the UCS. The notation U+12AB, for example, indicates the
> character at position 12AB (hexadecimal) in the Unicode character set.
> Note that the use of this notation is not an indication of a requirement
> to use Unicode.
> 
> Examples quoted in this document should be considered as a method to
> further explain the meanings and principles adopted by the document. It
> is not a requirement for the protocol to satisfy the examples.
> 
> Unicode Technical Report 17 [UTR17] defines a character encoding
> model in several levels (much of this is quoted from UTR 17 [UTR17]):
> 
> 1. An abstract character repertoire (ACR) is defined as the set of
>    abstract characters to be encoded, normally a familiar alphabet
>    or symbol set. The word abstract just means that these objects
>    are defined by convention (such as the 26 letters of the English
>    alphabet, uppercase and lowercase forms). Examples: the ASCII
>    repertoire, the Latin-15 repertoire, the JIS X 0208 repertoire,
>    the UCS repertoire (of a particular version).
> 
> 2. A coded character set (CCS) is defined to be a mapping from a
>    set of abstract characters to the set of non-negative integers.
>    This range of integers need not be contiguous. An abstract
>    character is defined to be in a coded character set if the coded
>    character set maps from it to an integer. That integer is said
>    to be the code point for the abstract character. That abstract
>    character is then an encoded character. Examples: ASCII, Latin-15,
>    JIS X 0208, the UCS.
> 
> 3. A character encoding form (CEF) is a mapping from the set of integers
>    used in a CCS to the set of sequences of code units. A code unit
>    is an integer occupying a specified binary width in a computer
>    architecture, such as a septet, an octet, or a 16-bit unit. The
>    encoding form enables character representation as actual data in
>    a computer. The sequences of code units do not necessarily have the
>    same length. Examples: ASCII, Latin-15, Shift-JIS, UTF-16, UTF-8.
> 
> 4. A character encoding scheme (CES) is a mapping of code units into
>    serialized octet sequences. Character encoding schemes are relevant
>    to the issue of cross-platform persistent data involving code units
>    wider than a byte, where byte-swapping may be required to put data
>    into the byte polarity canonical for a particular platform.
> 
>    The CES may involve two or more CCS's, and may include code units
>    (e.g. single shifts, SI/SO, or escape sequences) that are not part
>    of the CCS per se, but which are defined by the character encoding
>    architecture and which may require an external registry of particular
>    values (as for the ISO 2022 escape sequences). In such a case, the
>    CES is called a compound CES. (A CES that only involves a single
>    CCS is called a simple CES.)
> 
>    Examples: ASCII, Latin-15, Shift-JIS, UTF-16BE, UTF-16LE, UTF-8.
> 
> 5. The mapping from an abstract character repertoire (ACR) to a serialised
>    sequence of octets is called a Character Map (CM). A simple character
>    map thus implicitly includes a CCS, a CEF, and a CES, mapping from
>    abstract characters to code units to octets. A compound character
>    map includes a compound CES, and thus includes more than one CCS
>    and CEF. In that case, the abstract character repertoire for the
>    character map is the union of the repertoires covered by the coded
>    character sets involved.
> 
>    Character Maps are the things that in the IAB architecture get IANA
>    charset identifiers. A sequence of encoded characters must be
>    unambiguously mapped onto a sequence of octets by the charset. The
>    charset must be specified in all instances, as in Internet
>    protocols, where textual content is treated as an ordered sequence
>    of octets, and where the textual content must be reconstructible
>    from that sequence of octets.  Charset names are registered by the
>    IANA according to procedures documented in [RFC2278]. In many cases,
>    the same name is used for both a character map and for a character
>    encoding scheme, such as UTF-16BE. Typically this is done for simple
>    character maps when such usage is clear from context.
> 
> 6. A transfer encoding syntax (TES) is a reversible transform of encoded
>    data which may (or may not) include textual data represented in
>    one or more character encoding schemes.  Examples: 8bit,
>    Quoted-Printable, BASE64, UTF-7 (defunct), UTF-5 [...], RACE [...].
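
The levels defined above can be traced for a single character. The following
is a minimal sketch in Python, assuming U+00E9 as the example character and
BASE64 over UTF-8 as the TES step; the helper utf16_code_units is a name
introduced here for illustration and is not taken from UTR 17.

```python
import base64
import struct


def utf16_code_units(text: str) -> list:
    """Return the UTF-16 code units (16-bit integers) for the text."""
    raw = text.encode("utf-16-be")
    return list(struct.unpack(f">{len(raw) // 2}H", raw))


ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

# CCS: abstract character -> code point (a non-negative integer).
code_point = ord(ch)

# CEF: code point -> sequence of code units (16-bit units for UTF-16).
units16 = utf16_code_units(ch)

# CES: code units -> serialized octets; byte order matters for UTF-16.
be = ch.encode("utf-16-be")
le = ch.encode("utf-16-le")

# TES: a reversible transform applied to already-encoded octets.
tes = base64.b64encode(ch.encode("utf-8"))

print(f"U+{code_point:04X}", units16, be, le, tes)
```

The same code units serialize to different octet sequences under UTF-16BE and
UTF-16LE, which is why the CEF and CES levels are kept separate.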
> 
> ..............
> 
> [UNICODE]   The Unicode Consortium, "The Unicode Standard -- Version
>             3.0", ISBN 0-201-61633-5. Described at
>             http://www.unicode.org/unicode/standard/versions/
>             Unicode3.0.html; also ISO/IEC 10646-1:2000.
> 
> [US-ASCII]  Coded Character Set -- 7-bit American Standard Code for
>             Information Interchange, ANSI X3.4-1986; also: ISO/IEC
>             646 (IRV).
> 
> [UTR15]     "Unicode Normalization Forms", Unicode Technical Report
>             #15, http://www.unicode.org/unicode/reports/tr15/,
>             1999-11-11, M. Davis & M. Duerst, Unicode Consortium.
> 
> [UTR17]     "Character Encoding Model", Unicode Technical Report
>             #17, http://www.unicode.org/unicode/reports/tr17/,
>             2000-08-01, K. Whistler & M. Davis, Unicode Consortium.
> 
> [UTR21]     "Case Mappings", Unicode Technical Report #21,
>             http://www.unicode.org/unicode/reports/tr21/, 2000-05-20,
>             M. Davis, Unicode Consortium.  Approved status.